提交 0461d049 编写于 作者: T TeslaZhao

Merge branch 'develop' of https://github.com/paddlepaddle/Serving into compile_doc

此差异已折叠。
此差异已折叠。
# 如何使用Paddle Serving做ABTEST
(简体中文|[English](./ABTEST_IN_PADDLE_SERVING.md))
(简体中文|[English](./ABTest_EN.md))
该文档将会用一个基于IMDB数据集的文本分类任务的例子,介绍如何使用Paddle Serving搭建A/B Test框架,例中的Client端、Server端结构如下图所示。
<img src="images/abtest.png" style="zoom:33%;" />
<img src="../images/abtest.png" style="zoom:33%;" />
需要注意的是:A/B Test只适用于RPC模式,不适用于WEB模式。
......@@ -24,13 +24,13 @@ pip install Shapely
````
您可以直接运行下面的命令来处理数据。
[python abtest_get_data.py](../python/examples/imdb/abtest_get_data.py)
[python abtest_get_data.py](../../examples/C++/imdb/abtest_get_data.py)
文件中的Python代码将处理`test_data/part-0`的数据,并将处理后的数据生成并写入`processed.data`文件中。
### 启动Server端
这里采用[Docker方式](RUN_IN_DOCKER_CN.md)启动Server端服务。
这里采用[Docker方式](../RUN_IN_DOCKER_CN.md)启动Server端服务。
首先启动BOW Server,该服务启用`8000`端口:
......@@ -62,7 +62,7 @@ exit
您可以直接使用下面的命令,进行ABTEST预测。
[python abtest_client.py](../python/examples/imdb/abtest_client.py)
[python abtest_client.py](../../examples/C++/imdb/abtest_client.py)
```python
from paddle_serving_client import Client
......
# ABTEST in Paddle Serving
([简体中文](./ABTEST_IN_PADDLE_SERVING_CN.md)|English)
([简体中文](./ABTest_CN.md)|English)
This document will use an example of text classification task based on IMDB dataset to show how to build a A/B Test framework using Paddle Serving. The structure relationship between the client and servers in the example is shown in the figure below.
<img src="images/abtest.png" style="zoom:25%;" />
<img src="../images/abtest.png" style="zoom:25%;" />
Note that: A/B Test is only applicable to RPC mode, not web mode.
......@@ -25,13 +25,13 @@ pip install Shapely
You can directly run the following command to process the data.
[python abtest_get_data.py](../python/examples/imdb/abtest_get_data.py)
[python abtest_get_data.py](../../examples/C++/imdb/abtest_get_data.py)
The Python code in the file will process the data `test_data/part-0` and write to the `processed.data` file.
### Start Server
Here, we [use docker](RUN_IN_DOCKER.md) to start the server-side service.
Here, we [use docker](../RUN_IN_DOCKER.md) to start the server-side service.
First, start the BOW server, which enables the `8000` port:
......@@ -63,7 +63,7 @@ Before running, use `pip install paddle-serving-client` to install the paddle-se
You can directly use the following command to make abtest prediction.
[python abtest_client.py](../python/examples/imdb/abtest_client.py)
[python abtest_client.py](../../examples/C++/imdb/abtest_client.py)
[//file]:#abtest_client.py
``` python
......
# C++ Serving vs TensorFlow Serving 性能对比
# 1. 测试环境和说明
1) GPU型号:Tesla P4(7611 Mib)
2) Cuda版本:11.0
3) 模型:ResNet_v2_50
4) 为了测试异步合并batch的效果,测试数据中batch=1
5) [使用的测试代码和使用的数据集](../../examples/C++/PaddleClas/resnet_v2_50)
6) 下图中蓝色是C++ Serving,灰色为TF-Serving。
7) 折线图为QPS,数值越大表示每秒钟处理的请求数量越大,性能就越好。
8) 柱状图为平均处理时延,数值越大表示单个请求处理时间越长,性能就越差。
# 2. 同步模式
均使用同步模式,默认参数配置。
可以看出同步模型默认参数配置情况下,C++Serving QPS和平均时延指标均优于TF-Serving。
<p align="center">
<br>
<img src='../images/syn_benchmark.png'">
<br>
<p>
|client_num | model_name | qps(samples/s) | mean(ms) | model_name | qps(samples/s) | mean(ms) |
| --- | --- | --- | --- | --- | --- | --- |
| 10 | pd-serving | 111.336 | 89.787| tf-serving| 84.632| 118.13|
|30 |pd-serving |165.928 |180.761 |tf-serving |106.572 |281.473|
|50| pd-serving| 207.244| 241.211| tf-serving| 80.002 |624.959|
|70 |pd-serving |214.769 |325.894 |tf-serving |105.17 |665.561|
|100| pd-serving| 235.405| 424.759| tf-serving| 93.664 |1067.619|
|150 |pd-serving |239.114 |627.279 |tf-serving |86.312 |1737.848|
# 3. 异步模式
均使用异步模式,最大batch=32,异步线程数=2。
可以看出异步模式情况下,两者性能接近,但当Client端并发数达到70的时候,TF-Serving服务直接超时,而C++Serving能够正常返回结果。
同时,对比同步和异步模式可以看出,异步模式在请求batch数较小时,通过合并batch能够有效提高QPS和平均处理时延。
<p align="center">
<br>
<img src='../images/asyn_benchmark.png'">
<br>
<p>
|client_num | model_name | qps(samples/s) | mean(ms) | model_name | qps(samples/s) | mean(ms) |
| --- | --- | --- | --- | --- | --- | --- |
|10| pd-serving| 130.631| 76.502| tf-serving |172.64 |57.916|
|30| pd-serving| 201.062| 149.168| tf-serving| 241.669| 124.128|
|50| pd-serving| 286.01| 174.764| tf-serving |278.744 |179.367|
|70| pd-serving| 313.58| 223.187| tf-serving| 298.241| 234.7|
|100| pd-serving| 323.369| 309.208| tf-serving| 0| ∞|
|150| pd-serving| 328.248| 456.933| tf-serving| 0| ∞|
......@@ -75,9 +75,9 @@ service ImageClassifyService {
#### 2.2.2 示例配置
关于Serving端的配置的详细信息,可以参考[Serving端配置](SERVING_CONFIGURE.md)
关于Serving端的配置的详细信息,可以参考[Serving端配置](../Serving_Configure_CN.md)
以下配置文件将ReaderOP, ClassifyOP和WriteJsonOP串联成一个workflow (关于OP/workflow等概念,可参考[设计文档](C++DESIGN_CN.md))
以下配置文件将ReaderOP, ClassifyOP和WriteJsonOP串联成一个workflow (关于OP/workflow等概念,可参考[OP介绍](OP_CN.md)[DAG介绍](DAG_CN.md))
- 配置文件示例:
......@@ -310,7 +310,7 @@ api.thrd_finalize();
api.destroy();
```
具体实现可参考paddle Serving提供的例子sdk-cpp/demo/ximage.cpp
具体实现可参考C++Serving提供的例子。sdk-cpp/demo/ximage.cpp
### 3.3 链接
......@@ -392,4 +392,4 @@ predictors {
}
}
```
关于客户端的详细配置选项,可参考[CLIENT CONFIGURATION](CLIENT_CONFIGURE.md)
关于客户端的详细配置选项,可参考[CLIENT CONFIGURATION](Client_Configure_CN.md)
# Server端的计算图
(简体中文|[English](./SERVER_DAG.md))
(简体中文|[English](DAG_EN.md))
本文档显示了Server端上计算图的概念。 如何使用PaddleServing内置运算符定义计算图。 还显示了一些顺序执行逻辑的示例。
......@@ -9,7 +9,7 @@
深度神经网络通常在输入数据上有一些预处理步骤,而在模型推断分数上有一些后处理步骤。 由于深度学习框架现在非常灵活,因此可以在训练计算图之外进行预处理和后处理。 如果要在服务器端进行输入数据预处理和推理结果后处理,则必须在服务器上添加相应的计算逻辑。 此外,如果用户想在多个模型上使用相同的输入进行推理,则最好的方法是在仅提供一个客户端请求的情况下在服务器端同时进行推理,这样我们可以节省一些网络计算开销。 由于以上两个原因,自然而然地将有向无环图(DAG)视为服务器推理的主要计算方法。 DAG的一个示例如下:
<center>
<img src='images/server_dag.png' width = "450" height = "500" align="middle"/>
<img src='../images/server_dag.png' width = "450" height = "500" align="middle"/>
</center>
## 如何定义节点
......@@ -18,7 +18,7 @@
PaddleServing在框架中具有一些预定义的计算节点。 一种非常常用的计算图是简单的reader-infer-response模式,可以涵盖大多数单一模型推理方案。 示例图和相应的DAG定义代码如下。
<center>
<img src='images/simple_dag.png' width = "260" height = "370" align="middle"/>
<img src='../images/simple_dag.png' width = "260" height = "370" align="middle"/>
</center>
``` python
......@@ -47,10 +47,10 @@ python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --po
### 包含多个输入的节点
[Paddle Serving中的集成预测](./deprecated/MODEL_ENSEMBLE_IN_PADDLE_SERVING_CN.md)文档中给出了一个包含多个输入节点的样例,示意图和代码如下。
[Paddle Serving中的集成预测](Model_Ensemble_CN.md)文档中给出了一个包含多个输入节点的样例,示意图和代码如下。
<center>
<img src='images/complex_dag.png' width = "480" height = "400" align="middle"/>
<img src='../images/complex_dag.png' width = "480" height = "400" align="middle"/>
</center>
```python
......
# Computation Graph On Server
([简体中文](./SERVER_DAG_CN.md)|English)
([简体中文](./DAG_CN.md)|English)
This document shows the concept of computation graph on server. How to define computation graph with PaddleServing built-in operators. Examples for some sequential execution logics are shown as well.
......@@ -9,7 +9,7 @@ This document shows the concept of computation graph on server. How to define co
Deep neural nets often have some preprocessing steps on input data, and postprocessing steps on model inference scores. Since deep learning frameworks are now very flexible, it is possible to do preprocessing and postprocessing outside the training computation graph. If we want to do input data preprocessing and inference result postprocess on server side, we have to add the corresponding computation logics on server. Moreover, if a user wants to do inference with the same inputs on more than one model, the best way is to do the inference concurrently on server side given only one client request so that we can save some network computation overhead. For the above two reasons, it is naturally to think of a Directed Acyclic Graph(DAG) as the main computation method for server inference. One example of DAG is as follows:
<center>
<img src='images/server_dag.png' width = "450" height = "500" align="middle"/>
<img src='../images/server_dag.png' width = "450" height = "500" align="middle"/>
</center>
## How to define Node
......@@ -19,7 +19,7 @@ Deep neural nets often have some preprocessing steps on input data, and postproc
PaddleServing has some predefined Computation Node in the framework. A very commonly used Computation Graph is the simple reader-inference-response mode that can cover most of the single model inference scenarios. A example graph and the corresponding DAG definition code is as follows.
<center>
<img src='images/simple_dag.png' width = "260" height = "370" align="middle"/>
<img src='../images/simple_dag.png' width = "260" height = "370" align="middle"/>
</center>
``` python
......@@ -48,10 +48,10 @@ python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --po
### Nodes with multiple inputs
An example containing multiple input nodes is given in the [MODEL_ENSEMBLE_IN_PADDLE_SERVING](./deprecated/MODEL_ENSEMBLE_IN_PADDLE_SERVING.md). A example graph and the corresponding DAG definition code is as follows.
An example containing multiple input nodes is given in the [Model_Ensemble](Model_Ensemble_EN.md). A example graph and the corresponding DAG definition code is as follows.
<center>
<img src='images/complex_dag.png' width = "480" height = "400" align="middle"/>
<img src='../images/complex_dag.png' width = "480" height = "400" align="middle"/>
</center>
```python
......
# 加密模型预测
(简体中文|[English](ENCRYPTION.md))
(简体中文|[English](Encryption_EN.md))
Padle Serving提供了模型加密预测功能,本文档显示了详细信息。
......@@ -12,7 +12,7 @@ Padle Serving提供了模型加密预测功能,本文档显示了详细信息
普通的模型和参数可以理解为一个字符串,通过对其使用加密算法(参数是您的密钥),普通模型和参数就变成了一个加密的模型和参数。
我们提供了一个简单的演示来加密模型。请参阅[`python/examples/encryption/encrypt.py`](../python/examples/encryption/encrypt.py)
我们提供了一个简单的演示来加密模型。请参阅[examples/C++/encryption/encrypt.py](../../examples/C++/encryption/encrypt.py)
### 启动加密服务
......@@ -40,5 +40,4 @@ python -m paddle_serving_server.serve --model encrypt_server/ --port 9300 --use_
### 模型加密推理示例
模型加密推理示例, 请参见[`/python/examples/encryption/`](../python/examples/encryption/)
模型加密推理示例, 请参见[examples/C++/encryption/](../../examples/C++/encryption/)
# MOEDL ENCRYPTION INFERENCE
([简体中文](ENCRYPTION_CN.md)|English)
([简体中文](Encryption_CN.md)|English)
Paddle Serving provides model encryption inference, This document shows the details.
......@@ -12,7 +12,7 @@ We use symmetric encryption algorithm to encrypt the model. Symmetric encryption
Normal model and parameters can be understood as a string, by using the encryption algorithm (parameter is your key) on them, the normal model and parameters become an encrypted one.
We provide a simple demo to encrypt the model. See the [python/examples/encryption/encrypt.py](../python/examples/encryption/encrypt.py)
We provide a simple demo to encrypt the model. See the [examples/C++/encryption/encrypt.py](../../examples/C++/encryption/encrypt.py)
### Start Encryption Service
......@@ -40,5 +40,4 @@ Once the server gets the key, it uses the key to parse the model and starts the
### Example of Model Encryption Inference
Example of model encryption inference, See the [`/python/examples/encryption/`](../python/examples/encryption/)
Example of model encryption inference, See the [examples/C++/encryption/](../../examples/C++/encryption/)
此差异已折叠。
# Paddle Serving中的模型热加载
(简体中文|[English](HOT_LOADING_IN_SERVING.md))
(简体中文|[English](Hot_Loading_EN.md))
## 背景
......
# Hot Loading in Paddle Serving
([简体中文](HOT_LOADING_IN_SERVING_CN.md)|English)
([简体中文](Hot_Loading_CN.md)|English)
## Background
......
......@@ -14,12 +14,10 @@ BRPC-Server会尝试去JSON字符串中再去反序列化出Proto格式的数据
各种语言都提供了对ProtoBuf的支持,如果您对此比较熟悉,您也可以先将数据使用ProtoBuf序列化,再将序列化后的数据放入Http请求数据体中,然后指定Content-Type: application/proto,从而使用http/h2+protobuf二进制串访问服务。
实测随着数据量的增大,使用JSON方式的Http的数据量和反序列化的耗时会大幅度增加,推荐当您的数据量较大时,使用Http+protobuf方式,目前已经在Java和Python的Client端提供了支持。
**理论上讲,序列化/反序列化的性能从高到底排序为:protobuf > http/h2+protobuf > http**
## 示例
我们将以python/examples/fit_a_line为例,讲解如何通过Http访问Server端。
我们将以examples/C++/fit_a_line为例,讲解如何通过Http访问Server端。
### 获取模型
......
# Inference Protocols
C++ Serving基于BRPC进行服务构建,支持BRPC、GRPC、RESTful请求。请求数据为protobuf格式,详见`core/general-server/proto/general_model_service.proto`。本文介绍构建请求以及解析结果的方法。
## Tensor
Tensor可以装载多种类型的数据,是Request和Response的基础单元。Tensor的具体定义如下:
```protobuf
message Tensor {
// VarType: INT64
repeated int64 int64_data = 1;
// VarType: FP32
repeated float float_data = 2;
// VarType: INT32
repeated int32 int_data = 3;
// VarType: FP64
repeated double float64_data = 4;
// VarType: UINT32
repeated uint32 uint32_data = 5;
// VarType: BOOL
repeated bool bool_data = 6;
// (No support)VarType: COMPLEX64, 2x represents the real part, 2x+1
// represents the imaginary part
repeated float complex64_data = 7;
// (No support)VarType: COMPLEX128, 2x represents the real part, 2x+1
// represents the imaginary part
repeated double complex128_data = 8;
// VarType: STRING
repeated string data = 9;
// Element types:
// 0 => INT64
// 1 => FP32
// 2 => INT32
// 3 => FP64
// 4 => INT16
// 5 => FP16
// 6 => BF16
// 7 => UINT8
// 8 => INT8
// 9 => BOOL
// 10 => COMPLEX64
// 11 => COMPLEX128
// 20 => STRING
int32 elem_type = 10;
// Shape of the tensor, including batch dimensions.
repeated int32 shape = 11;
// Level of data(LOD), support variable length data, only for fetch tensor
// currently.
repeated int32 lod = 12;
// Correspond to the variable 'name' in the model description prototxt.
string name = 13;
// Correspond to the variable 'alias_name' in the model description prototxt.
string alias_name = 14; // get from the Model prototxt
// VarType: FP16, INT16, INT8, BF16, UINT8
bytes tensor_content = 15;
};
```
- elem_type:数据类型,当前支持FLOAT32, INT64, INT32, UINT8, INT8, FLOAT16
|elem_type|类型|
|---------|----|
|0|INT64|
|1|FLOAT32|
|2|INT32|
|3|FP64|
|4|INT16|
|5|FP16|
|6|BF16|
|7|UINT8|
|8|INT8|
- shape:数据维度
- lod:lod信息,LoD(Level-of-Detail) Tensor是Paddle的高级特性,是对Tensor的一种扩充,用于支持更自由的数据输入。详见[LOD](../LOD_CN.md)
- name/alias_name: 名称及别名,与模型配置对应
### 构建FLOAT32数据Tensor
```C
// 原始数据
std::vector<float> float_data;
Tensor *tensor = new Tensor;
// 设置维度,可以设置多维
for (uint32_t j = 0; j < float_shape.size(); ++j) {
tensor->add_shape(float_shape[j]);
}
// 设置LOD信息
for (uint32_t j = 0; j < float_lod.size(); ++j) {
tensor->add_lod(float_lod[j]);
}
// 设置类型、名称及别名
tensor->set_elem_type(1);
tensor->set_name(name);
tensor->set_alias_name(alias_name);
// 拷贝数据
int total_number = float_data.size();
tensor->mutable_float_data()->Resize(total_number, 0);
memcpy(tensor->mutable_float_data()->mutable_data(), float_datadata(), total_number * sizeof(float));
```
### 构建INT8数据Tensor
```C
// 原始数据
std::string string_data;
Tensor *tensor = new Tensor;
for (uint32_t j = 0; j < string_shape.size(); ++j) {
tensor->add_shape(string_shape[j]);
}
for (uint32_t j = 0; j < string_lod.size(); ++j) {
tensor->add_lod(string_lod[j]);
}
tensor->set_elem_type(8);
tensor->set_name(name);
tensor->set_alias_name(alias_name);
tensor->set_tensor_content(string_data);
```
## Request
Request为客户端需要发送的请求数据,其以Tensor为基础数据单元,并包含了额外的请求信息。定义如下:
```protobuf
message Request {
repeated Tensor tensor = 1;
repeated string fetch_var_names = 2;
bool profile_server = 3;
uint64 log_id = 4;
};
```
- fetch_vat_names: 需要获取的输出数据名称,在GeneralResponseOP会根据该列表进行过滤.请参考模型文件serving_client_conf.prototxt中的`fetch_var`字段下的`alias_name`
- profile_server: 调试参数,打开时会输出性能信息
- log_id: 请求ID
### 构建Request
当使用BRPC或GRPC进行请求时,使用protobuf形式数据,构建方式如下:
```C
Request req;
req.set_log_id(log_id);
for (auto &name : fetch_name) {
req.add_fetch_var_names(name);
}
// 添加Tensor
Tensor *tensor = req.add_tensor();
...
```
当使用RESTful请求时,可以使用JSON形式数据,具体格式如下:
```JSON
{"tensor":[{"float_data":[0.0137,-0.1136,0.2553,-0.0692,0.0582,-0.0727,-0.1583,-0.0584,0.6283,0.4919,0.1856,0.0795,-0.0332],"elem_type":1,"name":"x","alias_name":"x","shape":[1,13]}],"fetch_var_names":["price"],"log_id":0}
```
## Response
Response为服务端返回给客户端的结果,包含了Tensor数据、错误码、错误信息等。定义如下:
```protobuf
message Response {
repeated ModelOutput outputs = 1;
repeated int64 profile_time = 2;
// Error code
int32 err_no = 3;
// Error messages
string err_msg = 4;
};
message ModelOutput {
repeated Tensor tensor = 1;
string engine_name = 2;
}
```
- profile_time:当设置request->set_profile_server(true)时,会返回性能信息
- err_no:错误码,详见`core/predictor/common/constant.h`
- err_msg:错误信息,详见`core/predictor/common/constant.h`
- engine_name:输出节点名称
|err_no|err_msg|
|---------|----|
|0|OK|
|-5000|"Paddle Serving Framework Internal Error."|
|-5001|"Paddle Serving Memory Alloc Error."|
|-5002|"Paddle Serving Array Overflow Error."|
|-5100|"Paddle Serving Op Inference Error."|
### 读取Response数据
```C
uint32_t model_num = res.outputs_size();
for (uint32_t m_idx = 0; m_idx < model_num; ++m_idx) {
std::string engine_name = output.engine_name();
int idx = 0;
// 读取tensor维度
int shape_size = output.tensor(idx).shape_size();
for (int i = 0; i < shape_size; ++i) {
shape[i] = output.tensor(idx).shape(i);
}
// 读取LOD信息
int lod_size = output.tensor(idx).lod_size();
if (lod_size > 0) {
lod.resize(lod_size);
for (int i = 0; i < lod_size; ++i) {
lod[i] = output.tensor(idx).lod(i);
}
}
// 读取float数据
int size = output.tensor(idx).float_data_size();
float_data = std::vector<float>(
output.tensor(idx).float_data().begin(),
output.tensor(idx).float_data().begin() + size);
// 读取int8数据
string_data = output.tensor(idx).tensor_content();
}
```
\ No newline at end of file
# C++ Serving 简要介绍
## 适用场景
C++ Serving主打性能,如果您想搭建企业级的高性能线上推理服务,对高并发、低延时有一定的要求。C++ Serving框架可能会更适合您。目前无论是使用同步/异步模型,[C++ Serving与TensorFlow Serving性能对比](Benchmark_CN.md)均有优势。
C++ Serving网络框架使用brpc,核心执行引擎是基于C/C++编写,并且提供强大的工业级应用能力,包括模型热加载、模型加密部署、A/B Test、多模型组合、同步/异步模式、支持多语言多协议Client等功能。
## 1.网络框架(BRPC)
C++ Serving采用[brpc框架](https://github.com/apache/incubator-brpc)进行Client/Server端的通信。brpc是百度开源的一款PRC网络框架,具有高并发、低延时等特点,已经支持了包括百度在内上百万在线预估实例、上千个在线预估服务,稳定可靠。与gRPC网络框架相比,具有更低的延时,更高的并发性能,且底层支持<mark>**brpc/grpc/http+json/http+proto**</mark>等多种协议;缺点是跨操作系统平台能力不足。详细的框架性能开销见[C++ Serving框架性能测试](Frame_Performance_CN.md)
## 2.核心执行引擎
C++ Serving的核心执行引擎是一个有向无环图(也称作[DAG图](DAG_CN.md)),DAG图中的每个节点(在PaddleServing中,借用模型中operator算子的概念,将DAG图中的节点也称为[OP](OP_CN.md))代表预估服务的一个环节,DAG图支持多个OP按照串并联的方式进行组合,从而实现在一个服务中完成多个模型的预测整合最终产出结果。整个框架原理如下图所示,可分为Client Side 和 Server Side。
<p align="center">
<br>
<img src='../images/design_doc.png'">
<br>
<p>
### 2.1 Client Side
如图所示,Client端通过Pybind API接口将Request请求,按照ProtoBuf协议进行序列化后,经由BRPC网络框架Client端发送给Server端。此时,Client端等待Server端的返回数据并反序列化为正常的数据,之后将结果返给Client调用方。
### 2.2 Server Side
Server端接收到序列化的Request请求后,反序列化正常数据,进入图执行引擎,按照定义好的DAG图结构,执行每个OP环节的操作(每个OP环节的处理由用户定义,即可以只是单纯的数据处理,也可以是调用预测引擎用不同的模型对输入数据进行预测),当DAG图中所有OP环节均执行完成后,将结果数据序列化后返回给Client端。
### 2.3 通信数据格式ProtoBuf
Protocol Buffers(简称Protobuf) ,是Google出品的序列化框架,与开发语言无关,和平台无关,具有良好的可扩展性。Protobuf和所有的序列化框架一样,都可以用于数据存储、通讯协议。Protobuf支持生成代码的语言包括Java、Python、C++、Go、JavaNano、Ruby、C#。Portobuf的序列化的结果体积要比XML、JSON小很多,速度比XML、JSON快很多。
在C++ Serving中定义了Client Side 和 Server Side之间通信的ProtoBuf,详细的字段的介绍见《[C++ Serving ProtoBuf简介](Inference_Protocols_CN.md)》。
## 3.Server端特性
### 3.1 启动Server端
Server端的核心是一个由项目代码编译产生的名称为serving的二进制可执行文件,启动serving时需要用户指定一些参数(<mark>**例如,网络IP和Port端口、brpc线程数、使用哪个显卡、模型文件路径、模型是否开启trt、XPU推理、模型精度设置等等**</mark>),有些参数是通过命令行直接传入的,还有一些是写在指定的配置文件中配置文件中。
为了方便用户快速的启动C++ Serving的Server端,除了用户自行修改配置文件并通过命令行传参运行serving二进制可执行文件以外,我们也提供了另外一种通过python脚本启动的方式。python脚本启动本质上仍是运行serving二进制可执行文件,但python脚本中会自动完成两件事:1、配置文件的生成;2、根据需要配置的参数,生成命令行,通过命令行的方式,传入参数信息并运行serving二进制可执行文件。
更多详细说明和示例,请参考[C++ Serving 参数配置和启动的详细说明](../Serving_Configure_CN.md)
### 3.2 同步/异步模式
同步模式比较简单直接,适用于模型预测时间短,单个Request请求的batch已经比较大的情况。
同步模型下,Server端线程数N = 模型预测引擎数N = 同时处理Request请求数N,超发的Request请求需要等待当前线程处理结束后才能得到响应和处理。
<p align="center">
<img src='../images/syn_mode.png' width = "350" height = "300">
<p>
异步模型主要适用于模型支持多batch(最大batch数M可通过配置选项指定),单个Request请求的batch较小(batch << M),单次预测时间较长的情况。
异步模型下,Server端N个线程只负责接收Request请求,实际调用预测引擎是在异步框架的线程池中,异步框架的线程数可以由配置选项来指定。为了方便理解,我们假设每个Request请求的batch均为1,此时异步框架会尽可能多得从请求池中取n(n≤M)个Request并将其拼装为1个Request(batch=n),调用1次预测引擎,得到1个Response(batch = n),再将其对应拆分为n个Response作为返回结果。
<p align="center">
<img src='../images/asyn_mode.png'">
<p>
更多关于模式参数配置以及性能调优的介绍见《[C++ Serving性能调优](Performance_Tuning_CN.md)》。
### 3.3 多模型组合
当用户需要多个模型组合处理结果来作为一个服务接口对外暴露时,通常的解决办法是搭建内外两层服务,内层服务负责跑模型预测,外层服务负责串联和前后处理。当传输的数据量不大时,这样做的性能开销并不大,但当输出的数据量较大时,因为网络传输而带来的性能开销不容忽视(实测单次传输40MB数据时,RPC耗时为160-170ms)。
<p align="center">
<br>
<img src='../images/multi_model.png'>
<br>
<p>
C++ Serving框架支持[自定义DAG图](Model_Ensemble_CN.md)的方式来表示多模型之间串并联组合关系,也支持用户[使用C++开发自定义OP节点](OP_CN.md)。相比于使用内外两层服务来提供多模型组合处理的方式,由于节省了一次RPC网络传输的开销,把多模型在一个服务中处理性能上会有一定的提升,尤其当RPC通信传输的数据量较大时。
### 3.4 模型管理与热加载
C++ Serving的引擎支持模型管理功能,支持多种模型和模型不同版本的管理。为了保证在模型更换期间推理服务的可用性,需要在服务不中断的情况下对模型进行热加载。C++ Serving对该特性进行了支持,并提供了一个监控产出模型更新本地模型的工具,具体例子请参考《[C++ Serving中的模型热加载](Hot_Loading_CN.md)》。
### 3.5 模型加解密
C++ Serving采用对称加密算法对模型进行加密,在服务加载模型过程中在内存中解密。目前,提供基础的模型安全能力,并不保证模型绝对安全性,用户可根据我们的设计加以完善,实现更高级别的安全性。说明文档参考《[C++ Serving加密模型预测](Encryption_CN.md)》。
## 4.Client端特性
### 4.1 A/B Test
在对模型进行充分的离线评估后,通常需要进行在线A/B测试,来决定是否大规模上线服务。下图为使用Paddle Serving做A/B测试的基本结构,Client端做好相应的配置后,自动将流量分发给不同的Server,从而完成A/B测试。具体例子请参考《[如何使用Paddle Serving做ABTEST](ABTest_CN.md)》。
<p align="center">
<br>
<img src='../images/abtest.png' width = "345" height = "230">
<br>
<p>
### 4.2 多语言多协议Client
BRPC网络框架支持[多种底层通信协议](#1网络框架(BRPC)),即使用目前的C++ Serving框架的Server端,各种语言的Client端,甚至使用curl的方式,只要按照上述协议(具体支持的协议见[brpc官网](https://github.com/apache/incubator-brpc))封装数据并发送,Server端就能够接收、处理和返回结果。
对于支持的各种协议我们提供了部分的Client SDK示例供用户参考和使用,用户也可以根据自己的需求去开发新的Client SDK,也欢迎用户添加其他语言/协议(例如GRPC-Go、GRPC-C++ HTTP2-Go、HTTP2-Java等)Client SDK到我们的仓库供其他开发者借鉴和参考。
| 通信协议 | 速度 | 是否支持 | 是否提供Client SDK |
|-------------|-----|---------|-------------------|
| BRPC | 最快 | 支持 | [C++](../../core/general-client/README_CN.md)[Python(Pybind方式)](../../examples/C++/fit_a_line/README_CN.md) |
| HTTP2+Proto | 快 | 支持 | coming soon |
| GRPC | 快 | 支持 | [Java](../../java/README_CN.md)[Python](../../examples/C++/fit_a_line/README_CN.md) |
| HTTP1+Proto | 一般 | 支持 | [Java](../../java/README_CN.md)[Python](../../examples/C++/fit_a_line/README_CN.md) |
| HTTP1+Json | 慢 | 支持 | [Java](../../java/README_CN.md)[Python](../../examples/C++/fit_a_line/README_CN.md)[Curl](Http_Service_CN.md) |
# Paddle Serving中的集成预测
(简体中文|[English](MODEL_ENSEMBLE_IN_PADDLE_SERVING.md))
(简体中文|[English](Model_Ensemble_EN.md))
在一些场景中,可能使用多个相同输入的模型并行集成预测以获得更好的预测效果,Paddle Serving提供了这项功能。
......@@ -14,7 +14,7 @@
需要注意的是,目前只支持在同一个服务中使用多个相同格式输入输出的模型。在该例子中,CNN模型和BOW模型的输入输出格式是相同的。
样例中用到的代码保存在`python/examples/imdb`路径下:
样例中用到的代码保存在`examples/C++/imdb`路径下:
```shell
.
......
# Model Ensemble in Paddle Serving
([简体中文](MODEL_ENSEMBLE_IN_PADDLE_SERVING_CN.md)|English)
([简体中文](Model_Ensemble_CN.md)|English)
In some scenarios, multiple models with the same input may be used to predict in parallel and integrate predicted results for better prediction effect. Paddle Serving also supports this feature.
......@@ -14,7 +14,7 @@ In this example (see the figure below), the server side predict the bow and CNN
It should be noted that at present, only multiple models with the same format input and output in the same service are supported. In this example, the input and output formats of CNN and BOW model are the same.
The code used in the example is saved in the `python/examples/imdb` path:
The code used in the example is saved in the `examples/C++/imdb` path:
```shell
.
......
# 如何开发一个新的General Op?
(简体中文|[English](./NEW_OPERATOR.md))
(简体中文|[English](OP_EN.md))
在本文档中,我们主要集中于如何为Paddle Serving开发新的服务器端运算符。 在开始编写新运算符之前,让我们看一些示例代码以获得为服务器编写新运算符的基本思想。 我们假设您已经知道Paddle Serving服务器端的基本计算逻辑。 下面的代码您可以在 Serving代码库下的 `core/general-server/op` 目录查阅。
......
# How to write an general operator?
([简体中文](./NEW_OPERATOR_CN.md)|English)
([简体中文](OP_CN.md)|English)
In this document, we mainly focus on how to develop a new server side operator for PaddleServing. Before we start to write a new operator, let's look at some sample code to get the basic idea of writing a new operator for server. We assume you have known the basic computation logic on server side of PaddleServing, please reference to []() if you do not know much about it. The following code can be visited at `core/general-server/op` of Serving repo.
......
# C++ Serving性能分析与优化
# 1.背景知识介绍
1) 首先,应确保您知道C++ Serving常用的一些[功能特点](Introduction_CN.md)[C++ Serving 参数配置和启动的详细说明](../SERVING_CONFIGURE_CN.md。
2) 关于C++ Serving框架本身的性能分析和介绍,请参考[C++ Serving框架性能测试](Frame_Performance_CN.md)
3) 您需要对您使用的模型、机器环境、需要部署上线的业务有一些了解,例如,您使用CPU还是GPU进行预测;是否可以开启TRT进行加速;你的机器CPU是多少core的;您的业务包含几个模型;每个模型的输入和输出需要做些什么处理;您业务的最大线上流量是多少;您的模型支持的最大输入batch是多少等等.
# 2.Server线程数
首先,Server端线程数N并不是越大越好。众所周知,线程的切换涉及到用户空间和内核空间的切换,有一定的开销,当您的core数=1,而线程数为100000时,线程的频繁切换将带来不可忽视的性能开销。
在BRPC框架中,用户态协程worker数M >> 线程数N,用户态协程worker会工作在任意一个线程中,当RPC网络传输IO操作让出CPU资源时,BRPC会进行用户态协程worker的切换从而提高RPC框架的并发性。所以,极端情况下,若您的代码中除RPC通信外,没有阻塞线程的任何IO或网络操作,您的线程数完全可以 == 机器core数量,您不必担心N个线程都在进行RPC网络IO,而导致CPU利用率不高的问题。
Server端<mark>**线程数N**</mark>的设置需要结合三个因素来综合考虑:
## 2.1 最大并发请求量M
根据最大并发请求量来设置Server端线程数N,根据[C++ Serving框架性能测试](Frame_Performance_CN.md)中的数据来看,此时<mark>**线程数N应等于或略小于最大并发请求量M**</mark>,此时平均处理时延最小。
这也很容易理解,举个极端的例子,如果您每次只有1个请求,那此时Server端线程数设置1是最合理的,因为此时没有任何线程切换的开销。如果您设置线程数为任何大于1的数,必然就带来了线程切换的开销。
## 2.2 机器core数量C
根据机器core数量来设置Server端线程数N,众所周知,线程是CPU core调度执行的最小单元,若要在一个进程内充分使用所有的core,<mark>**线程数至少应该>=机器core数量C**</mark>,但具体线程数N/机器core数量C = ?需要您根据您的代码中网络、IO、内存和计算所占用的比例来决定,一般用户可以通过设置不同的线程数来测试CPU占用率来不断调整。
## 2.3 模型预测时间长短T
当您使用CPU进行预测时,预测阶段的计算是使用CPU完成的,此时,请参考前两者来进行设置线程数。
当您使用GPU进行预测时,情况有些不同,此时预测阶段的计算是由GPU完成的,此时CPU资源是空闲的,而预测操作是阻塞该线程的,类似于Sleep操作,此时若您的线程数==机器core数量,将没有其他可切换的线程从而导致必然有部分core是空闲的状态。具体来说,当模型预测时间较短时(<10ms),Server端线程数不宜过多(线程数=1~10倍core数量),否则线程切换带来的开销不可忽视。当模型预测时间较长时,Server端线程数应稍大一些(线程数=4~200倍core数量)。
# 3.异步模式
<mark>**大部分用户的Request请求batch数<<模型最大支持的Batch数**</mark>时,采用异步模式的收益是明显的。
异步模型的原理是将模型预测阶段与RPC线程脱离,模型单独开辟一个线程数可指定的线程池,RPC收到Request后将请求数据放入模型的线程池中的Task队列中,线程池中的线程从Task中取出数据合并Batch后进行预测,从而提升QPS,更多详细的介绍见[C++Serving功能简介](Introduction_CN.md),同步模式与异步模式的数据对比见[C++ Serving vs TensorFlow Serving 性能对比](Benchmark_CN.md),在上述测试的条件下,异步模型比同步模式快百分50%。
异步模式的开启有以下两种方式。
## 3.1 Python命令辅助启动C++Server
`python3 -m paddle_serving_server.serve`通过添加`--runtime_thread_num 2`指定该模型开启异步模式,其中2表示的是该模型异步线程池中的线程数为2,该数值默认值为0,此时表示不使用异步模式。`--runtime_thread_num`的具体数值设置根据模型、数据和显卡的可用显存来设置。
通过添加`--batch_infer_size 32`来设置模型最大允许Batch == 32 的输入,此参数只有在异步模型开启的状态下,才有效。
## 3.2 命令行+配置文件启动C++Server
此时通过修改`model_toolkit.prototxt`中的`runtime_thread_num`字段和`batch_infer_size`字段同样能达到上述效果。
# 4.多模型组合
<mark>**您的业务中需要调用多个模型进行预测**</mark>时,如果您追求极致的性能,您可以考虑使用C++Serving[自定义OP](OP_CN.md)[自定义DAG图](DAG_CN.md)的方式来实现上述需求。
## 4.1 优点
由于在一个服务中做模型的组合,节省了网络IO的时间和序列化反序列化的时间,尤其当数据量比较大时,收益十分明显(实测单次传输40MB数据时,RPC耗时为160-170ms)。
## 4.2 缺点
1) 需要使用C++去自定义OP和自定义DAG图去定义模型之间的组合关系。
2) 若多个模型之间需要前后处理,您也需要使用C++在OP之间去编写这部分代码。
3) 需要重新编译Server端代码。
## 4.3 示例
请参考[examples/C++/PaddleOCR/ocr/README_CN.md](../../examples/C++/PaddleOCR/ocr/README_CN.md)`C++ OCR Service服务章节`[Paddle Serving中的集成预测](Model_Ensemble_CN.md)中的例子。
# 如何编译PaddleServing
(简体中文|[English](./COMPILE.md))
(简体中文|[English](./Compile_EN.md))
## 编译环境设置
......
# How to compile PaddleServing
([简体中文](./COMPILE_CN.md)|English)
([简体中文](./Compile_CN.md)|English)
## Compilation environment requirements
......@@ -23,7 +23,7 @@
| libSM | 1.2.2 |
| libXrender | 0.9.10 |
It is recommended to use Docker for compilation. We have prepared the Paddle Serving compilation environment for you, see [this document](DOCKER_IMAGES.md).
It is recommended to use Docker for compilation. We have prepared the Paddle Serving compilation environment for you, see [this document](Docker_Images_EN.md).
## Get Code
......@@ -159,8 +159,7 @@ cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR/ \
-DSERVER=ON ..
make -j10
```
**Note:** After the compilation is successful, you need to set the `SERVING_BIN` path, see the following [Notes](https://github.com/PaddlePaddle/Serving/blob/develop/doc/COMPILE.md#Notes).
**Note:** After the compilation is successful, you need to set the `SERVING_BIN` path, see the following [Notes](Compile_EN.md#Notes).
## Compile Client
......
......@@ -68,7 +68,7 @@ Paddle Serving uses this [Git branching model](http://nvie.com/posts/a-successfu
1. Build and test
Users can build Paddle Serving natively on Linux, see the [BUILD steps](https://github.com/PaddlePaddle/Serving/blob/develop/doc/COMPILE.md).
Users can build Paddle Serving natively on Linux, see the [BUILD steps](Compile_EN.md).
1. Keep pulling
......
# 稀疏参数索引服务Cube单机版使用指南
(简体中文|[English](./CUBE_LOCAL.md))
(简体中文|[English](./Cube_Local_EN.md))
## 引言
在python/examples下有两个关于CTR的示例,他们分别是criteo_ctr, criteo_ctr_with_cube。前者是在训练时保存整个模型,包括稀疏参数。后者是将稀疏参数裁剪出来,保存成两个部分,一个是稀疏参数,另一个是稠密参数。由于在工业级的场景中,稀疏参数的规模非常大,达到10^9数量级。因此在一台机器上启动大规模稀疏参数预测是不实际的,因此我们引入百度多年来在稀疏参数索引领域的工业级产品Cube,提供分布式的稀疏参数服务。
<!--单机版Cube是分布式Cube的弱化版本,旨在方便开发者做实验和Demo时使用。如果有分布式稀疏参数服务的需求,请在读完此文档之后,继续阅读 [稀疏参数索引服务Cube使用指南](CUBE_LOCAL_CN.md)(正在建设中)。-->
<!--单机版Cube是分布式Cube的弱化版本,旨在方便开发者做实验和Demo时使用。如果有分布式稀疏参数服务的需求,请在读完此文档之后,继续阅读 [稀疏参数索引服务Cube使用指南](Cube_Local_CN.md)(正在建设中)。-->
本文档使用的都是未经过任何压缩算法处理的原始模型,如果有量化模型上线需求,请阅读[Cube稀疏参数索引量化存储使用指南](./CUBE_QUANT_CN.md)
本文档使用的都是未经过任何压缩算法处理的原始模型,如果有量化模型上线需求,请阅读[Cube稀疏参数索引量化存储使用指南](./Cube_Quant_CN.md)
## 示例
......
# Cube: Sparse Parameter Indexing Service (Local Mode)
([简体中文](./CUBE_LOCAL_CN.md)|English)
([简体中文](./Cube_Local_CN.md)|English)
## Overview
There are two examples on CTR under python / examples, they are criteo_ctr, criteo_ctr_with_cube. The former is to save the entire model during training, including sparse parameters. The latter is to cut out the sparse parameters and save them into two parts, one is the sparse parameter and the other is the dense parameter. Because the scale of sparse parameters is very large in industrial cases, reaching the order of 10 ^ 9. Therefore, it is not practical to start large-scale sparse parameter prediction on one machine. Therefore, we introduced Baidu's industrial-grade product Cube to provide the sparse parameter service for many years to provide distributed sparse parameter services.
The local mode of Cube is different from distributed Cube, which is designed to be convenient for developers to use in experiments and demos.
<!--If there is a demand for distributed sparse parameter service, please continue reading [Quantization Storage on Cube Sparse Parameter Indexing](./CUBE_QUANT.md) after reading this document (still developing).-->
<!--If there is a demand for distributed sparse parameter service, please continue reading [Quantization Storage on Cube Sparse Parameter Indexing](./Cube_Quant_EN.md) after reading this document (still developing).-->
This document uses the original model without any compression algorithm. If there is a need for a quantitative model to go online, please read the [Quantization Storage on Cube Sparse Parameter Indexing](./CUBE_QUANT.md)
This document uses the original model without any compression algorithm. If there is a need for a quantitative model to go online, please read the [Quantization Storage on Cube Sparse Parameter Indexing](./Cube_Quant_EN.md)
## Example
in directory python/example/criteo_ctr_with_cube, run
......
# Cube稀疏参数索引量化存储使用指南
(简体中文|[English](./CUBE_QUANT.md))
(简体中文|[English](./Cube_Quant_EN.md))
## 总体概览
......@@ -9,7 +9,7 @@
## 前序要求
请先读取 [稀疏参数索引服务Cube单机版使用指南](./CUBE_LOCAL_CN.md)
请先读取 [稀疏参数索引服务Cube单机版使用指南](./Cube_Local_CN.md)
## 组件介绍
......
# Quantization Storage on Cube Sparse Parameter Indexing
([简体中文](./CUBE_QUANT_CN.md)|English)
([简体中文](./Cube_Quant_CN.md)|English)
## Overview
......@@ -8,7 +8,7 @@ In our previous article, we know that the sparse parameter is a series of floati
## Precondition
Please Read [Cube: Sparse Parameter Indexing Service (Local Mode)](./CUBE_LOCAL_CN.md)
Please Read [Cube: Sparse Parameter Indexing Service (Local Mode)](./Cube_Local_EN.md)
## Components
......
......@@ -13,7 +13,7 @@
### 预备知识
- 需要会编译Paddle Serving,参见[编译文档](./COMPILE.md)
- 需要会编译Paddle Serving,参见[编译文档](./Compile_EN.md)
### 用法
......
# Docker 镜像
(简体中文|[English](DOCKER_IMAGES.md))
(简体中文|[English](Docker_Images_EN.md))
该文档维护了 Paddle Serving 提供的镜像列表。
......
# Docker Images
([简体中文](DOCKER_IMAGES_CN.md)|English)
([简体中文](Docker_Images_CN.md)|English)
This document maintains a list of docker images provided by Paddle Serving.
......
......@@ -142,7 +142,7 @@ make: *** [all] Error 2
#### Q:使用过程中出现CXXABI错误。
这个问题出现的原因是Python使用的gcc版本和Serving所需的gcc版本对不上。对于Docker用户,推荐使用[Docker容器](./RUN_IN_DOCKER_CN.md),由于Docker容器内的Python版本与Serving在发布前都做过适配,这样就不会出现类似的错误。如果是其他开发环境,首先需要确保开发环境中具备GCC 8.2,如果没有gcc 8.2,参考安装方式
这个问题出现的原因是Python使用的gcc版本和Serving所需的gcc版本对不上。对于Docker用户,推荐使用[Docker容器](./Run_In_Docker_CN.md),由于Docker容器内的Python版本与Serving在发布前都做过适配,这样就不会出现类似的错误。如果是其他开发环境,首先需要确保开发环境中具备GCC 8.2,如果没有gcc 8.2,参考安装方式
```bash
wget -q https://paddle-ci.gz.bcebos.com/gcc-8.2.0.tar.xz
......@@ -198,7 +198,7 @@ wget https://paddle-serving.bj.bcebos.com/others/centos_ssl.tar && \
(1)Cuda显卡驱动:文件名通常为 `libcuda.so.$DRIVER_VERSION` 例如驱动版本为440.10.15,文件名就是`libcuda.so.440.10.15`
(2)Cuda和Cudnn动态库:文件名通常为 `libcudart.so.$CUDA_VERSION`,和 `libcudnn.so.$CUDNN_VERSION`。例如Cuda9就是 `libcudart.so.9.0`,Cudnn7就是 `libcudnn.so.7`。Cuda和Cudnn与Serving的版本匹配参见[Serving所有镜像列表](DOCKER_IMAGES_CN.md#%E9%99%84%E5%BD%95%E6%89%80%E6%9C%89%E9%95%9C%E5%83%8F%E5%88%97%E8%A1%A8).
(2)Cuda和Cudnn动态库:文件名通常为 `libcudart.so.$CUDA_VERSION`,和 `libcudnn.so.$CUDNN_VERSION`。例如Cuda9就是 `libcudart.so.9.0`,Cudnn7就是 `libcudnn.so.7`。Cuda和Cudnn与Serving的版本匹配参见[Serving所有镜像列表](Docker_Images_CN.md#%E9%99%84%E5%BD%95%E6%89%80%E6%9C%89%E9%95%9C%E5%83%8F%E5%88%97%E8%A1%A8).
(3) Cuda10.1及更高版本需要TensorRT。安装TensorRT相关文件的脚本参考 [install_trt.sh](../tools/dockerfiles/build_scripts/install_trt.sh).
......@@ -232,15 +232,15 @@ InvalidArgumentError: Device id must be less than GPU count, but received id is:
#### Q: 目前Paddle Serving支持哪些镜像环境?
**A:** 目前(0.4.0)仅支持CentOS,具体列表查阅[这里](https://github.com/PaddlePaddle/Serving/blob/develop/doc/DOCKER_IMAGES.md)
**A:** 目前(0.4.0)仅支持CentOS,具体列表查阅[这里](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Docker_Images_CN.md)
#### Q: python编译的GCC版本与serving的版本不匹配
**A:**:1)使用[GPU docker](https://github.com/PaddlePaddle/Serving/blob/develop/doc/RUN_IN_DOCKER.md#gpunvidia-docker)解决环境问题;2)修改anaconda的虚拟环境下安装的python的gcc版本[改变python的GCC编译环境](https://www.jianshu.com/p/c498b3d86f77)
**A:**:1)使用[GPU docker](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Run_In_Docker_CN.md#gpunvidia-docker)解决环境问题;2)修改anaconda的虚拟环境下安装的python的gcc版本[改变python的GCC编译环境](https://www.jianshu.com/p/c498b3d86f77)
#### Q: paddle-serving是否支持本地离线安装
**A:** 支持离线部署,需要把一些相关的[依赖包](https://github.com/PaddlePaddle/Serving/blob/develop/doc/COMPILE.md)提前准备安装好
**A:** 支持离线部署,需要把一些相关的[依赖包](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Compile_CN.md)提前准备安装好
#### Q: Docker中启动server IP地址 127.0.0.1 与 0.0.0.0 差异
**A:** 您必须将容器的主进程设置为绑定到特殊的 0.0.0.0 “所有接口”地址,否则它将无法从容器外部访问。在Docker中 127.0.0.1 代表“这个容器”,而不是“这台机器”。如果您从容器建立到 127.0.0.1 的出站连接,它将返回到同一个容器;如果您将服务器绑定到 127.0.0.1,接收不到来自外部的连接。
......@@ -280,7 +280,7 @@ client.connect(["127.0.0.1:9393"])
#### Q: 如何使用多语言客户端
**A:** 多语言客户端要与多语言服务端配套使用。当前版本下(0.4.0),服务端需要将Server改为MultiLangServer(如果是以命令行启动的话只需要添加--use_multilang参数),Python客户端需要将Client改为MultiLangClient,同时去除load_client_config的过程。[Java客户端参考文档](https://github.com/PaddlePaddle/Serving/blob/develop/doc/JAVA_SDK_CN.md)
**A:** 多语言客户端要与多语言服务端配套使用。当前版本下(0.4.0),服务端需要将Server改为MultiLangServer(如果是以命令行启动的话只需要添加--use_multilang参数),Python客户端需要将Client改为MultiLangClient,同时去除load_client_config的过程。[Java客户端参考文档](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Java_SDK_CN.md)
#### Q: 如何在Windows下使用Paddle Serving
......
# 如何在Paddle Serving使用Go Client
(简体中文|[English](./IMDB_GO_CLIENT.md))
(简体中文|[English](./Imdb_GO_Client_EN.md))
本文档说明了如何将Go用作客户端语言。对于Paddle Serving中的Go客户端,提供了一个简单的客户端程序包https://github.com/PaddlePaddle/Serving/tree/develop/go/serving_client, 用户可以根据需要引用该程序包。这是一个基于IMDB数据集的情感分析任务的简单示例。
......
# How to use Go Client of Paddle Serving
([简体中文](./IMDB_GO_CLIENT_CN.md)|English)
([简体中文](./Imdb_GO_Client_CN.md)|English)
This document shows how to use Go as your client language. For Go client in Paddle Serving, a simple client package is provided https://github.com/PaddlePaddle/Serving/tree/develop/go/serving_client, a user can import this package as needed. Here is a simple example of sentiment analysis task based on IMDB dataset.
......
# 使用Docker安装Paddle Serving
(简体中文|[English](./Install_EN.md))
**强烈建议**您在**Docker内构建**Paddle Serving,请查看[如何在Docker中运行PaddleServing](Run_In_Docker_CN.md)。更多镜像请查看[Docker镜像列表](Docker_Images_CN.md)
**提示**:目前paddlepaddle 2.1版本的默认GPU环境是Cuda 10.2,因此GPU Docker的示例代码以Cuda 10.2为准。镜像和pip安装包也提供了其余GPU环境,用户如果使用其他环境,需要仔细甄别并选择合适的版本。
**提示**:本项目仅支持Python3.6/3.7/3.8,接下来所有的与Python/Pip相关的操作都需要选择正确的Python版本。
```
# 启动 CPU Docker
docker pull registry.baidubce.com/paddlepaddle/serving:0.6.2-devel
docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:0.6.2-devel bash
docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
```
```
# 启动 GPU Docker
nvidia-docker pull registry.baidubce.com/paddlepaddle/serving:0.6.2-cuda10.2-cudnn8-devel
nvidia-docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:0.6.2-cuda10.2-cudnn8-devel bash
nvidia-docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
```
安装所需的pip依赖
```
cd Serving
pip3 install -r python/requirements.txt
```
```shell
pip3 install paddle-serving-client==0.6.2
pip3 install paddle-serving-server==0.6.2 # CPU
pip3 install paddle-serving-app==0.6.2
pip3 install paddle-serving-server-gpu==0.6.2.post102 #GPU with CUDA10.2 + TensorRT7
# 其他GPU环境需要确认环境再选择执行哪一条
pip3 install paddle-serving-server-gpu==0.6.2.post101 # GPU with CUDA10.1 + TensorRT6
pip3 install paddle-serving-server-gpu==0.6.2.post11 # GPU with CUDA10.1 + TensorRT7
```
您可能需要使用国内镜像源(例如清华源, 在pip命令中添加`-i https://pypi.tuna.tsinghua.edu.cn/simple`)来加速下载。
如果需要使用develop分支编译的安装包,请从[最新安装包列表](Latest_Packages_CN.md)中获取下载地址进行下载,使用`pip install`命令进行安装。如果您想自行编译,请参照[Paddle Serving编译文档](Compile_CN.md)
paddle-serving-server和paddle-serving-server-gpu安装包支持Centos 6/7, Ubuntu 16/18和Windows 10。
paddle-serving-client和paddle-serving-app安装包支持Linux和Windows,其中paddle-serving-client仅支持python3.6/3.7/3.8。
**最新的0.6.2的版本,已经不支持Cuda 9.0和Cuda 10.0,Python已不支持2.7和3.5。**
推荐安装2.1.0及以上版本的paddle
```
# CPU环境请执行
pip3 install paddlepaddle==2.1.0
# GPU Cuda10.2环境请执行
pip3 install paddlepaddle-gpu==2.1.0
```
**注意**: 如果您的Cuda版本不是10.2,请勿直接执行上述命令,需要参考[Paddle官方文档-多版本whl包列表](https://www.paddlepaddle.org.cn/documentation/docs/zh/install/Tables.html#whl-release)
选择相应的GPU环境的url链接并进行安装,例如Cuda 10.1的Python3.6用户,请选择表格当中的`cp36-cp36m``cuda10.1-cudnn7-mkl-gcc8.2-avx-trt6.0.1.5`对应的url,复制下来并执行
```
pip3 install https://paddle-wheel.bj.bcebos.com/with-trt/2.1.0-gpu-cuda10.1-cudnn7-mkl-gcc8.2/paddlepaddle_gpu-2.1.0.post101-cp36-cp36m-linux_x86_64.whl
```
由于默认的`paddlepaddle-gpu==2.1.0`是Cuda 10.2,并没有联编TensorRT,因此如果需要和在`paddlepaddle-gpu`上使用TensorRT,需要在上述多版本whl包列表当中,找到`cuda10.2-cudnn8.0-trt7.1.3`,下载对应的Python版本
如果是其他环境和Python版本,请在表格中找到对应的链接并用pip安装。
对于**Windows 10 用户**,请参考文档[Windows平台使用Paddle Serving指导](Windows_Tutorial_CN.md)
# Install Paddle Serving with Docker
([简体中文](Install_CN.md)|English)
We **highly recommend** you to **run Paddle Serving in Docker**, please visit [Run in Docker](Run_In_Docker_EN.md). See the [document](Docker_Images_EN.md) for more docker images.
**Attention:**: Currently, the default GPU environment of paddlepaddle 2.1 is Cuda 10.2, so the sample code of GPU Docker is based on Cuda 10.2. We also provides docker images and whl packages for other GPU environments. If users use other environments, they need to carefully check and select the appropriate version.
**Attention:** the following so-called 'python' or 'pip' stands for one of Python 3.6/3.7/3.8.
```
# Run CPU Docker
docker pull registry.baidubce.com/paddlepaddle/serving:0.6.0-devel
docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:0.6.0-devel bash
docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
```
```
# Run GPU Docker
nvidia-docker pull registry.baidubce.com/paddlepaddle/serving:0.6.0-cuda10.2-cudnn8-devel
nvidia-docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:0.6.0-cuda10.2-cudnn8-devel bash
nvidia-docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
```
install python dependencies
```
cd Serving
pip install -r python/requirements.txt
```
```shell
pip install paddle-serving-client==0.6.0
pip install paddle-serving-server==0.6.0 # CPU
pip install paddle-serving-app==0.6.0
pip install paddle-serving-server-gpu==0.6.0.post102 #GPU with CUDA10.2 + TensorRT7
# DO NOT RUN ALL COMMANDS! check your GPU env and select the right one
pip install paddle-serving-server-gpu==0.6.0.post101 # GPU with CUDA10.1 + TensorRT6
pip install paddle-serving-server-gpu==0.6.0.post11 # GPU with CUDA10.1 + TensorRT7
You may need to use a domestic mirror source (in China, you can use the Tsinghua mirror source, add `-i https://pypi.tuna.tsinghua.edu.cn/simple` to pip command) to speed up the download.
If you need install modules compiled with develop branch, please download packages from [latest packages list](Latest_Package_CN.md) and install with `pip install` command. If you want to compile by yourself, please refer to [How to compile Paddle Serving?](Compile_EN.md)
Packages of paddle-serving-server and paddle-serving-server-gpu support Centos 6/7, Ubuntu 16/18, Windows 10.
Packages of paddle-serving-client and paddle-serving-app support Linux and Windows, but paddle-serving-client only support python3.6/3.7/3.8.
**For latest version, Cuda 9.0 or Cuda 10.0 are no longer supported, Python2.7/3.5 is no longer supported.**
Recommended to install paddle >= 2.1.0
```
# CPU users, please run
pip install paddlepaddle==2.1.0
# GPU Cuda10.2 please run
pip install paddlepaddle-gpu==2.1.0
```
**Note**: If your Cuda version is not 10.2, please do not execute the above commands directly, you need to refer to [Paddle official documentation-multi-version whl package list
](https://www.paddlepaddle.org.cn/documentation/docs/en/install/Tables_en.html#multi-version-whl-package-list-release)
Select the url link of the corresponding GPU environment and install it. For example, for Python3.6 users of Cuda 10.1, please select `cp36-cp36m` and
The url corresponding to `cuda10.1-cudnn7-mkl-gcc8.2-avx-trt6.0.1.5`, copy it and run
```
pip install https://paddle-wheel.bj.bcebos.com/with-trt/2.1.0-gpu-cuda10.1-cudnn7-mkl-gcc8.2/paddlepaddle_gpu-2.1.0.post101-cp36-cp36m-linux_x86_64.whl
```
the default `paddlepaddle-gpu==2.1.0` is Cuda 10.2 with no TensorRT. If you want to install PaddlePaddle with TensorRT. please also check the documentation-multi-version whl package list and find key word `cuda10.2-cudnn8.0-trt7.1.3`.
If it is other environment and Python version, please find the corresponding link in the table and install it with pip.
For **Windows Users**, please read the document [Paddle Serving for Windows Users](Windows_Tutorial_EN.md)
<h2 align="center">Quick Start Example</h2>
This quick start example is mainly for those users who already have a model to deploy, and we also provide a model that can be used for deployment. in case if you want to know how to complete the process from offline training to online service, please refer to the AiStudio tutorial above.
# Paddle Serving Client Java SDK
(简体中文|[English](JAVA_SDK.md))
(简体中文|[English](Java_SDK_EN.md))
Paddle Serving 提供了 Java SDK,支持 Client 端用 Java 语言进行预测,本文档说明了如何使用 Java SDK。
......
# Paddle Serving Client Java SDK
([简体中文](JAVA_SDK_CN.md)|English)
([简体中文](Java_SDK_CN.md)|English)
Paddle Serving provides Java SDK,which supports predict on the Client side with Java language. This document shows how to use the Java SDK.
......
# Lod字段说明
(简体中文|[English](LOD.md))
(简体中文|[English](LOD_EN.md))
## 概念
......
# Paddle Serving低精度部署
(简体中文|[English](./LOW_PRECISION_DEPLOYMENT.md))
## Paddle Serving低精度部署
(简体中文|[English](./Low_Precision_EN.md))
低精度部署, 在Intel CPU上支持int8、bfloat16模型,Nvidia TensorRT支持int8、float16模型。
## 通过PaddleSlim量化生成低精度模型
### 通过PaddleSlim量化生成低精度模型
详细见[PaddleSlim量化](https://paddleslim.readthedocs.io/zh_CN/latest/tutorials/quant/overview.html)
## 使用TensorRT int8加载PaddleSlim Int8量化模型进行部署
### 使用TensorRT int8加载PaddleSlim Int8量化模型进行部署
首先下载Resnet50 [PaddleSlim量化模型](https://paddle-inference-dist.bj.bcebos.com/inference_demo/python/resnet50/ResNet50_quant.tar.gz),并转换为Paddle Serving支持的部署模型格式。
```
wget https://paddle-inference-dist.bj.bcebos.com/inference_demo/python/resnet50/ResNet50_quant.tar.gz
......@@ -40,7 +41,7 @@ fetch_map = client.predict(feed={"image": img}, fetch=["score"])
print(fetch_map["score"].reshape(-1))
```
## 参考文档
### 参考文档
* [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim)
* PaddleInference Intel CPU部署量化模型[文档](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_x86_cpu_int8.html)
* PaddleInference NV GPU部署量化模型[文档](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_trt.html)
# Low-Precision Deployment for Paddle Serving
(English|[简体中文](./LOW_PRECISION_DEPLOYMENT_CN.md))
## Low-Precision Deployment for Paddle Serving
(English|[简体中文](./Low_Precision_CN.md))
Intel CPU supports int8 and bfloat16 models, NVIDIA TensorRT supports int8 and float16 models.
## Obtain the quantized model through PaddleSlim tool
### Obtain the quantized model through PaddleSlim tool
Train the low-precision models please refer to [PaddleSlim](https://paddleslim.readthedocs.io/zh_CN/latest/tutorials/quant/overview.html).
## Deploy the quantized model from PaddleSlim using Paddle Serving with Nvidia TensorRT int8 mode
### Deploy the quantized model from PaddleSlim using Paddle Serving with Nvidia TensorRT int8 mode
Firstly, download the [Resnet50 int8 model](https://paddle-inference-dist.bj.bcebos.com/inference_demo/python/resnet50/ResNet50_quant.tar.gz) and convert to Paddle Serving's saved model。
```
......@@ -41,7 +42,7 @@ fetch_map = client.predict(feed={"image": img}, fetch=["save_infer_model/scale_0
print(fetch_map["save_infer_model/scale_0.tmp_0"].reshape(-1))
```
## Reference
### Reference
* [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim)
* [Deploy the quantized model Using Paddle Inference on Intel CPU](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_x86_cpu_int8.html)
* [Deploy the quantized model Using Paddle Inference on Nvidia GPU](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_trt.html)
此差异已折叠。
# Model Zoo
(English|[简体中文](./Model_Zoo_CN.md)
This page lists model archives that are pre-trained and pre-packaged, ready to be served for inference with PaddleServing.
To propose a model for inclusion, please submit [pull request](https://github.com/PaddlePaddle/Serving/pulls)
Special thanks to the [Padddle wholechain](https://www.paddlepaddle.org.cn/wholechain) and [PaddleHub](https://www.paddlepaddle.org.cn/hub) whose Model Zoo and Model Examples were used in generating these model archives
| Model | Type | Framework | Download |
| --- | --- | --- | ---- |
| resnet_v2_50_imagenet | PaddleClas | [C++ Serving](../examples/C++/PaddleClas/resnet_v2_50)</br>[Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet_V2_50) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageClassification/resnet_v2_50_imagenet.tar.gz) | Pipeline Serving, C++ Serving|
| mobilenet_v2_imagenet | PaddleClas | [C++ Serving](../examples/C++/PaddleClas/mobilenet) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageClassification/mobilenet_v2_imagenet.tar.gz) |
| resnet50_vd | PaddleClas | [C++ Serving](../examples/C++/PaddleClas/imagenet)</br>[Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet50_vd) | [.tar.gz](https://paddle-serving.bj.bcebos.com/model/ResNet50_vd.tar) |
| ResNet50_vd_KL | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet50_vd_KL) | [.tar](https://paddle-serving.bj.bcebos.com/model/ResNet50_vd_KL.tar) |
| ResNet50_vd_FPGM | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet50_vd_FPGM) | [.tar](https://paddle-serving.bj.bcebos.com/model/ResNet50_vd_FPGM.tar) |
| ResNet50_vd_PACT | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet50_vd_PACT) | [.tar](https://paddle-serving.bj.bcebos.com/model/ResNet50_vd_PACT.tar) |
| ResNeXt101_vd_64x4d | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ResNeXt101_vd_64x4d) | [.tar](https://paddle-serving.bj.bcebos.com/model/ResNeXt101_vd_64x4d.tar) |
| DarkNet53 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/DarkNet53) | [.tar](https://paddle-serving.bj.bcebos.com/model/DarkNet53.tar) |
| MobileNetV1 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/MobileNetV1) | [.tar](https://paddle-serving.bj.bcebos.com/model/MobileNetV1.tar) |
| MobileNetV2 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/MobileNetV2) | [.tar](https://paddle-serving.bj.bcebos.com/model/MobileNetV2.tar) |
| MobileNetV3_large_x1_0 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/MobileNetV3_large_x1_0) | [.tar](https://paddle-serving.bj.bcebos.com/model/MobileNetV3_large_x1_0.tar) |
| HRNet_W18_C | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/HRNet_W18_C) | [.tar](https://paddle-serving.bj.bcebos.com/model/HRNet_W18_C.tar) |
| ShuffleNetV2_x1_0 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ShuffleNetV2_x1_0) | [.tar](https://paddle-serving.bj.bcebos.com/model/ShuffleNetV2_x1_0.tar) |
| bert_chinese_L-12_H-768_A-12 | PaddleNLP | [C++ Serving](../examples/C++/PaddleNLP/bert)</br>[Pipeline Serving](../examples/Pipeline/PaddleNLP/bert) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SemanticModel/bert_chinese_L-12_H-768_A-12.tar.gz) |
| senta_bilstm | PaddleNLP | [C++ Serving](../examples/C++/PaddleNLP/senta) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SentimentAnalysis/senta_bilstm.tar.gz) |C++ Serving|
| lac | PaddleNLP | [C++ Serving](../examples/C++/PaddleNLP/lac) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/LexicalAnalysis/lac.tar.gz) |
| transformer | PaddleNLP | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/machine_translation/transformer/deploy/serving/README.md) | [model](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/machine_translation/transformer) |
| criteo_ctr | PaddleRec | [C++ Serving](../examples/C++/PaddleRec/criteo_ctr) | [.tar.gz](https://paddle-serving.bj.bcebos.com/criteo_ctr_example/criteo_ctr_demo_model.tar.gz) |
| criteo_ctr_with_cube | PaddleRec | [C++ Serving](../examples/C++/PaddleRec/criteo_ctr_with_cube) | [.tar.gz](https://paddle-serving.bj.bcebos.com/unittest/ctr_cube_unittest.tar.gz) |
| wide&deep | PaddleRec | [C++ Serving](https://github.com/PaddlePaddle/PaddleRec/blob/release/2.1.0/doc/serving.md) | [model](https://github.com/PaddlePaddle/PaddleRec/blob/release/2.1.0/models/rank/wide_deep/README.md) |
| blazeface | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/blazeface) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ObjectDetection/blazeface.tar.gz) |C++ Serving|
| cascade_mask_rcnn_r50_vd_fpn_ssld_2x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/cascade_rcnn) | [.tar.gz](https://paddle-serving.bj.bcebos.com/pddet_demo/cascade_mask_rcnn_r50_vd_fpn_ssld_2x_coco_serving.tar.gz) |
| yolov4 | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/yolov4) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ObjectDetection/yolov4.tar.gz) |C++ Serving|
| faster_rcnn_hrnetv2p_w18_1x | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/faster_rcnn_hrnetv2p_w18_1x) | [.tar.gz](https://paddle-serving.bj.bcebos.com/pddet_demo/faster_rcnn_hrnetv2p_w18_1x.tar.gz) |
| fcos_dcn_r50_fpn_1x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/fcos_dcn_r50_fpn_1x_coco) | [.tar.gz](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/fcos_dcn_r50_fpn_1x_coco.tar) |
| ssd_vgg16_300_240e_voc | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/ssd_vgg16_300_240e_voc) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/ssd_vgg16_300_240e_voc.tar) |
| yolov3_darknet53_270e_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/yolov3_darknet53_270e_coco)</br>[Pipeline Serving](../examples/Pipeline/PaddleDetection/yolov3) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/yolov3_darknet53_270e_coco.tar) |
| faster_rcnn_r50_fpn_1x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/faster_rcnn_r50_fpn_1x_coco)</br>[Pipeline Serving](../examples/Pipeline/PaddleDetection/faster_rcnn) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/faster_rcnn_r50_fpn_1x_coco.tar) |
| ppyolo_r50vd_dcn_1x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/ppyolo_r50vd_dcn_1x_coco) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/ppyolo_r50vd_dcn_1x_coco.tar) |
| ppyolo_mbv3_large_coco | PaddleDetection | [Pipeline Serving](../examples/Pipeline/PaddleDetection/ppyolo_mbv3) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/ppyolo_mbv3_large_coco.tar) |
| ttfnet_darknet53_1x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/ttfnet_darknet53_1x_coco) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/ttfnet_darknet53_1x_coco.tar) |
| YOLOv3-DarkNet | PaddleDetection | [C++ Serving](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.3/deploy/serving) | [.pdparams](https://paddledet.bj.bcebos.com/models/yolov3_darknet53_270e_coco.pdparams)</br>[.yml](https://github.com/PaddlePaddle/PaddleDetection/blob/develop/configs/yolov3/yolov3_darknet53_270e_coco.yml) |
| ocr_rec | PaddleOCR | [C++ Serving](../examples/C++/PaddleOCR/ocr)</br>[Pipeline Serving](../examples/Pipeline/PaddleOCR/ocr) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/OCR/ocr_rec.tar.gz) |
| ocr_det | PaddleOCR | [C++ Serving](../examples/C++/PaddleOCR/ocr)</br>[Pipeline Serving](../examples/Pipeline/PaddleOCR/ocr) | [.tar.gz](https://paddle-serving.bj.bcebos.com/ocr/ocr_det.tar.gz) |
| ch_ppocr_mobile_v2.0_det | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_det_infer.tar)</br>[.yml](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/configs/det/ch_ppocr_v2.0/ch_det_mv3_db_v2.0.yml) |
| ch_ppocr_server_v2.0_det | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_server_v2.0_det_infer.tar)</br>[.yml](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/configs/det/ch_ppocr_v2.0/ch_det_res18_db_v2.0.yml) |
| ch_ppocr_mobile_v2.0_rec | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_rec_infer.tar)</br>[.yml](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/configs/rec/ch_ppocr_v2.0/rec_chinese_lite_train_v2.0.yml) |
| ch_ppocr_server_v2.0_rec | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_server_v2.0_rec_infer.tar)</br>[.yml](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/configs/rec/ch_ppocr_v2.0/rec_chinese_common_train_v2.0.yml) |
| ch_ppocr_mobile_v2.0 | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://github.com/PaddlePaddle/PaddleOCR) |
| ch_ppocr_server_v2.0 | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://github.com/PaddlePaddle/PaddleOCR) |
| deeplabv3 | PaddleSeg | [C++ Serving](../examples/C++/PaddleSeg/deeplabv3) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageSegmentation/deeplabv3.tar.gz) |
| unet | PaddleSeg | [C++ Serving](../examples/C++/PaddleSeg/unet_for_image_seg) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageSegmentation/unet.tar.gz) |
- Refer [example](../examples) for more details on above models.
- Refer [wholechain](https://www.paddlepaddle.org.cn/wholechain) for more pre-trained models supported by PaddleServing
# Pipeline Serving
(简体中文|[English](PIPELINE_SERVING.md))
- [架构设计](PIPELINE_SERVING_CN.md#1架构设计)
- [详细设计](PIPELINE_SERVING_CN.md#2详细设计)
- [典型示例](PIPELINE_SERVING_CN.md#3典型示例)
- [高阶用法](PIPELINE_SERVING_CN.md#4高阶用法)
- [日志追踪](PIPELINE_SERVING_CN.md#5日志追踪)
- [性能分析与优化](PIPELINE_SERVING_CN.md#6性能分析与优化)
(简体中文|[English](Pipeline_Design_EN.md))
- [架构设计](Pipeline_Design_CN.md#1架构设计)
- [详细设计](Pipeline_Design_CN.md#2详细设计)
- [典型示例](Pipeline_Design_CN.md#3典型示例)
- [高阶用法](Pipeline_Design_CN.md#4高阶用法)
- [日志追踪](Pipeline_Design_CN.md#5日志追踪)
- [性能分析与优化](Pipeline_Design_CN.md#6性能分析与优化)
在许多深度学习框架中,Serving通常用于单模型的一键部署。在AI工业大生产的背景下,端到端的深度学习模型当前还不能解决所有问题,多个深度学习模型配合起来使用还是解决现实问题的常规手段。但多模型应用设计复杂,为了降低开发和维护难度,同时保证服务的可用性,通常会采用串行或简单的并行方式,但一般这种情况下吞吐量仅达到可用状态,而且GPU利用率偏低。
......
# Pipeline Serving
([简体中文](PIPELINE_SERVING_CN.md)|English)
- [Architecture Design](PIPELINE_SERVING.md#1architecture-design)
- [Detailed Design](PIPELINE_SERVING.md#2detailed-design)
- [Classic Examples](PIPELINE_SERVING.md#3classic-examples)
- [Advanced Usages](PIPELINE_SERVING.md#4advanced-usages)
- [Log Tracing](PIPELINE_SERVING.md#5log-tracing)
- [Performance Analysis And Optimization](PIPELINE_SERVING.md#6performance-analysis-and-optimization)
([简体中文](Pipeline_Design_CN.md)|English)
- [Architecture Design](Pipeline_Design_EN.md#1architecture-design)
- [Detailed Design](Pipeline_Design_EN.md#2detailed-design)
- [Classic Examples](Pipeline_Design_EN.md#3classic-examples)
- [Advanced Usages](Pipeline_Design_EN.md#4advanced-usages)
- [Log Tracing](Pipeline_Design_EN.md#5log-tracing)
- [Performance Analysis And Optimization](Pipeline_Design_EN.md#6performance-analysis-and-optimization)
In many deep learning frameworks, Serving is usually used for the deployment of single model.but in the context of AI industrial, the end-to-end deep learning model can not solve all the problems at present. Usually, it is necessary to use multiple deep learning models to solve practical problems.However, the design of multi-model applications is complicated. In order to reduce the difficulty of development and maintenance, and to ensure the availability of services, serial or simple parallel methods are usually used. In general, the throughput only reaches the usable state and the GPU utilization rate is low.
......
## Paddle Serving 快速开始示例
([English](./Quick_Start_EN.md)|简体中文)
这个快速开始示例主要是为了给那些已经有一个要部署的模型的用户准备的,而且我们也提供了一个可以用来部署的模型。如果您想知道如何从离线训练到在线服务走完全流程,请参考前文的AiStudio教程。
<h3 align="center">波士顿房价预测</h3>
进入到Serving的git目录下,进入到`fit_a_line`例子
``` shell
cd Serving/python/examples/fit_a_line
sh get_data.sh
```
Paddle Serving 为用户提供了基于 HTTP 和 RPC 的服务
<h3 align="center">RPC服务</h3>
用户还可以使用`paddle_serving_server.serve`启动RPC服务。 尽管用户需要基于Paddle Serving的python客户端API进行一些开发,但是RPC服务通常比HTTP服务更快。需要指出的是这里我们没有指定`--name`
``` shell
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292
```
<center>
| Argument | Type | Default | Description |
| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
| `thread` | int | `2` | Number of brpc service thread |
| `op_num` | int[]| `0` | Thread Number for each model in asynchronous mode |
| `op_max_batch` | int[]| `32` | Batch Number for each model in asynchronous mode |
| `gpu_ids` | str[]| `"-1"` | Gpu card id for each model |
| `port` | int | `9292` | Exposed port of current service to users |
| `model` | str[]| `""` | Path of paddle model directory to be served |
| `mem_optim_off` | - | - | Disable memory / graphic memory optimization |
| `ir_optim` | bool | False | Enable analysis and optimization of calculation graph |
| `use_mkl` (Only for cpu version) | - | - | Run inference with MKL |
| `use_trt` (Only for trt version) | - | - | Run inference with TensorRT |
| `use_lite` (Only for Intel x86 CPU or ARM CPU) | - | - | Run PaddleLite inference |
| `use_xpu` | - | - | Run PaddleLite inference with Baidu Kunlun XPU |
| `precision` | str | FP32 | Precision Mode, support FP32, FP16, INT8 |
| `use_calib` | bool | False | Use TRT int8 calibration |
| `gpu_multi_stream` | bool | False | EnableGpuMultiStream to get larger QPS |
#### 异步模型的说明
异步模式适用于1、请求数量非常大的情况,2、多模型串联,想要分别指定每个模型的并发数的情况。
异步模式有助于提高Service服务的吞吐(QPS),但对于单次请求而言,时延会有少量增加。
异步模式中,每个模型会启动您指定个数的N个线程,每个线程中包含一个模型实例,换句话说每个模型相当于包含N个线程的线程池,从线程池的任务队列中取任务来执行。
异步模式中,各个RPC Server的线程只负责将Request请求放入模型线程池的任务队列中,等任务被执行完毕后,再从任务队列中取出已完成的任务。
上表中通过 --thread 10 指定的是RPC Server的线程数量,默认值为2,--op_num 指定的是各个模型的线程池中线程数N,默认值为0,表示不使用异步模式。
--op_max_batch 指定的各个模型的batch数量,默认值为32,该参数只有当--op_num不为0时才生效。
#### 当您的某个模型想使用多张GPU卡部署时.
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --gpu_ids 0,1,2
#### 当您的一个服务包含两个模型部署时.
python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292
#### 当您的一个服务包含两个模型,且每个模型都需要指定多张GPU卡部署时.
python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 --gpu_ids 0,1 1,2
#### 当您的一个服务包含两个模型,且每个模型都需要指定多张GPU卡,且需要异步模式每个模型指定不同的并发数时.
python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 --gpu_ids 0,1 1,2 --op_num 4 8
</center>
``` python
# A user can visit rpc service through paddle_serving_client API
from paddle_serving_client import Client
client = Client()
client.load_client_config("uci_housing_client/serving_client_conf.prototxt")
client.connect(["127.0.0.1:9292"])
data = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727,
-0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]
fetch_map = client.predict(feed={"x": np.array(data).reshape(1,13,1)}, fetch=["price"])
print(fetch_map)
```
在这里,`client.predict`函数具有两个参数。 `feed`是带有模型输入变量别名和值的`python dict``fetch`被要从服务器返回的预测变量赋值。 在该示例中,在训练过程中保存可服务模型时,被赋值的tensor名为`"x"``"price"`
<h3 align="center">HTTP服务</h3>
用户也可以将数据格式处理逻辑放在服务器端进行,这样就可以直接用curl去访问服务,参考如下案例,在目录`python/examples/fit_a_line`.
```
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --name uci
```
客户端输入
```
curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"x": [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]}], "fetch":["price"]}' http://127.0.0.1:9292/uci/prediction
```
返回结果
```
{"result":{"price":[[18.901151657104492]]}}
```
<h3 align="center">Pipeline服务</h3>
Paddle Serving提供业界领先的多模型串联服务,强力支持各大公司实际运行的业务场景,参考 [OCR文字识别案例](python/examples/pipeline/ocr),在目录`python/examples/pipeline/ocr`
我们先获取两个模型
```
python3 -m paddle_serving_app.package --get_model ocr_rec
tar -xzvf ocr_rec.tar.gz
python3 -m paddle_serving_app.package --get_model ocr_det
tar -xzvf ocr_det.tar.gz
```
然后启动服务端程序,将两个串联的模型作为一个整体的服务。
```
python3 web_service.py
```
最终使用http的方式请求
```
python3 pipeline_http_client.py
```
也支持rpc的方式
```
python3 pipeline_rpc_client.py
```
输出
```
{'err_no': 0, 'err_msg': '', 'key': ['res'], 'value': ["['土地整治与土壤修复研究中心', '华南农业大学1素图']"]}
```
## Paddle Serving Quick Start Examples
(English|[简体中文](./Quick_Start_CN.md))
This quick start example is mainly for those users who already have a model to deploy, and we also provide a model that can be used for deployment. in case if you want to know how to complete the process from offline training to online service, please refer to the AiStudio tutorial above.
### Boston House Price Prediction model
get into the Serving git directory, and change dir to `fit_a_line`
``` shell
cd Serving/python/examples/fit_a_line
sh get_data.sh
```
Paddle Serving provides HTTP and RPC based service for users to access
### RPC service
A user can also start a RPC service with `paddle_serving_server.serve`. RPC service is usually faster than HTTP service, although a user needs to do some coding based on Paddle Serving's python client API. Note that we do not specify `--name` here.
``` shell
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292
```
<center>
| Argument | Type | Default | Description |
| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
| `thread` | int | `4` | Concurrency of current service |
| `port` | int | `9292` | Exposed port of current service to users |
| `model` | str | `""` | Path of paddle model directory to be served |
| `mem_optim_off` | - | - | Disable memory / graphic memory optimization |
| `ir_optim` | bool | False | Enable analysis and optimization of calculation graph |
| `use_mkl` (Only for cpu version) | - | - | Run inference with MKL |
| `use_trt` (Only for trt version) | - | - | Run inference with TensorRT |
| `use_lite` (Only for Intel x86 CPU or ARM CPU) | - | - | Run PaddleLite inference |
| `use_xpu` | - | - | Run PaddleLite inference with Baidu Kunlun XPU |
| `precision` | str | FP32 | Precision Mode, support FP32, FP16, INT8 |
| `use_calib` | bool | False | Only for deployment with TensorRT |
</center>
```python
# A user can visit rpc service through paddle_serving_client API
from paddle_serving_client import Client
import numpy as np
client = Client()
client.load_client_config("uci_housing_client/serving_client_conf.prototxt")
client.connect(["127.0.0.1:9292"])
data = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727,
-0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]
fetch_map = client.predict(feed={"x": np.array(data).reshape(1,13,1)}, fetch=["price"])
print(fetch_map)
```
Here, `client.predict` function has two arguments. `feed` is a `python dict` with model input variable alias name and values. `fetch` assigns the prediction variables to be returned from servers. In the example, the name of `"x"` and `"price"` are assigned when the servable model is saved during training.
### WEB service
Users can also put the data format processing logic on the server side, so that they can directly use curl to access the service, refer to the following case whose path is `python/examples/fit_a_line`
```
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --name uci
```
for client side,
```
curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"x": [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]}], "fetch":["price"]}' http://127.0.0.1:9292/uci/prediction
```
the response is
```
{"result":{"price":[[18.901151657104492]]}}
```
<h3 align="center">Pipeline Service</h3>
Paddle Serving provides industry-leading multi-model tandem services, which strongly supports the actual operating business scenarios of major companies, please refer to [OCR word recognition](./python/examples/pipeline/ocr).
we get two models
```
python3 -m paddle_serving_app.package --get_model ocr_rec
tar -xzvf ocr_rec.tar.gz
python3 -m paddle_serving_app.package --get_model ocr_det
tar -xzvf ocr_det.tar.gz
```
then we start server side, launch two models as one standalone web service
```
python3 web_service.py
```
http request
```
python3 pipeline_http_client.py
```
grpc request
```
python3 pipeline_rpc_client.py
```
output
```
{'err_no': 0, 'err_msg': '', 'key': ['res'], 'value': ["['土地整治与土壤修复研究中心', '华南农业大学1素图']"]}
```
\ No newline at end of file
# 如何在Docker中运行PaddleServing
(简体中文|[English](RUN_IN_DOCKER.md))
(简体中文|[English](Run_In_Docker_EN.md))
Docker最大的好处之一就是可移植性,可在多种操作系统和主流的云计算平台部署。使用Paddle Serving Docker镜像可在Linux、Mac和Windows平台部署。
......@@ -14,7 +14,7 @@ Docker(GPU版本需要在GPU机器上安装nvidia-docker)
### 获取镜像
参考[该文档](DOCKER_IMAGES_CN.md)获取镜像:
参考[该文档](Docker_Images_CN.md)获取镜像:
以CPU编译镜像为例
......@@ -59,9 +59,9 @@ docker exec -it test bash
### 安装PaddleServing
请参照首页的指导,下载对应版本的pip包。[最新安装包合集](LATEST_PACKAGES.md)
请参照首页的指导,下载对应版本的pip包。[最新安装包合集](Latest_Packages_CN.md)
## 注意事项
- 运行时镜像不能用于开发编译。如果想要从源码编译,请查看[如何编译PaddleServing](COMPILE.md)
- 运行时镜像不能用于开发编译。如果想要从源码编译,请查看[如何编译PaddleServing](Compile_CN.md)
- 由于Cuda10和Cuda9的环境受限于GCC版本,无法同时运行CPU版本的`paddle_serving_server`,因此如果想要在GPU环境中同时使用CPU版本的`paddle_serving_server`,请选择Cuda10.1,Cuda10.2和Cuda11版本的镜像。
# How to run PaddleServing in Docker
([简体中文](RUN_IN_DOCKER_CN.md)|English)
([简体中文](Run_In_Docker_CN.md)|English)
One of the biggest benefits of Docker is portability, which can be deployed on multiple operating systems and mainstream cloud computing platforms. The Paddle Serving Docker image can be deployed on Linux, Mac and Windows platforms.
......@@ -14,7 +14,7 @@ This document takes Python2 as an example to show how to run Paddle Serving in d
### Get docker image
Refer to [this document](DOCKER_IMAGES.md) for a docker image:
Refer to [this document](Docker_Images_EN.md) for a docker image:
```shell
docker pull registry.baidubce.com/paddlepaddle/serving:latest-devel
......@@ -41,7 +41,7 @@ The GPU version is basically the same as the CPU version, with only some differe
### Get docker image
Refer to [this document](DOCKER_IMAGES.md) for a docker image, the following is an example of an `cuda9.0-cudnn7` image:
Refer to [this document](Docker_Images_EN.md) for a docker image, the following is an example of an `cuda9.0-cudnn7` image:
```shell
docker pull registry.baidubce.com/paddlepaddle/serving:latest-cuda10.2-cudnn8-devel
......@@ -67,9 +67,9 @@ The `-p` option is to map the `9292` port of the container to the `9292` port of
The mirror comes with `paddle_serving_server_gpu`, `paddle_serving_client`, and `paddle_serving_app` corresponding to the mirror tag version. If users don’t need to change the version, they can use it directly, which is suitable for environments without extranet services.
If you need to change the version, please refer to the instructions on the homepage to download the pip package of the corresponding version. [LATEST_PACKAGES](./LATEST_PACKAGES.md)
If you need to change the version, please refer to the instructions on the homepage to download the pip package of the corresponding version. [LATEST_PACKAGES](./Latest_Packages_CN.md)
## Precautious
- Runtime images cannot be used for compilation. If you want to compile from source, refer to [COMPILE](COMPILE.md).
- Runtime images cannot be used for compilation. If you want to compile from source, refer to [COMPILE](Compile_EN.md).
- If you use Cuda9 and Cuda10 docker images, you cannot use `paddle_serving_server` CPU version at the same time, due to the limitation of gcc version. If you want to use both in one docker image, please choose images of Cuda10.1, Cuda10.2 and Cuda11.
# Paddle Serving使用百度昆仑芯片部署
(简体中文|[English](./BAIDU_KUNLUN_XPU_SERVING.md))
## Paddle Serving使用百度昆仑芯片部署
(简体中文|[English](./Run_On_XPU_EN.md))
Paddle Serving支持使用百度昆仑芯片进行预测部署。目前支持在百度昆仑芯片和arm服务器(如飞腾 FT-2000+/64), 或者百度昆仑芯片和Intel CPU服务器,上进行部署,后续完善对其他异构硬件服务器部署能力。
# 编译、安装
基本环境配置可参考[该文档](COMPILE_CN.md)进行配置。下面以飞腾FT-2000+/64机器为例进行介绍。
## 编译
## 编译、安装
基本环境配置可参考[该文档](Compile_CN.md)进行配置。下面以飞腾FT-2000+/64机器为例进行介绍。
### 编译
* 编译server部分
```
cd Serving
......@@ -50,23 +50,23 @@ cmake -DPYTHON_INCLUDE_DIR=/usr/include/python3.7m/ \
make -j10
```
## 安装wheel包
### 安装wheel包
以上编译步骤完成后,会在各自编译目录$build_dir/python/dist生成whl包,分别安装即可。例如server步骤,会在server-build-arm/python/dist目录下生成whl包, 使用命令```pip install -u xxx.whl```进行安装。
# 请求参数说明
## 请求参数说明
为了支持arm+xpu服务部署,使用Paddle-Lite加速能力,请求时需使用以下参数。
| 参数 | 参数说明 | 备注 |
| :------- | :-------------------------- | :--------------------------------------------------------------- |
| use_lite | 使用Paddle-Lite Engine | 使用Paddle-Lite cpu预测能力 |
| use_xpu | 使用Baidu Kunlun进行预测 | 该选项需要与use_lite配合使用 |
| ir_optim | 开启Paddle-Lite计算子图优化 | 详细见[Paddle-Lite](https://github.com/PaddlePaddle/Paddle-Lite) |
# 部署使用示例
## 下载模型
## 部署使用示例
### 下载模型
```
wget --no-check-certificate https://paddle-serving.bj.bcebos.com/uci_housing.tar.gz
tar -xzf uci_housing.tar.gz
```
## 启动rpc服务
### 启动rpc服务
主要有三种启动配置:
* 使用cpu+xpu部署,使用Paddle-Lite xpu优化加速能力;
* 单独使用cpu部署,使用Paddle-Lite优化加速能力;
......@@ -86,7 +86,7 @@ python3 -m paddle_serving_server.serve --model uci_housing_model --thread 6 --po
```
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 6 --port 9292
```
## client调用
### client调用
```
from paddle_serving_client import Client
import numpy as np
......@@ -98,9 +98,9 @@ data = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727,
fetch_map = client.predict(feed={"x": np.array(data).reshape(1,13,1)}, fetch=["price"])
print(fetch_map)
```
# 其他说明
## 其他说明
## 模型实例及说明
### 模型实例及说明
以下提供部分样例,其他模型可参照进行修改。
| 示例名称 | 示例链接 |
| :--------- | :---------------------------------------------------------- |
......@@ -108,6 +108,7 @@ print(fetch_map)
| resnet | [resnet_v2_50_xpu](../python/examples/xpu/resnet_v2_50_xpu) |
注:支持昆仑芯片部署模型列表见[链接](https://paddlelite.paddlepaddle.org.cn/introduction/support_model_list.html)。不同模型适配上存在差异,可能存在不支持的情况,部署使用存在问题时,欢迎以[Github issue](https://github.com/PaddlePaddle/Serving/issues),我们会实时跟进。
## 昆仑芯片支持相关参考资料
### 昆仑芯片支持相关参考资料
* [昆仑XPU芯片运行飞桨](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/xpu_docs/index_cn.html)
* [PaddleLite使用百度XPU预测部署](https://paddlelite.paddlepaddle.org.cn/demo_guides/baidu_xpu.html)
# Paddle Serving Using Baidu Kunlun Chips
(English|[简体中文](./BAIDU_KUNLUN_XPU_SERVING_CN.md))
## Paddle Serving Using Baidu Kunlun Chips
(English|[简体中文](./Run_On_XPU_CN.md))
Paddle serving supports deployment using Baidu Kunlun chips. Currently, it supports deployment on the ARM CPU server with Baidu Kunlun chips
(such as Phytium FT-2000+/64), or Intel CPU with Baidu Kunlun chips. We will improve
the deployment capability on various heterogeneous hardware servers in the future.
# Compilation and installation
Refer to [compile](COMPILE.md) document to setup the compilation environment. The following is based on FeiTeng FT-2000 +/64 platform.
## Compilatiton
## Compilation and installation
Refer to [compile](Compile.md) document to setup the compilation environment. The following is based on FeiTeng FT-2000 +/64 platform.
### Compilatiton
* Compile the Serving Server
```
cd Serving
......@@ -52,11 +53,11 @@ cmake -DPYTHON_INCLUDE_DIR=/usr/include/python3.7m/ \
make -j10
```
## Install the wheel package
### Install the wheel package
After the compilations stages above, the whl package will be generated in ```python/dist/``` under the specific temporary directories.
For example, after the Server Compiation step,the whl package will be produced under the server-build-arm/python/dist directory, and you can run ```pip install -u python/dist/*.whl``` to install the package.
# Request parameters description
## Request parameters description
In order to deploy serving
service on the arm server with Baidu Kunlun xpu chips and use the acceleration capability of Paddle-Lite,please specify the following parameters during deployment.
| param | param description | about |
......@@ -64,13 +65,13 @@ In order to deploy serving
| use_lite | using Paddle-Lite Engine | use the inference capability of Paddle-Lite |
| use_xpu | using Baidu Kunlun for inference | need to be used with the use_lite option |
| ir_optim | open the graph optimization | refer to[Paddle-Lite](https://github.com/PaddlePaddle/Paddle-Lite) |
# Deplyment examples
## Download the model
## Deplyment examples
### Download the model
```
wget --no-check-certificate https://paddle-serving.bj.bcebos.com/uci_housing.tar.gz
tar -xzf uci_housing.tar.gz
```
## Start RPC service
### Start RPC service
There are mainly three deployment methods:
* deploy on the cpu server with Baidu xpu using the acceleration capability of Paddle-Lite and xpu;
* deploy on the cpu server standalone with Paddle-Lite;
......@@ -90,7 +91,7 @@ Start the rpc service, deploying on cpu server.
```
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 6 --port 9292
```
##
###
```
from paddle_serving_client import Client
import numpy as np
......@@ -102,8 +103,8 @@ data = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727,
fetch_map = client.predict(feed={"x": np.array(data).reshape(1,13,1)}, fetch=["price"])
print(fetch_map)
```
# Others
## Model example and explanation
## Others
### Model example and explanation
Some examples are provided below, and other models can be modifed with reference to these examples.
| sample name | sample links |
......@@ -113,6 +114,6 @@ Some examples are provided below, and other models can be modifed with reference
Note:Supported model lists refer to [doc](https://paddlelite.paddlepaddle.org.cn/introduction/support_model_list.html). There are differences in the adaptation of different models, and there may be some unsupported cases. If you have any problem,please submit [Github issue](https://github.com/PaddlePaddle/Serving/issues), and we will follow up in real time.
## Kunlun chip related reference materials
### Kunlun chip related reference materials
* [PaddlePaddle on Baidu Kunlun xpu chips](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/xpu_docs/index_cn.html)
* [Deployment on Baidu Kunlun xpu chips using PaddleLite](https://paddlelite.paddlepaddle.org.cn/demo_guides/baidu_xpu.html)
# Serving Side Configuration
Paddle Serving配置文件格式采用明文格式的protobuf文件,配置文件的每个字段都需要事先在configure/proto/目录下相关.proto定义中定义好,才能被protobuf读取和解析到。
Serving端的所有配置均在configure/proto/server_configure.proto文件中。
## 1. service.prototxt
Serving端service 配置的入口是service.prototxt,用于配置Paddle Serving实例挂载的service列表。他的protobuf格式可参考`configure/server_configure.protobuf``InferServiceConf`类型。(至于具体的磁盘文件路径可通过--inferservice_path与--inferservice_file 命令行选项修改),样例如下:
```JSON
port: 8010
services {
name: "ImageClassifyService"
workflows: "workflow1"
}
```
其中
- port: 该字段标明本机serving实例启动的监听端口。默认为8010。还可以通过--port=8010命令行参数指定。
- services: 可以配置多个services。Paddle Serving被设计为单个Serving实例可以同时承载多个预测服务,服务间通过service name进行区分。例如以下代码配置2个预测服务:
```JSON
port: 8010
services {
name: "ImageClassifyService"
workflows: "workflow1"
}
services {
name: "BuiltinEchoService"
workflows: "workflow2"
}
```
- service.name: 请填写serving/proto/xx.proto文件的service名称,例如,在serving/proto/image_class.proto中,service名称如下声明:
```JSON
service ImageClassifyService {
rpc inference(Request) returns (Response);
rpc debug(Request) returns (Response);
option (pds.options).generate_impl = true;
};
```
则service name就是`ImageClassifyService`
- service.workflows: 用于指定该service下所配的workflow列表。可以配置多个workflow。在本例中,为`ImageClassifyService`配置了一个workflow:`workflow1``workflow1`的具体定义在workflow.prototxt
## 2. workflow.prototxt
workflow.prototxt用来描述每一个具体的workflow,他的protobuf格式可参考`configure/server_configure.protobuf``Workflow`类型。具体的磁盘文件路径可通过`--workflow_path``--workflow_file`指定。一个例子如下:
```JSON
workflows {
name: "workflow1"
workflow_type: "Sequence"
nodes {
name: "image_reader_op"
type: "ReaderOp"
}
nodes {
name: "image_classify_op"
type: "ClassifyOp"
dependencies {
name: "image_reader_op"
mode: "RO"
}
}
nodes {
name: "write_json_op"
type: "WriteJsonOp"
dependencies {
name: "image_classify_op"
mode: "RO"
}
}
}
workflows {
name: "workflow2"
workflow_type: "Sequence"
nodes {
name: "echo_op"
type: "CommonEchoOp"
}
}
```
以上样例配置了2个workflow:`workflow1``workflow2`。以`workflow1`为例:
- name: workflow名称,用于从service.prototxt索引到具体的workflow
- workflow_type: 可选"Sequence", "Parallel",表示本workflow下节点所代表的OP是否可并行。**当前只支持Sequence类型,如配置了Parallel类型,则该workflow不会被执行**
- nodes: 用于串联成workflow的所有节点,可配置多个nodes。nodes间通过配置dependencies串联起来
- node.name: 随意,建议取一个能代表当前node所执行OP的类
- node.type: 当前node所执行OP的类名称,与serving/op/下每个具体的OP类的名称对应
- node.dependencies: 依赖的上游node列表
- node.dependencies.name: 与workflow内节点的name保持一致
- node.dependencies.mode: RO-Read Only, RW-Read Write
# 3. resource.prototxt
Serving端resource配置的入口是resource.prototxt,用于配置模型信息。它的protobuf格式参考`configure/proto/server_configure.proto`的ResourceConf。具体的磁盘文件路径可用`--resource_path``--resource_file`指定。样例如下:
```JSON
model_toolkit_path: "./conf"
model_toolkit_file: "model_toolkit.prototxt"
cube_config_file: "./conf/cube.conf"
```
其中:
- model_toolkit_path:用来指定model_toolkit.prototxt所在的目录
- model_toolkit_file: 用来指定model_toolkit.prototxt所在的文件名
- cube_config_file: 用来指定cube配置文件所在路径与文件名
Cube是Paddle Serving中用于大规模稀疏参数的组件。
# 4. model_toolkit.prototxt
用来配置模型信息和所用的预测引擎。它的protobuf格式参考`configure/proto/server_configure.proto`的ModelToolkitConf。model_toolkit.protobuf的磁盘路径不能通过命令行参数覆盖。样例如下:
```JSON
engines {
name: "image_classification_resnet"
type: "FLUID_CPU_NATIVE_DIR"
reloadable_meta: "./data/model/paddle/fluid_time_file"
reloadable_type: "timestamp_ne"
model_data_path: "./data/model/paddle/fluid/SE_ResNeXt50_32x4d"
runtime_thread_num: 0
batch_infer_size: 0
enable_batch_align: 0
sparse_param_service_type: LOCAL
sparse_param_service_table_name: "local_kv"
enable_memory_optimization: true
static_optimization: false
force_update_static_cache: false
}
```
其中
- name: 模型名称。InferManager通过此名称,找到要使用的模型和预测引擎。可参考serving/op/classify_op.h与serving/op/classify_op.cpp的InferManager::instance().infer()方法的参数来了解。
- type: 预测引擎的类型。可在inferencer-fluid-cpu/src/fluid_cpu_engine.cpp找到当前注册的预测引擎列表
|预测引擎|含义|
|--------|----|
|FLUID_CPU_ANALYSIS|使用fluid Analysis API;模型所有参数保存在一个文件|
|FLUID_CPU_ANALYSIS_DIR|使用fluid Analysis API;模型所有参数分开保存为独立的文件,整个模型放到一个目录中|
|FLUID_CPU_NATIVE|使用fluid Native API;模型所有参数保存在一个文件|
|FLUID_CPU_NATIVE_DIR|使用fluid Native API;模型所有参数分开保存为独立的文件,整个模型放到一个目录中|
|FLUID_GPU_ANALYSIS|GPU预测,使用fluid Analysis API;模型所有参数保存在一个文件|
|FLUID_GPU_ANALYSIS_DIR|GPU预测,使用fluid Analysis API;模型所有参数分开保存为独立的文件,整个模型放到一个目录中|
|FLUID_GPU_NATIVE|GPU预测,使用fluid Native API;模型所有参数保存在一个文件|
|FLUID_GPU_NATIVE_DIR|GPU预测,使用fluid Native API;模型所有参数分开保存为独立的文件,整个模型放到一个目录中|
**fluid Analysis API和fluid Native API的区别**
Analysis API在模型加载过程中,会对模型计算逻辑进行多种优化,包括但不限于zero copy tensor,相邻OP的fuse等。**但优化逻辑不是一定对所有模型都有加速作用,有时甚至会有反作用,请以实测结果为准**
- reloadable_meta: 目前实际内容无意义,用来通过对该文件的mtime判断是否超过reload时间阈值
- reloadable_type: 检查reload条件:timestamp_ne/timestamp_gt/md5sum/revision/none
|reloadable_type|含义|
|---------------|----|
|timestamp_ne|reloadable_meta所指定文件的mtime时间戳发生变化|
|timestamp_gt|reloadable_meta所指定文件的mtime时间戳大于等于上次检查时记录的mtime时间戳|
|md5sum|目前无用,配置后永远不reload|
|revision|目前无用,配置后用于不reload|
- model_data_path: 模型文件路径
- runtime_thread_num: 若大于0, 则启用bsf多线程调度框架,在每个预测bthread worker内启动多线程预测。要注意的是,当启用worker内多线程预测,workflow中OP需要用Serving框架的BatchTensor类做预测的输入和输出 (predictor/framework/infer_data.h, `class BatchTensor`)。
- batch_infer_size: 启用bsf多线程预测时,每个预测线程的batch size
- enable_batch_align:
- sparse_param_service_type: 枚举类型,可选参数,大规模稀疏参数服务类型
|sparse_param_service_type|含义|
|-------------------------|--|
|NONE|不使用大规模稀疏参数服务|
|LOCAL|单机本地大规模稀疏参数服务,以rocksdb作为引擎|
|REMOTE|分布式大规模稀疏参数服务,以Cube作为引擎|
- sparse_param_service_table_name: 可选参数,大规模稀疏参数服务承载本模型所用参数的表名。
- enable_memory_optimization: bool类型,可选参数,是否启用内存优化。只在使用fluid Analysis预测API时有意义。需要说明的是,在GPU预测时,会执行显存优化
- static_optimization: bool类型,是否执行静态优化。只有当启用内存优化时有意义。
- force_update_static_cache: bool类型,是否强制更新静态优化cache。只有当启用内存优化时有意义。
## 5. 命令行配置参数
以下是serving端支持的gflag配置选项列表,并提供了默认值。
| name | 默认值 | 含义 |
|------|--------|------|
|workflow_path|./conf|workflow配置目录名|
|workflow_file|workflow.prototxt|workflow配置文件名|
|inferservice_path|./conf|service配置目录名|
|inferservice_file|service.prototxt|service配置文件名|
|resource_path|./conf|资源管理器目录名|
|resource_file|resource.prototxt|资源管理器文件名|
|reload_interval_s|10|重载线程间隔时间(s)|
|enable_model_toolkit|true|模型管理|
|enable_protocol_list|baidu_std|brpc 通信协议列表|
|log_dir|./log|log dir|
|num_threads||brpc server使用的系统线程数,默认为CPU核数|
|port|8010|Serving进程接收请求监听端口|
|gpuid|0|GPU预测时指定Serving进程使用的GPU device id。只允许绑定1张GPU卡|
|bthread_concurrency|9|BRPC底层bthread的concurrency。在使用GPU预测引擎时,为了限制并发worker数,可使用此参数|
|bthread_min_concurrency|4|BRPC底层bthread的min concurrency。在使用GPU预测引擎时,为限制并发worker数,可使用此参数。与bthread_concurrency结合使用|
可以通过在serving/conf/gflags.conf覆盖默认值,例如
```
--log_dir=./serving_log/
```
将指定日志目录到./serving_log目录下
### 5.1 gflags.conf
可以将命令行配置参数写到配置文件中,该文件路径默认为`conf/gflags.conf`。如果`conf/gflags.conf`存在,则serving端会尝试解析其中的gflags命令。例如
```shell
--enable_model_toolkit
--port=8011
```
可用以下命令指定另外的命令行参数配置文件
```shell
bin/serving --g=true --flagfile=conf/gflags.conf.new
```
# 怎样保存用于Paddle Serving的模型?
(简体中文|[English](./SAVE.md))
(简体中文|[English](./Save_EN.md))
## 从已保存的模型文件中导出
如果已使用Paddle 的`save_inference_model`接口保存出预测要使用的模型,你可以使用Paddle Serving提供的名为`paddle_serving_client.convert`的内置模块进行转换。
......
# How to save a servable model of Paddle Serving?
([简体中文](./SAVE_CN.md)|English)
([简体中文](./Save_CN.md)|English)
## Export from saved model files
......
......@@ -63,7 +63,7 @@ ee59a3dd4806 registry.baidubce.com/serving_dev/serving-runtime:cpu-py36
### Step 1:启动Serving服务
我们仍然以 [Uci房价预测](../python/examples/fit_a_line)服务作为例子,这里省略了镜像制作的过程,详情可以参考 [在Kubernetes集群上部署Paddle Serving](./PADDLE_SERVING_ON_KUBERNETES.md)
我们仍然以 [Uci房价预测](../python/examples/fit_a_line)服务作为例子,这里省略了镜像制作的过程,详情可以参考 [在Kubernetes集群上部署Paddle Serving](./Run_On_Kubernetes.md)
在这里我们直接执行
```
......
# Serving Configuration
(简体中文|[English](Serving_Configure_EN.md))
## 简介
本文主要介绍C++ Serving以及Python Pipeline的各项配置:
- [模型配置文件](#模型配置文件): 转换模型时自动生成,描述模型输入输出信息
- [C++ Serving](#c-serving): 用于高性能场景,介绍了快速启动以及自定义配置方法
- [Python Pipeline](#python-pipeline): 用于单算子多模型组合场景
## 模型配置文件
在开始介绍Server配置之前,先来介绍一下模型配置文件。我们在将模型转换为PaddleServing模型时,会生成对应的serving_client_conf.prototxt以及serving_server_conf.prototxt,两者内容一致,为模型输入输出的参数信息,方便用户拼装参数。该配置文件用于Server以及Client,并不需要用户自行修改。转换方法参考文档《[怎样保存用于Paddle Serving的模型](SAVE_CN.md)》。protobuf格式可参考`core/configure/proto/general_model_config.proto`
样例如下:
```
feed_var {
name: "x"
alias_name: "x"
is_lod_tensor: false
feed_type: 1
shape: 13
}
fetch_var {
name: "concat_1.tmp_0"
alias_name: "concat_1.tmp_0"
is_lod_tensor: false
fetch_type: 1
shape: 3
shape: 640
shape: 640
}
```
其中
- feed_var:模型输入
- fetch_var:模型输出
- name:名称
- alias_name:别名,与名称对应
- is_lod_tensor:是否为lod,具体可参考《[Lod字段说明](LOD_CN.md)
- feed_type:数据类型
|feed_type|类型|
|---------|----|
|0|INT64|
|1|FLOAT32|
|2|INT32|
|3|FP64|
|4|INT16|
|5|FP16|
|6|BF16|
|7|UINT8|
|8|INT8|
- shape:数据维度
## C++ Serving
### 1.快速启动
可以通过配置模型及端口号快速启动服务,启动命令如下:
```BASH
python3 -m paddle_serving_server.serve --model serving_model --port 9393
```
该命令会自动生成配置文件,并使用生成的配置文件启动C++ Serving。例如上述启动命令会自动生成workdir_9393目录,其结构如下
```
workdir_9393
├── general_infer_0
│   ├── fluid_time_file
│   ├── general_model.prototxt
│   └── model_toolkit.prototxt
├── infer_service.prototxt
├── resource.prototxt
└── workflow.prototxt
```
更多启动参数详见下表:
| Argument | Type | Default | Description |
| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
| `thread` | int | `2` | Number of brpc service thread |
| `op_num` | int[]| `0` | Thread Number for each model in asynchronous mode |
| `op_max_batch` | int[]| `32` | Batch Number for each model in asynchronous mode |
| `gpu_ids` | str[]| `"-1"` | Gpu card id for each model |
| `port` | int | `9292` | Exposed port of current service to users |
| `model` | str[]| `""` | Path of paddle model directory to be served |
| `mem_optim_off` | - | - | Disable memory / graphic memory optimization |
| `ir_optim` | bool | False | Enable analysis and optimization of calculation graph |
| `use_mkl` (Only for cpu version) | - | - | Run inference with MKL |
| `use_trt` (Only for trt version) | - | - | Run inference with TensorRT. Need open with ir_optim. |
| `use_lite` (Only for Intel x86 CPU or ARM CPU) | - | - | Run PaddleLite inference. Need open with ir_optim. |
| `use_xpu` | - | - | Run PaddleLite inference with Baidu Kunlun XPU. Need open with ir_optim. |
| `precision` | str | FP32 | Precision Mode, support FP32, FP16, INT8 |
| `use_calib` | bool | False | Use TRT int8 calibration |
| `gpu_multi_stream` | bool | False | EnableGpuMultiStream to get larger QPS |
#### 当您的某个模型想使用多张GPU卡部署时.
```BASH
python3 -m paddle_serving_server.serve --model serving_model --thread 10 --port 9292 --gpu_ids 0,1,2
```
#### 当您的一个服务包含两个模型部署时.
```BASH
python3 -m paddle_serving_server.serve --model serving_model_1 serving_model_2 --thread 10 --port 9292
```
### 2.自定义配置启动
一般情况下,自动生成的配置可以应对大部分场景。对于特殊场景,用户也可自行定义配置文件。这些配置文件包括service.prototxt、workflow.prototxt、resource.prototxt、model_toolkit.prototxt、proj.conf。启动命令如下:
```BASH
/bin/serving --flagfile=proj.conf
```
#### 2.1 proj.conf
proj.conf用于传入服务参数,并指定了其他相关配置文件的路径。如果重复传入参数,则以最后序参数值为准。
```
# for paddle inference
--precision=fp32
--use_calib=False
--reload_interval_s=10
# for brpc
--max_concurrency=0
--num_threads=10
--bthread_concurrency=10
--max_body_size=536870912
# default path
--inferservice_path=conf
--inferservice_file=infer_service.prototxt
--resource_path=conf
--resource_file=resource.prototxt
--workflow_path=conf
--workflow_file=workflow.prototxt
```
各项参数的描述及默认值详见下表:
| name | Default | Description |
|------|--------|------|
|precision|"fp32"|Precision Mode, support FP32, FP16, INT8|
|use_calib|False|Only for deployment with TensorRT|
|reload_interval_s|10|Reload interval|
|max_concurrency|0|Limit of request processing in parallel, 0: unlimited|
|num_threads|10|Number of brpc service thread|
|bthread_concurrency|10|Number of bthread|
|max_body_size|536870912|Max size of brpc message|
|inferservice_path|"conf"|Path of inferservice conf|
|inferservice_file|"infer_service.prototxt"|Filename of inferservice conf|
|resource_path|"conf"|Path of resource conf|
|resource_file|"resource.prototxt"|Filename of resource conf|
|workflow_path|"conf"|Path of workflow conf|
|workflow_file|"workflow.prototxt"|Filename of workflow conf|
#### 2.2 service.prototxt
service.prototxt用于配置Paddle Serving实例挂载的service列表。通过`--inferservice_path``--inferservice_file`指定加载路径。protobuf格式可参考`core/configure/server_configure.protobuf``InferServiceConf`。示例如下:
```
port: 8010
services {
name: "GeneralModelService"
workflows: "workflow1"
}
```
其中:
- port: 用于配置Serving实例监听的端口号。
- services: 使用默认配置即可,不可修改。name指定service名称,workflow1的具体定义在workflow.prototxt
#### 2.3 workflow.prototxt
workflow.prototxt用来描述具体的workflow。通过`--workflow_path``--workflow_file`指定加载路径。protobuf格式可参考`configure/server_configure.protobuf``Workflow`类型。
如下示例,workflow由3个OP构成,GeneralReaderOp用于读取数据,GeneralInferOp依赖于GeneralReaderOp并进行预测,GeneralResponseOp将预测结果返回:
```
workflows {
name: "workflow1"
workflow_type: "Sequence"
nodes {
name: "general_reader_0"
type: "GeneralReaderOp"
}
nodes {
name: "general_infer_0"
type: "GeneralInferOp"
dependencies {
name: "general_reader_0"
mode: "RO"
}
}
nodes {
name: "general_response_0"
type: "GeneralResponseOp"
dependencies {
name: "general_infer_0"
mode: "RO"
}
}
}
```
其中:
- name: workflow名称,用于从service.prototxt索引到具体的workflow
- workflow_type: 只支持"Sequence"
- nodes: 用于串联成workflow的所有节点,可配置多个nodes。nodes间通过配置dependencies串联起来
- node.name: 与node.type一一对应,具体可参考`python/paddle_serving_server/dag.py`
- node.type: 当前node所执行OP的类名称,与serving/op/下每个具体的OP类的名称对应
- node.dependencies: 依赖的上游node列表
- node.dependencies.name: 与workflow内节点的name保持一致
- node.dependencies.mode: RO-Read Only, RW-Read Write
#### 2.4 resource.prototxt
resource.prototxt,用于指定模型配置文件。通过`--resource_path``--resource_file`指定加载路径。它的protobuf格式参考`core/configure/proto/server_configure.proto``ResourceConf`。示例如下:
```
model_toolkit_path: "conf"
model_toolkit_file: "general_infer_0/model_toolkit.prototxt"
general_model_path: "conf"
general_model_file: "general_infer_0/general_model.prototxt"
```
其中:
- model_toolkit_path:用来指定model_toolkit.prototxt所在的目录
- model_toolkit_file: 用来指定model_toolkit.prototxt所在的文件名
- general_model_path: 用来指定general_model.prototxt所在的目录
- general_model_file: 用来指定general_model.prototxt所在的文件名
#### 2.5 model_toolkit.prototxt
用来配置模型信息和预测引擎。它的protobuf格式参考`core/configure/proto/server_configure.proto`的ModelToolkitConf。model_toolkit.protobuf的磁盘路径不能通过命令行参数覆盖。示例如下:
```
engines {
name: "general_infer_0"
type: "PADDLE_INFER"
reloadable_meta: "uci_housing_model/fluid_time_file"
reloadable_type: "timestamp_ne"
model_dir: "uci_housing_model"
gpu_ids: -1
enable_memory_optimization: true
enable_ir_optimization: false
use_trt: false
use_lite: false
use_xpu: false
use_gpu: false
combined_model: false
gpu_multi_stream: false
runtime_thread_num: 0
batch_infer_size: 32
enable_overrun: false
allow_split_request: true
}
```
其中
- name: 引擎名称,与workflow.prototxt中的node.name以及所在目录名称对应
- type: 预测引擎的类型。当前只支持”PADDLE_INFER“
- reloadable_meta: 目前实际内容无意义,用来通过对该文件的mtime判断是否超过reload时间阈值
- reloadable_type: 检查reload条件:timestamp_ne/timestamp_gt/md5sum/revision/none
|reloadable_type|含义|
|---------------|----|
|timestamp_ne|reloadable_meta所指定文件的mtime时间戳发生变化|
|timestamp_gt|reloadable_meta所指定文件的mtime时间戳大于等于上次检查时记录的mtime时间戳|
|md5sum|目前无用,配置后永远不reload|
|revision|目前无用,配置后用于不reload|
- model_dir: 模型文件路径
- gpu_ids: 引擎运行时使用的GPU device id,支持指定多个,如:
```
# 指定GPU0,1,2
gpu_ids: 0
gpu_ids: 1
gpu_ids: 2
```
- enable_memory_optimization: 是否开启memory优化
- enable_ir_optimization: 是否开启ir优化
- use_trt: 是否开启TensorRT,需同时开启use_gpu
- use_lite: 是否开启PaddleLite
- use_xpu: 是否使用昆仑XPU
- use_gpu:是否使用GPU
- combined_model: 是否使用组合模型文件
- gpu_multi_stream: 是否开启gpu多流模式
- runtime_thread_num: 若大于0, 则启用Async异步模式,并创建对应数量的predictor实例。
- batch_infer_size: Async异步模式下的最大batch数
- enable_overrun: Async异步模式下总是将整个任务放入任务队列
- allow_split_request: Async异步模式下允许拆分任务
#### 2.6 general_model.prototxt
general_model.prototxt内容与模型配置serving_server_conf.prototxt相同,用了描述模型输入输出参数信息。示例如下:
```
feed_var {
name: "x"
alias_name: "x"
is_lod_tensor: false
feed_type: 1
shape: 13
}
fetch_var {
name: "fc_0.tmp_1"
alias_name: "price"
is_lod_tensor: false
fetch_type: 1
shape: 1
}
```
## Python Pipeline
Python Pipeline提供了用户友好的多模型组合服务编程框架,适用于多模型组合应用的场景。
其配置文件为YAML格式,一般默认为config.yaml。示例如下:
```YAML
#rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1
rpc_port: 18090
#http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port
http_port: 9999
#worker_num, 最大并发数。当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG
##当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num
worker_num: 20
#build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG
build_dag_each_worker: false
dag:
#op资源类型, True, 为线程模型;False,为进程模型
is_thread_op: False
#重试次数
retry: 1
#使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用
use_profile: false
tracer:
interval_s: 10
op:
det:
#并发数,is_thread_op=True时,为线程并发;否则为进程并发
concurrency: 6
#当op配置没有server_endpoints时,从local_service_conf读取本地服务配置
local_service_conf:
#client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测
client_type: local_predictor
#det模型路径
model_config: ocr_det_model
#Fetch结果列表,以client_config中fetch_var的alias_name为准
fetch_list: ["concat_1.tmp_0"]
#计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡
devices: ""
# device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
device_type: 0
#use_mkldnn
#use_mkldnn: True
#ir_optim
ir_optim: True
rec:
#并发数,is_thread_op=True时,为线程并发;否则为进程并发
concurrency: 3
#超时时间, 单位ms
timeout: -1
#Serving交互重试次数,默认不重试
retry: 1
#当op配置没有server_endpoints时,从local_service_conf读取本地服务配置
local_service_conf:
#client类型,包括brpc, grpc和local_predictor。local_predictor不启动Serving服务,进程内预测
client_type: local_predictor
#rec模型路径
model_config: ocr_rec_model
#Fetch结果列表,以client_config中fetch_var的alias_name为准
fetch_list: ["ctc_greedy_decoder_0.tmp_0", "softmax_0.tmp_0"]
#计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡
devices: ""
# device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
device_type: 0
#use_mkldnn
#use_mkldnn: True
#ir_optim
ir_optim: True
```
### 单机多卡
单机多卡推理,M个OP进程与N个GPU卡绑定,需要在config.ymal中配置3个参数。首先选择进程模式,这样并发数即进程数,然后配置devices。绑定方法是进程启动时遍历GPU卡ID,例如启动7个OP进程,设置了0,1,2三个device id,那么第1、4、7个启动的进程与0卡绑定,第2、5进程与1卡绑定,3、6进程与卡2绑定。
```YAML
#op资源类型, True, 为线程模型;False,为进程模型
is_thread_op: False
#并发数,is_thread_op=True时,为线程并发;否则为进程并发
concurrency: 7
devices: "0,1,2"
```
### 异构硬件
Python Pipeline除了支持CPU、GPU之外,还支持多种异构硬件部署。在config.yaml中由device_type和devices控制。优先使用device_type指定,当其空缺时根据devices自动判断类型。device_type描述如下:
- CPU(Intel) : 0
- GPU : 1
- TensorRT : 2
- CPU(Arm) : 3
- XPU : 4
config.yml中硬件配置:
```YAML
#计算硬件类型: 空缺时由devices决定(CPU/GPU),0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
device_type: 0
#计算硬件ID,优先由device_type决定硬件类型。devices为""或空缺时为CPU预测;当为"0", "0,1,2"时为GPU预测,表示使用的GPU卡
devices: "" # "0,1"
```
### 低精度推理
Python Pipeline支持低精度推理,CPU、GPU和TensoRT支持的精度类型如下所示:
- CPU
- fp32(default)
- fp16
- bf16(mkldnn)
- GPU
- fp32(default)
- fp16(TRT下有效)
- int8
- Tensor RT
- fp32(default)
- fp16
- int8
```YAML
#precsion, 预测精度,降低预测精度可提升预测速度
#GPU 支持: "fp32"(default), "fp16(TensorRT)", "int8";
#CPU 支持: "fp32"(default), "fp16", "bf16"(mkldnn); 不支持: "int8"
precision: "fp32"
```
\ No newline at end of file
# Serving Configuration
([简体中文](Serving_Configure_CN.md)|English)
## Overview
This guide focuses on Paddle C++ Serving and Python Pipeline configuration:
- [Model Configuration](#model-configuration): Auto generated when converting model. Specify model input/output.
- [C++ Serving](#c-serving): High-performance scenarios. Specify how to start quickly and start with user-defined configuration.
- [Python Pipeline](#python-pipeline): Multiple model combined scenarios.
## Model Configuration
The model configuration is generated by converting PaddleServing model and named serving_client_conf.prototxt/serving_server_conf.prototxt. It specifies the info of input/output so that users can fill parameters easily. The model configuration file should not be modified. See the [Saving guide](Save_EN.md) for model converting. The model configuration file provided must be a `core/configure/proto/general_model_config.proto`.
Example:
```
feed_var {
name: "x"
alias_name: "x"
is_lod_tensor: false
feed_type: 1
shape: 13
}
fetch_var {
name: "concat_1.tmp_0"
alias_name: "concat_1.tmp_0"
is_lod_tensor: false
fetch_type: 1
shape: 3
shape: 640
shape: 640
}
```
- feed_var:model input
- fetch_var:model output
- name:node name
- alias_name:alias name
- is_lod_tensor:lod tensor, ref to [Lod Introduction](LOD_EN.md)
- feed_type/fetch_type:data type
|feed_type|类型|
|---------|----|
|0|INT64|
|1|FLOAT32|
|2|INT32|
|3|FP64|
|4|INT16|
|5|FP16|
|6|BF16|
|7|UINT8|
|8|INT8|
- shape:tensor shape
## C++ Serving
### 1. Quick start
The easiest way to start c++ serving is to provide the `--model` and `--port` flags.
Example starting c++ serving:
```BASH
python3 -m paddle_serving_server.serve --model serving_model --port 9393
```
This command will generate the server configuration files as `workdir_9393`:
```
workdir_9393
├── general_infer_0
│   ├── fluid_time_file
│   ├── general_model.prototxt
│   └── model_toolkit.prototxt
├── infer_service.prototxt
├── resource.prototxt
└── workflow.prototxt
```
More flags:
| Argument | Type | Default | Description |
| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
| `thread` | int | `2` | Number of brpc service thread |
| `op_num` | int[]| `0` | Thread Number for each model in asynchronous mode |
| `op_max_batch` | int[]| `32` | Batch Number for each model in asynchronous mode |
| `gpu_ids` | str[]| `"-1"` | Gpu card id for each model |
| `port` | int | `9292` | Exposed port of current service to users |
| `model` | str[]| `""` | Path of paddle model directory to be served |
| `mem_optim_off` | - | - | Disable memory / graphic memory optimization |
| `ir_optim` | bool | False | Enable analysis and optimization of calculation graph |
| `use_mkl` (Only for cpu version) | - | - | Run inference with MKL |
| `use_trt` (Only for trt version) | - | - | Run inference with TensorRT. Need open with ir_optim. |
| `use_lite` (Only for Intel x86 CPU or ARM CPU) | - | - | Run PaddleLite inference. Need open with ir_optim. |
| `use_xpu` | - | - | Run PaddleLite inference with Baidu Kunlun XPU. Need open with ir_optim. |
| `precision` | str | FP32 | Precision Mode, support FP32, FP16, INT8 |
| `use_calib` | bool | False | Use TRT int8 calibration |
| `gpu_multi_stream` | bool | False | EnableGpuMultiStream to get larger QPS |
#### Serving model with multiple gpus.
```BASH
python3 -m paddle_serving_server.serve --model serving_model --thread 10 --port 9292 --gpu_ids 0,1,2
```
#### Serving two models.
```BASH
python3 -m paddle_serving_server.serve --model serving_model_1 serving_model_2 --thread 10 --port 9292
```
### 2. Starting with user-defined Configuration
Mostly, the flags can meet the demand. However, the model configuration files can be modified by user that include service.prototxt、workflow.prototxt、resource.prototxt、model_toolkit.prototxt、proj.conf.
Example starting with self-defined config:
```BASH
/bin/serving --flagfile=proj.conf
```
#### 2.1 proj.conf
You can provide proj.conf with lots of flags:
```
# for paddle inference
--precision=fp32
--use_calib=False
--reload_interval_s=10
# for brpc
--max_concurrency=0
--num_threads=10
--bthread_concurrency=10
--max_body_size=536870912
# default path
--inferservice_path=conf
--inferservice_file=infer_service.prototxt
--resource_path=conf
--resource_file=resource.prototxt
--workflow_path=conf
--workflow_file=workflow.prototxt
```
The table below sets out the detailed description:
| name | Default | Description |
|------|--------|------|
|precision|"fp32"|Precision Mode, support FP32, FP16, INT8|
|use_calib|False|Only for deployment with TensorRT|
|reload_interval_s|10|Reload interval|
|max_concurrency|0|Limit of request processing in parallel, 0: unlimited|
|num_threads|10|Number of brpc service thread|
|bthread_concurrency|10|Number of bthread|
|max_body_size|536870912|Max size of brpc message|
|inferservice_path|"conf"|Path of inferservice conf|
|inferservice_file|"infer_service.prototxt"|Filename of inferservice conf|
|resource_path|"conf"|Path of resource conf|
|resource_file|"resource.prototxt"|Filename of resource conf|
|workflow_path|"conf"|Path of workflow conf|
|workflow_file|"workflow.prototxt"|Filename of workflow conf|
#### 2.2 service.prototxt
To set listening port, modify service.prototxt. You can set the `--inferservice_path` and `--inferservice_file` to instruct the server to check for service.prototxt. The `service.prototxt` file provided must be a `core/configure/server_configure.protobuf:InferServiceConf`.
```
port: 8010
services {
name: "GeneralModelService"
workflows: "workflow1"
}
```
- port: Listening port.
- services: No need to modify. The workflow1 is defined in workflow.prototxt.
#### 2.3 workflow.prototxt
To server user-defined OP, modify workflow.prototxt. You can set the `--workflow_path` and `--inferservice_file` to instruct the server to check for workflow.prototxt. The `workflow.prototxt` provided must be a `core/configure/server_configure.protobuf:Workflow`.
In the blow example, you are serving model with 3 OPs. The GeneralReaderOp converts the input data to tensor. The GeneralInferOp which depends the output of GeneralReaderOp predicts the tensor. The GeneralResponseOp return the output data.
```
workflows {
name: "workflow1"
workflow_type: "Sequence"
nodes {
name: "general_reader_0"
type: "GeneralReaderOp"
}
nodes {
name: "general_infer_0"
type: "GeneralInferOp"
dependencies {
name: "general_reader_0"
mode: "RO"
}
}
nodes {
name: "general_response_0"
type: "GeneralResponseOp"
dependencies {
name: "general_infer_0"
mode: "RO"
}
}
}
```
- name: The name of workflow.
- workflow_type: "Sequence"
- nodes: A workflow consists of nodes.
- node.name: The name of node. Corresponding to node type. Ref to `python/paddle_serving_server/dag.py`
- node.type: The bound operator. Ref to OPS in `serving/op`.
- node.dependencies: The list of upstream dependent operators.
- node.dependencies.name: The name of dependent operators.
- node.dependencies.mode: RO-Read Only, RW-Read Write
#### 2.4 resource.prototxt
You may modify resource.prototxt to set the path of model files. You can set the `--resource_path` and `--resource_file` to instruct the server to check for resource.prototxt. The `resource.prototxt` provided must be a `core/configure/server_configure.proto:Workflow`.
```
model_toolkit_path: "conf"
model_toolkit_file: "general_infer_0/model_toolkit.prototxt"
general_model_path: "conf"
general_model_file: "general_infer_0/general_model.prototxt"
```
- model_toolkit_path: The diectory path of model_toolkil.prototxt.
- model_toolkit_file: The file name of model_toolkil.prototxt.
- general_model_path: The diectory path of general_model.prototxt.
- general_model_file: The file name of general_model.prototxt.
#### 2.5 model_toolkit.prototxt
The model_toolkit.prototxt specifies the parameters of predictor engines. The `model_toolkit.prototxt` provided must be a `core/configure/server_configure.proto:ModelToolkitConf`.
Example using cpu engine:
```
engines {
name: "general_infer_0"
type: "PADDLE_INFER"
reloadable_meta: "uci_housing_model/fluid_time_file"
reloadable_type: "timestamp_ne"
model_dir: "uci_housing_model"
gpu_ids: -1
enable_memory_optimization: true
enable_ir_optimization: false
use_trt: false
use_lite: false
use_xpu: false
use_gpu: false
combined_model: false
gpu_multi_stream: false
runtime_thread_num: 0
batch_infer_size: 32
enable_overrun: false
allow_split_request: true
}
```
- name: The name of engine corresponding to the node name in workflow.prototxt.
- type: Only support ”PADDLE_INFER“
- reloadable_meta: Specify the mark file of reload.
- reloadable_type: Support timestamp_ne/timestamp_gt/md5sum/revision/none
|reloadable_type|Description|
|---------------|----|
|timestamp_ne|when the mtime of reloadable_meta file changed|
|timestamp_gt|When the mtime of reloadable_meta file greater than last record|
|md5sum|No use|
|revision|No use|
- model_dir: The path of model files.
- gpu_ids: Specify the gpu ids. Support multiple device ids:
```
# GPU0,1,2
gpu_ids: 0
gpu_ids: 1
gpu_ids: 2
```
- enable_memory_optimization: Enable memory optimization.
- enable_ir_optimization: Enable ir optimization.
- use_trt: Enable Tensor RT. Need use_gpu on.
- use_lite: Enable PaddleLite.
- use_xpu: Enable KUNLUN XPU.
- use_gpu: Enbale GPU.
- combined_model: Enable combined model.
- gpu_multi_stream: Enable gpu multiple stream mode.
- runtime_thread_num: Enable Async mode when num greater than 0 and creating predictors.
- batch_infer_size: The max batch size of Async mode.
- enable_overrun: Enable over running of Async mode which means putting the whole task into the task queue.
- allow_split_request: Allow to split request task in Async mode.
#### 2.6 general_model.prototxt
The content of general_model.prototxt is same as serving_server_conf.prototxt.
```
feed_var {
name: "x"
alias_name: "x"
is_lod_tensor: false
feed_type: 1
shape: 13
}
fetch_var {
name: "fc_0.tmp_1"
alias_name: "price"
is_lod_tensor: false
fetch_type: 1
shape: 1
}
```
## Python Pipeline
Python Pipeline provides a user-friendly programming framework for multi-model composite services.
Example of config.yaml:
```YAML
#RPC port. The RPC port and HTTP port cannot be empyt at the same time. If the RPC port is empty and the HTTP port is not empty, the RPC port is automatically set to HTTP port+1.
rpc_port: 18090
#HTTP port. The RPC port and the HTTP port cannot be empty at the same time. If the RPC port is available and the HTTP port is empty, the HTTP port is not automatically generated
http_port: 9999
#worker_num, the maximum concurrency.
#When build_dag_each_worker=True, server will create processes within GRPC Server ans DAG.
#When build_dag_each_worker=False, server will set the threadpool of GRPC.
worker_num: 20
#build_dag_each_worker, False,create process with DAG;True,create process with multiple independent DAG
build_dag_each_worker: false
dag:
#True, thread model;False,process model
is_thread_op: False
#retry times
retry: 1
# True,generate the TimeLine data;False
use_profile: false
tracer:
interval_s: 10
op:
det:
#concurrency,is_thread_op=True,thread otherwise process
concurrency: 6
#Loading local server configuration without server_endpoints.
local_service_conf:
#client type,include brpc, grpc and local_predictor.
client_type: local_predictor
#det model path
model_config: ocr_det_model
#Fetch data list
fetch_list: ["concat_1.tmp_0"]
#Device ID
devices: ""
# device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
device_type: 0
#use_mkldnn
#use_mkldnn: True
#ir_optim
ir_optim: True
rec:
#concurrency,is_thread_op=True,thread otherwise process
concurrency: 3
#time out, ms
timeout: -1
#retry times
retry: 1
#Loading local server configuration without server_endpoints.
local_service_conf:
#client type,include brpc, grpc and local_predictor.
client_type: local_predictor
#rec model path
model_config: ocr_rec_model
#Fetch data list
fetch_list: ["ctc_greedy_decoder_0.tmp_0", "softmax_0.tmp_0"]
#Device ID
devices: ""
# device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
device_type: 0
#use_mkldnn
#use_mkldnn: True
#ir_optim
ir_optim: True
```
### Single-machine and multi-card inference
Single-machine multi-card inference can be abstracted into M OP processes bound to N GPU cards. It is related to the configuration of three parameters in config.yml. First, select the process mode, the number of concurrent processes is the number of processes, and devices is the GPU card ID.The binding method is to traverse the GPU card ID when the process starts, for example, start 7 OP processes, set devices:0,1,2 in config.yml, then the first, fourth, and seventh started processes are bound to the 0 card, and the second , 4 started processes are bound to 1 card, 3 and 6 processes are bound to card 2.
Reference config.yaml:
```YAML
#True, thread model;False,process model
is_thread_op: False
#concurrency,is_thread_op=True,thread otherwise process
concurrency: 7
devices: "0,1,2"
```
### Heterogeneous Devices
In addition to supporting CPU and GPU, Pipeline also supports the deployment of a variety of heterogeneous hardware. It consists of device_type and devices in config.yml. Use device_type to specify the type first, and judge according to devices when it is vacant. The device_type is described as follows:
- CPU(Intel) : 0
- GPU : 1
- TensorRT : 2
- CPU(Arm) : 3
- XPU : 4
Reference config.yaml:
```YAML
# device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
device_type: 0
devices: "" # "0,1"
```
### Low precision inference
Python Pipeline supports low-precision inference. The precision types supported by CPU, GPU and TensoRT are shown in the figure below:
- CPU
- fp32(default)
- fp16
- bf16(mkldnn)
- GPU
- fp32(default)
- fp16(TRT effects)
- int8
- Tensor RT
- fp32(default)
- fp16
- int8
```YAML
#precsion
#GPU support: "fp32"(default), "fp16(TensorRT)", "int8";
#CPU support: "fp32"(default), "fp16", "bf16"(mkldnn); not support: "int8"
precision: "fp32"
```
\ No newline at end of file
# Paddle Serving设计文档
(简体中文|[English](./DESIGN_DOC.md))
(简体中文|[English](./Serving_Design_EN.md))
## 1. 设计目标
......@@ -55,15 +55,15 @@ Paddle Serving从做顶层设计时考虑到不同团队在工业级场景中会
> 跨平台运行
跨平台是不依赖于操作系统,也不依赖硬件环境。一个操作系统下开发的应用,放到另一个操作系统下依然可以运行。因此,设计上既要考虑开发语言、组件是跨平台的,同时也要考虑不同系统上编译器的解释差异。
Docker 是一个开源的应用容器引擎,让开发者可以打包他们的应用以及依赖包到一个可移植的容器中,然后发布到任何流行的Linux机器或Windows机器上。我们将Paddle Serving框架打包了多种Docker镜像,镜像列表参考《[Docker镜像](DOCKER_IMAGES_CN.md)》,根据用户的使用场景选择镜像。为方便用户使用Docker,我们提供了帮助文档《[如何在Docker中运行PaddleServing](RUN_IN_DOCKER_CN.md)》。目前,Python webservice模式可在原生系统Linux和Windows双系统上部署运行。《[Windows平台使用Paddle Serving指导](WINDOWS_TUTORIAL_CN.md)
Docker 是一个开源的应用容器引擎,让开发者可以打包他们的应用以及依赖包到一个可移植的容器中,然后发布到任何流行的Linux机器或Windows机器上。我们将Paddle Serving框架打包了多种Docker镜像,镜像列表参考《[Docker镜像](Docker_Images_CN.md)》,根据用户的使用场景选择镜像。为方便用户使用Docker,我们提供了帮助文档《[如何在Docker中运行PaddleServing](Run_In_Dokcer_CN.md)》。目前,Python webservice模式可在原生系统Linux和Windows双系统上部署运行。《[Windows平台使用Paddle Serving指导](Windows_Tutorial_CN.md)
> 支持多种开发语言SDK
Paddle Serving提供了4种开发语言SDK,包括Python、C++、Java、Golang。Golang SDK在建设中,有兴趣的开源开发者可以提交PR。
Paddle Serving提供了3种开发语言SDK,包括Python、C++、Java。Golang SDK在建设中,有兴趣的开源开发者可以提交PR。
+ Python,参考python/examples下client示例 或 4.2 web服务示例
+ C++,参考《[从零开始写一个预测服务](CREATING.md)
+ Java,参考《[Paddle Serving Client Java SDK](JAVA_SDK_CN.md)
+ Golang,参考《[如何在Paddle Serving使用Go Client](deprecated/IMDB_GO_CLIENT_CN.md)
+ C++,参考《[从零开始写一个预测服务](C++_Serving/Creat_C++Serving_CN.md)
+ Java,参考《[Paddle Serving Client Java SDK](Java_SDK_CN.md)
> 支持多种硬件设备
......@@ -76,7 +76,7 @@ Paddle Serving提供了4种开发语言SDK,包括Python、C++、Java、Golang
以IMDB评论情感分析任务为例通过9步展示,Paddle Serving从模型的训练到部署预测服务的全流程《[AIStudio教程-Paddle Serving服务化部署框架](https://www.paddlepaddle.org.cn/tutorials/projectdetail/1555945)
由于无法直接查看模型文件中feed和fetch参数信息,不方便用户拼装参数。因此,Paddle Serving开发一个工具将Paddle模型转成Serving的格式,生成包含feed和fetch参数信息的prototxt文件。下图是uci_housing示例的生成的prototxt文件,更多转换方法参考文档《[怎样保存用于Paddle Serving的模型](SAVE_CN.md)》。
由于无法直接查看模型文件中feed和fetch参数信息,不方便用户拼装参数。因此,Paddle Serving开发一个工具将Paddle模型转成Serving的格式,生成包含feed和fetch参数信息的prototxt文件。下图是uci_housing示例的生成的prototxt文件,更多转换方法参考文档《[怎样保存用于Paddle Serving的模型](Save_CN.md)》。
```
feed_var {
name: "x"
......@@ -124,15 +124,15 @@ C++ Serving的核心执行引擎是一个有向无环图,图中的每个节点
### 3.3 模型管理与热加载
Paddle Serving的C++引擎支持模型管理功能,支持多种模型和模型不同版本的管理。为了保证在模型更换期间推理服务的可用性,需要在服务不中断的情况下对模型进行热加载。Paddle Serving对该特性进行了支持,并提供了一个监控产出模型更新本地模型的工具,具体例子请参考《[Paddle Serving中的模型热加载](HOT_LOADING_IN_SERVING_CN.md)》。
Paddle Serving的C++引擎支持模型管理功能,支持多种模型和模型不同版本的管理。为了保证在模型更换期间推理服务的可用性,需要在服务不中断的情况下对模型进行热加载。Paddle Serving对该特性进行了支持,并提供了一个监控产出模型更新本地模型的工具,具体例子请参考《[Paddle Serving中的模型热加载](C++_Serving/Hot_Loading_CN.md)》。
### 3.4 模型加解密
Paddle Serving采用对称加密算法对模型进行加密,在服务加载模型过程中在内存中解密。目前,提供基础的模型安全能力,并不保证模型绝对安全性,用户可根据我们的设计加以完善,实现更高级别的安全性。说明文档参考《[加密模型预测](ENCRYPTION_CN.md)
Paddle Serving采用对称加密算法对模型进行加密,在服务加载模型过程中在内存中解密。目前,提供基础的模型安全能力,并不保证模型绝对安全性,用户可根据我们的设计加以完善,实现更高级别的安全性。说明文档参考《[加密模型预测](C++_Serving/Encryption_CN.md)
### 3.5 A/B Test
在对模型进行充分的离线评估后,通常需要进行在线A/B测试,来决定是否大规模上线服务。下图为使用Paddle Serving做A/B测试的基本结构,Client端做好相应的配置后,自动将流量分发给不同的Server,从而完成A/B测试。具体例子请参考《[如何使用Paddle Serving做ABTEST](ABTEST_IN_PADDLE_SERVING_CN.md)》。
在对模型进行充分的离线评估后,通常需要进行在线A/B测试,来决定是否大规模上线服务。下图为使用Paddle Serving做A/B测试的基本结构,Client端做好相应的配置后,自动将流量分发给不同的Server,从而完成A/B测试。具体例子请参考《[如何使用Paddle Serving做ABTEST](C++_Serving/ABTEST_CN.md)》。
<p align="center">
<br>
......@@ -193,7 +193,7 @@ Pipeline Serving的网络框架采用gRPC和gPRC gateway。gRPC service接收RPC
</center>
### 5.2 核心设计与使用用例
Pipeline Serving核心设计是图执行引擎,基本处理单元是OP和Channel,通过组合实现一套有向无环图,设计与使用文档参考《[Pipeline Serving设计与实现](PIPELINE_SERVING_CN.md)
Pipeline Serving核心设计是图执行引擎,基本处理单元是OP和Channel,通过组合实现一套有向无环图,设计与使用文档参考《[Pipeline Serving设计与实现](Python_Pipeline/Pipeline_Design_CN.md)
<center>
<img src='images/pipeline_serving-image2.png' height = "300" align="middle"/>
</center>
......@@ -201,11 +201,8 @@ Pipeline Serving核心设计是图执行引擎,基本处理单元是OP和Chann
## 6. 未来计划
### 6.1 云端自动部署能力
为了方便用户更容易将Paddle的预测模型部署到线上,Paddle Serving在接下来的版本会提供Kubernetes生态下任务编排的工具。
### 6.2 向量检索、树结构检索
### 6.1 向量检索、树结构检索
在推荐与广告场景的召回系统中,通常需要采用基于向量的快速检索或者基于树结构的快速检索,Paddle Serving会对这方面的检索引擎进行集成或扩展。
### 6.3 服务监控
### 6.2 服务监控
集成普罗米修斯监控,一套开源的监控&报警&时间序列数据库的组合,适合k8s和docker的监控系统。
# Paddle Serving Design Doc
([简体中文](./DESIGN_DOC_CN.md)|English)
([简体中文](./Serving_Design_CN.md)|English)
## 1. Design Objectives
......@@ -53,16 +53,16 @@ Paddle Serving takes into account a series of issues such as different operating
Cross-platform is not dependent on the operating system, nor on the hardware environment. Applications developed under one operating system can still run under another operating system. Therefore, the design should consider not only the development language and the cross-platform components, but also the interpretation differences of the compilers on different systems.
Docker is an open source application container engine that allows developers to package their applications and dependencies into a portable container, and then publish it to any popular Linux machine or Windows machine. We have packaged a variety of Docker images for the Paddle Serving framework. Refer to the image list《[Docker Images](DOCKER_IMAGES.md)》, Select mirrors according to user's usage. We provide Docker usage documentation《[How to run PaddleServing in Docker](RUN_IN_DOCKER.md)》.Currently, the Python webservice mode can be deployed and run on the native Linux and Windows dual systems.《[Paddle Serving for Windows Users](WINDOWS_TUTORIAL.md)
Docker is an open source application container engine that allows developers to package their applications and dependencies into a portable container, and then publish it to any popular Linux machine or Windows machine. We have packaged a variety of Docker images for the Paddle Serving framework. Refer to the image list《[Docker Images](Docker_Images_EN.md)》, Select mirrors according to user's usage. We provide Docker usage documentation《[How to run PaddleServing in Docker](Run_In_Docker_EN.md)》.Currently, the Python webservice mode can be deployed and run on the native Linux and Windows dual systems.《[Paddle Serving for Windows Users](Windows_Tutorial_EN.md)
> Support multiple development languages client ​​SDKs
Paddle Serving provides 4 development language client SDKs, including Python, C++, Java, and Golang. Golang SDK is under construction, We hope that interested open source developers can help submit PR.
Paddle Serving provides 3 development language client SDKs, including Python, C++, Java, we hope that interested open source developers can help submit PR.
+ Python, Refer to the client example under python/examples or 4.2 web service example.
+ C++, Refer to《[从零开始写一个预测服务](CREATING.md)
+ Java, Refer to《[Paddle Serving Client Java SDK](JAVA_SDK.md)
+ Golang, Refer to《[How to use Go Client of Paddle Serving](deprecated/IMDB_GO_CLIENT.md)
+ C++, Refer to《[从零开始写一个预测服务](C++_Serving/Creat_C++Serving_CN.md)
+ Java, Refer to《[Paddle Serving Client Java SDK](Java_SDK_EN.md)
> Support multiple hardware devices
......@@ -72,7 +72,7 @@ The inference framework of the well-known deep learning platform only supports C
Models trained on other deep learning platforms can be passed《[PaddlePaddle/X2Paddle工具](https://github.com/PaddlePaddle/X2Paddle)》.We convert multiple mainstream CV models to Paddle models. TensorFlow, Caffe, ONNX, PyTorch model conversion is tested.《[AIStudio教程-Paddle Serving服务化部署框架](https://www.paddlepaddle.org.cn/tutorials/projectdetail/1555945)
Because it is impossible to directly view the feed and fetch parameter information in the model file, it is not convenient for users to assemble the parameters. Therefore, Paddle Serving developed a tool to convert the Paddle model into Serving format and generate a prototxt file containing feed and fetch parameter information. The following figure is the generated prototxt file of the uci_housing example. For more conversion methods, refer to the document《[How to save a servable model of Paddle Serving?](SAVE.md)》.
Because it is impossible to directly view the feed and fetch parameter information in the model file, it is not convenient for users to assemble the parameters. Therefore, Paddle Serving developed a tool to convert the Paddle model into Serving format and generate a prototxt file containing feed and fetch parameter information. The following figure is the generated prototxt file of the uci_housing example. For more conversion methods, refer to the document《[How to save a servable model of Paddle Serving?](Save_EN.md)》.
```
feed_var {
name: "x"
......@@ -121,14 +121,14 @@ The core execution engine of Paddle Serving is a Directed acyclic graph(DAG). In
<p>
### 3.3 Model Management and Hot Reloading
C++ Serving supports model management functions, including management of multiple models and multiple model versions.In order to ensure the availability of services, the model needs to be hot loaded without service interruption. Paddle Serving supports this feature and provides a tool for monitoring output models to update local models. Please refer to [Hot loading in Paddle Serving](HOT_LOADING_IN_SERVING.md) for specific examples.
C++ Serving supports model management functions, including management of multiple models and multiple model versions.In order to ensure the availability of services, the model needs to be hot loaded without service interruption. Paddle Serving supports this feature and provides a tool for monitoring output models to update local models. Please refer to [Hot loading in Paddle Serving](C++_Serving/Hot_Loading_EN.md) for specific examples.
### 3.4 MOEDL ENCRYPTION INFERENCE
Paddle Serving uses a symmetric encryption algorithm to encrypt the model, and decrypts it in memory during the service loading model. At present, providing basic model security capabilities does not guarantee absolute model security. Users can improve them according to our design to achieve a higher level of security. Documentation reference《[MOEDL ENCRYPTION INFERENCE](ENCRYPTION.md)
Paddle Serving uses a symmetric encryption algorithm to encrypt the model, and decrypts it in memory during the service loading model. At present, providing basic model security capabilities does not guarantee absolute model security. Users can improve them according to our design to achieve a higher level of security. Documentation reference《[MOEDL ENCRYPTION INFERENCE](C++_Serving/Encryption_EN.md)
### 3.5 A/B Test
After sufficient offline evaluation of the model, online A/B test is usually needed to decide whether to enable the service on a large scale. The following figure shows the basic structure of A/B test with Paddle Serving. After the client is configured with the corresponding configuration, the traffic will be automatically distributed to different servers to achieve A/B test. Please refer to [ABTEST in Paddle Serving](ABTEST_IN_PADDLE_SERVING.md) for specific examples.
After sufficient offline evaluation of the model, online A/B test is usually needed to decide whether to enable the service on a large scale. The following figure shows the basic structure of A/B test with Paddle Serving. After the client is configured with the corresponding configuration, the traffic will be automatically distributed to different servers to achieve A/B test. Please refer to [ABTEST in Paddle Serving](C++_Serving/ABTEST_EN.md) for specific examples.
<p align="center">
<br>
......@@ -193,7 +193,7 @@ The network framework of Pipeline Serving uses gRPC and gPRC gateway. The gRPC s
### 5.2 Core Design And Use Cases
The core design of Pipeline Serving is a graph execution engine, and the basic processing units are OP and Channel. A set of directed acyclic graphs can be realized through combination. Reference for design and use documents《[Pipeline Serving](PIPELINE_SERVING.md)
The core design of Pipeline Serving is a graph execution engine, and the basic processing units are OP and Channel. A set of directed acyclic graphs can be realized through combination. Reference for design and use documents《[Pipeline Serving](Python_Pipeline/Pipeline_Design_EN.md)
<center>
<img src='images/pipeline_serving-image2.png' height = "300" align="middle"/>
......
## Paddle Serving uses TensorRT
(English|[简体中文](./TENSOR_RT_CN.md))
### Background
Deploying models trained on mainstream frameworks through the tensorRT tool launched by Nvidia can greatly increase the speed of model inference, which is often at least 1 times faster than the original framework, and it also takes up more device memory. less. Therefore, it is very useful for all users who need to deploy models to master the method of deploying deep learning models with tensorRT. Paddle Serving provides comprehensive TensorRT ecological support.
### surroundings
Serving Cuda10.1 Cuda10.2 and Cuda11 versions support TensorRT.
#### Install Paddle
In [Development using Docker environment](./RUN_IN_DOCKER.md) and [Docker image list](./DOCKER_IMAGES.md), we give the development image of TensorRT. After using the mirror to start, you need to install the Paddle whl package that supports TensorRT, refer to the documentation on the home page
```
# GPU Cuda10.2 environment please execute
pip install paddlepaddle-gpu==2.0.0
```
**Note**: If your Cuda version is not 10.2, please do not execute the above commands directly, you need to refer to [Paddle official documentation-multi-version whl package list
](https://www.paddlepaddle.org.cn/documentation/docs/en/install/Tables_en.html#multi-version-whl-package-list-release)
Select the URL link of the corresponding GPU environment and install it. For example, for Python2.7 users of Cuda 10.1, please select `cp27-cp27mu` and
`cuda10.1-cudnn7.6-trt6.0.1.5` corresponding url, copy it and execute
```
pip install https://paddle-wheel.bj.bcebos.com/with-trt/2.0.0-gpu-cuda10.1-cudnn7-mkl/paddlepaddle_gpu-2.0.0.post101-cp27-cp27mu-linux_x86_64.whl
```
Since the default `paddlepaddle-gpu==2.0.0` is Cuda 10.2 and TensorRT is not built, if you need to use TensorRT on `paddlepaddle-gpu`, you need to find `cuda10 in the above multi-version whl package list .2-cudnn8.0-trt7.1.3`, download the corresponding Python version.
#### Install Paddle Serving
```
# Cuda10.2
pip install paddle-server-server==${VERSION}.post102
# Cuda 10.1
pip install paddle-server-server==${VERSION}.post101
# Cuda 11
pip install paddle-server-server==${VERSION}.post11
```
### Use TensorRT
#### RPC mode
In [Serving model example](../python/examples), we have given models that can be accelerated using TensorRT, such as [Faster_RCNN model](../python/examples/detection/faster_rcnn_r50_fpn_1x_coco) under detection
We just need
```
wget --no-check-certificate https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/faster_rcnn_r50_fpn_1x_coco.tar
tar xf faster_rcnn_r50_fpn_1x_coco.tar
python -m paddle_serving_server.serve --model serving_server --port 9494 --gpu_ids 0 --use_trt
```
The TensorRT version of the faster_rcnn model server is started
#### Local Predictor mode
In [local_predictor](../python/paddle_serving_app/local_predict.py#L52), users can explicitly specify `use_trt=True` and pass it to `load_model_config`.
Other methods are no different from other Local Predictor methods, and you need to pay attention to the compatibility of the model with TensorRT.
#### Pipeline Mode
In [Pipeline mode](./PIPELINE_SERVING.md), our [imagenet example](../python/examples/pipeline/imagenet/config.yml#L23) gives the way to set TensorRT.
## Paddle Serving 使用 TensorRT
([English](./TENSOR_RT.md)|简体中文)
### 背景
通过Nvidia推出的tensorRT工具来部署主流框架上训练的模型能够极大的提高模型推断的速度,往往相比与原本的框架能够有至少1倍以上的速度提升,同时占用的设备内存也会更加的少。因此对是所有需要部署模型的用户来说,掌握用tensorRT来部署深度学习模型的方法是非常有用的。Paddle Serving提供了全面的TensorRT生态支持。
### 环境
Serving 的Cuda10.1 Cuda10.2和Cuda11版本支持TensorRT。
#### 安装Paddle
[使用Docker环境开发](./RUN_IN_DOCKER_CN.md)[Docker镜像列表](./DOCKER_IMAGES_CN.md)当中,我们给出了TensorRT的开发镜像。使用镜像启动之后,需要安装支持TensorRT的Paddle whl包,参考首页的文档
```
# GPU Cuda10.2环境请执行
pip install paddlepaddle-gpu==2.0.0
```
**注意**: 如果您的Cuda版本不是10.2,请勿直接执行上述命令,需要参考[Paddle官方文档-多版本whl包列表
](https://www.paddlepaddle.org.cn/documentation/docs/zh/install/Tables.html#whl-release)
选择相应的GPU环境的url链接并进行安装,例如Cuda 10.1的Python2.7用户,请选择表格当中的`cp27-cp27mu`
`cuda10.1-cudnn7.6-trt6.0.1.5`对应的url,复制下来并执行
```
pip install https://paddle-wheel.bj.bcebos.com/with-trt/2.0.0-gpu-cuda10.1-cudnn7-mkl/paddlepaddle_gpu-2.0.0.post101-cp27-cp27mu-linux_x86_64.whl
```
由于默认的`paddlepaddle-gpu==2.0.0`是Cuda 10.2,并没有联编TensorRT,因此如果需要和在`paddlepaddle-gpu`上使用TensorRT,需要在上述多版本whl包列表当中,找到`cuda10.2-cudnn8.0-trt7.1.3`,下载对应的Python版本。
#### 安装Paddle Serving
```
# Cuda10.2
pip install paddle-server-server==${VERSION}.post102
# Cuda 10.1
pip install paddle-server-server==${VERSION}.post101
# Cuda 11
pip install paddle-server-server==${VERSION}.post11
```
### 使用TensorRT
#### RPC模式
[Serving模型示例](../python/examples)当中,我们有给出可以使用TensorRT加速的模型,例如detection下的[Faster_RCNN模型](../python/examples/detection/faster_rcnn_r50_fpn_1x_coco)
我们只需
```
wget --no-check-certificate https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/faster_rcnn_r50_fpn_1x_coco.tar
tar xf faster_rcnn_r50_fpn_1x_coco.tar
python -m paddle_serving_server.serve --model serving_server --port 9494 --gpu_ids 0 --use_trt
```
TensorRT版本的faster_rcnn模型服务端就启动了
#### Local Predictor模式
[local_predictor](../python/paddle_serving_app/local_predict.py#L52)当中,用户可以显式制定`use_trt=True`传入到`load_model_config`当中。
其他方式和其他Local Predictor使用方法没有区别,需要注意模型对TensorRT的兼容性。
#### Pipeline模式
[Pipeline模式](./PIPELINE_SERVING_CN.md)当中,我们的[imagenet例子](../python/examples/pipeline/imagenet/config.yml#L23)给出了设置TensorRT的方式。
## Windows平台使用Paddle Serving指导
([English](./WINDOWS_TUTORIAL.md)|简体中文)
([English](./Windows_Turtial_EN.md)|简体中文)
### 综述
......@@ -97,7 +97,7 @@ r = requests.post(url=url, headers=headers, data=json.dumps(data))
print(r.json())
```
用户只需要按照如上指示,在对应函数中实现相关内容即可。更多信息请参见[如何开发一个新的Web Service?](./NEW_WEB_SERVICE_CN.md)
用户只需要按照如上指示,在对应函数中实现相关内容即可。更多信息请参见[如何开发一个新的Web Service?](./C++_Serving/Http_Service_CN.md)
开发完成后执行
......
## Paddle Serving for Windows Users
(English|[简体中文](./WINDOWS_TUTORIAL_CN.md))
(English|[简体中文](./Windows_Tutorial_CN.md))
### Summary
......@@ -97,7 +97,7 @@ r = requests.post(url=url, headers=headers, data=json.dumps(data))
print(r.json())
```
The user only needs to follow the above instructions and implement the relevant content in the corresponding function. For more information, please refer to [How to develop a new Web Service? ](./NEW_WEB_SERVICE.md)
The user only needs to follow the above instructions and implement the relevant content in the corresponding function. For more information, please refer to [How to develop a new Web Service? ](./C++_Serving/Http_Service_EN.md)
Execute after development
......
# C++ Serving Design
([简体中文](./C++DESIGN_CN.md)|English)
## 1. Background
PaddlePaddle is the Baidu's open source machine learning framework, which supports a wide range of customized development of deep learning models; Paddle serving is the online prediction framework of Paddle, which seamlessly connects with Paddle model training, and provides cloud services for machine learning prediction. This article will describe the Paddle Serving design from the bottom up, from the model, service, and access levels.
1. The model is the core of Paddle Serving prediction, including the management of model data and inference calculations;
2. Prediction framework encapsulation model for inference calculations, providing external RPC interface to connect different upstream
3. The prediction service SDK provides a set of access frameworks
The result is a complete serving solution.
## 2. Terms explanation
- **baidu-rpc**: Baidu's official open source RPC framework, supports multiple common communication protocols, and provides a custom interface experience based on protobuf
- **Variant**: Paddle Serving architecture is an abstraction of a minimal prediction cluster, which is characterized by all internal instances (replicas) being completely homogeneous and logically corresponding to a fixed version of a model
- **Endpoint**: Multiple Variants form an Endpoint. Logically, Endpoint represents a model, and Variants within the Endpoint represent different versions.
- **OP**: PaddlePaddle is used to encapsulate a numerical calculation operator, Paddle Serving is used to represent a basic business operation operator, and the core interface is inference. OP configures its dependent upstream OP to connect multiple OPs into a workflow
- **Channel**: An abstraction of all request-level intermediate data of the OP; data exchange between OPs through Channels
- **Bus**: manages all channels in a thread, and schedules the access relationship between the two sets of OP and Channel according to the DAG dependency graph between DAGs
- **Stage**: Workflow according to the topology diagram described by DAG, a collection of OPs that belong to the same link and can be executed in parallel
- **Node**: An OP operator instance composed of an OP operator class combined with parameter configuration, which is also an execution unit in Workflow
- **Workflow**: executes the inference interface of each OP in order according to the topology described by DAG
- **DAG/Workflow**: consists of several interdependent Nodes. Each Node can obtain the Request object through a specific interface. The node Op obtains the output object of its pre-op through the dependency relationship. The output of the last Node is the Response object by default.
- **Service**: encapsulates a pv request, can configure several Workflows, reuse the current PV's Request object with each other, and then execute each in parallel/serial execution, and finally write the Response to the corresponding output slot; a Paddle-serving process Multiple sets of Service interfaces can be configured. The upstream determines the Service interface currently accessed based on the ServiceName.
## 3. Python Interface Design
### 3.1 Core Targets:
A set of Paddle Serving dynamic library, support the remote estimation service of the common model saved by Paddle, and call the various underlying functions of PaddleServing through the Python Interface.
### 3.2 General Model:
Models that can be predicted using the Paddle Inference Library, models saved during training, including Feed Variable and Fetch Variable
### 3.3 Overall design:
- The user starts the Client and Server through the Python Client. The Python API has a function to check whether the interconnection and the models to be accessed match.
- The Python API calls the pybind corresponding to the client and server functions implemented by Paddle Serving, and the information transmitted through RPC is implemented through RPC.
- The Client Python API currently has two simple functions, load_inference_conf and predict, which are used to perform loading of the model to be predicted and prediction, respectively.
- The Server Python API is mainly responsible for loading the inference model and generating various configurations required by Paddle Serving, including engines, workflow, resources, etc.
### 3.4 Server Inferface
![Server Interface](images/server_interface.png)
### 3.5 Client Interface
<img src='images/client_inferface.png' width = "600" height = "200">
### 3.6 Client io used during Training
PaddleServing is designed to saves the model interface that can be used during the training process, which is basically the same as the Paddle save inference model interface, feed_var_dict and fetch_var_dict
You can alias the input and output variables. The configuration that needs to be read when the serving starts is saved in the client and server storage directories.
``` python
def save_model(server_model_folder,
client_config_folder,
feed_var_dict,
fetch_var_dict,
main_program=None)
```
## 4. Paddle Serving Underlying Framework
![Paddle-Serging Overall Architecture](images/framework.png)
**Model Management Framework**: Connects model files of multiple machine learning platforms and provides a unified inference interface
**Business Scheduling Framework**: Abstracts the calculation logic of various different inference models, provides a general DAG scheduling framework, and connects different operators through DAG diagrams to complete a prediction service together. This abstract model allows users to conveniently implement their own calculation logic, and at the same time facilitates operator sharing. (Users build their own forecasting services. A large part of their work is to build DAGs and provide operators.)
**Predict Service**: Encapsulation of the externally provided prediction service interface. Define communication fields with the client through protobuf.
### 4.1 Model Management Framework
The model management framework is responsible for managing the models trained by the machine learning framework. It can be abstracted into three levels: model loading, model data, and model reasoning.
#### Model Loading
Load model from disk to memory, support multi-version, hot-load, incremental update, etc.
#### Model data
Model data structure in memory, integrated fluid inference lib
#### inferencer
it provided united inference interface for upper layers
```C++
class FluidFamilyCore {
virtual bool Run(const void* in_data, void* out_data);
virtual int create(const std::string& data_path);
virtual int clone(void* origin_core);
};
```
### 4.2 Business Scheduling Framework
#### 4.2.1 Inference Service
With reference to the abstract idea of model calculation of the TensorFlow framework, the business logic is abstracted into a DAG diagram, driven by configuration, generating a workflow, and skipping C ++ code compilation. Each specific step of the service corresponds to a specific OP. The OP can configure the upstream OP that it depends on. Unified message passing between OPs is achieved by the thread-level bus and channel mechanisms. For example, the service process of a simple prediction service can be abstracted into 3 steps including reading request data-> calling the prediction interface-> writing back the prediction result, and correspondingly implemented to 3 OP: ReaderOp-> ClassifyOp-> WriteOp
![Infer Service](images/predict-service.png)
Regarding the dependencies between OPs, and the establishment of workflows through OPs, you can refer to [从零开始写一个预测服务](CREATING.md) (simplified Chinese Version)
Server instance perspective
![Server instance perspective](images/server-side.png)
#### 4.2.2 Paddle Serving Multi-Service Mechanism
![Paddle Serving multi-service](images/multi-service.png)
Paddle Serving instances can load multiple models at the same time, and each model uses a Service (and its configured workflow) to undertake services. You can refer to [service configuration file in Demo example](../tools/cpp_examples/demo-serving/conf/service.prototxt) to learn how to configure multiple services for the serving instance
#### 4.2.3 Hierarchical relationship of business scheduling
From the client's perspective, a Paddle Serving service can be divided into three levels: Service, Endpoint, and Variant from top to bottom.
![Call hierarchy relationship](images/multi-variants.png)
One Service corresponds to one inference model, and there is one endpoint under the model. Different versions of the model are implemented through multiple variant concepts under endpoint:
The same model prediction service can configure multiple variants, and each variant has its own downstream IP list. The client code can configure relative weights for each variant to achieve the relationship of adjusting the traffic ratio (refer to the description of variant_weight_list in [Client Configuration](CLIENT_CONFIGURE.md) section 3.2).
![Client-side proxy function](images/client-side-proxy.png)
## 5. User Interface
Under the premise of meeting certain interface specifications, the service framework does not make any restrictions on user data fields to meet different business interfaces of various forecast services. Baidu-rpc inherits the interface of Protobuf serice, and the user describes the Request and Response business interfaces according to the Protobuf syntax specification. Paddle Serving is built on the Baidu-rpc framework and supports this feature by default.
No matter how the communication protocol changes, the framework only needs to ensure that the communication protocol between the client and server and the format of the business data are synchronized to ensure normal communication. This information can be broken down as follows:
-Protocol: Header information agreed in advance between Server and Client to ensure mutual recognition of data format. Paddle Serving uses Protobuf as the basic communication format
-Data: Used to describe the interface of Request and Response, such as the sample data to be predicted, and the score returned by the prediction. include:
   -Data fields: Field definitions included in the two data structures of Request and Return.
   -Description interface: similar to the protocol interface, it supports Protobuf by default
### 5.1 Data Compression Method
Baidu-rpc has built-in data compression methods such as snappy, gzip, zlib, which can be configured in the configuration file (refer to [Client Configuration](CLIENT_CONFIGURE.md) Section 3.1 for an introduction to compress_type)
### 5.2 C ++ SDK API Interface
```C++
class PredictorApi {
public:
int create(const char* path, const char* file);
int thrd_initialize();
int thrd_clear();
int thrd_finalize();
void destroy();
Predictor* fetch_predictor(std::string ep_name);
int free_predictor(Predictor* predictor);
};
class Predictor {
public:
// synchronize interface
virtual int inference(google::protobuf::Message* req,
google::protobuf::Message* res) = 0;
// asynchronize interface
virtual int inference(google::protobuf::Message* req,
google::protobuf::Message* res,
DoneType done,
brpc::CallId* cid = NULL) = 0;
// synchronize interface
virtual int debug(google::protobuf::Message* req,
google::protobuf::Message* res,
butil::IOBufBuilder* debug_os) = 0;
};
```
### 5.3 Inferfaces related to Op
```C++
class Op {
// ------Getters for Channel/Data/Message of dependent OP-----
// Get the Channel object of dependent OP
Channel* mutable_depend_channel(const std::string& op);
// Get the Channel object of dependent OP
const Channel* get_depend_channel(const std::string& op) const;
template <typename T>
T* mutable_depend_argument(const std::string& op);
template <typename T>
const T* get_depend_argument(const std::string& op) const;
// -----Getters for Channel/Data/Message of current OP----
// Get pointer to the progobuf message of current OP
google::protobuf::Message* mutable_message();
// Get pointer to the protobuf message of current OP
const google::protobuf::Message* get_message() const;
// Get the template class data object of current OP
template <typename T>
T* mutable_data();
// Get the template class data object of current OP
template <typename T>
const T* get_data() const;
// ---------------- Other base class members ----------------
int init(Bus* bus,
Dag* dag,
uint32_t id,
const std::string& name,
const std::string& type,
void* conf);
int deinit();
int process(bool debug);
// Get the input object
const google::protobuf::Message* get_request_message();
const std::string& type() const;
uint32_t id() const;
// ------------------ OP Interface -------------------
// Get the derived Channel object of current OP
virtual Channel* mutable_channel() = 0;
// Get the derived Channel object of current OP
virtual const Channel* get_channel() const = 0;
// Release the derived Channel object of current OP
virtual int release_channel() = 0;
// Inference interface
virtual int inference() = 0;
// ------------------ Conf Interface -------------------
virtual void* create_config(const configure::DAGNode& conf) { return NULL; }
virtual void delete_config(void* conf) {}
virtual void set_config(void* conf) { return; }
// ------------------ Metric Interface -------------------
virtual void regist_metric() { return; }
};
```
### 5.4 Interfaces related to framework
Service
```C++
class InferService {
public:
static const char* tag() { return "service"; }
int init(const configure::InferService& conf);
int deinit() { return 0; }
int reload();
const std::string& name() const;
const std::string& full_name() const { return _infer_service_format; }
// Execute each workflow serially
virtual int inference(const google::protobuf::Message* request,
google::protobuf::Message* response,
butil::IOBufBuilder* debug_os = NULL);
int debug(const google::protobuf::Message* request,
google::protobuf::Message* response,
butil::IOBufBuilder* debug_os);
};
class ParallelInferService : public InferService {
public:
// Execute workflows in parallel
int inference(const google::protobuf::Message* request,
google::protobuf::Message* response,
butil::IOBufBuilder* debug_os) {
return 0;
}
};
```
ServerManager
```C++
class ServerManager {
public:
typedef google::protobuf::Service Service;
ServerManager();
static ServerManager& instance() {
static ServerManager server;
return server;
}
static bool reload_starting() { return _s_reload_starting; }
static void stop_reloader() { _s_reload_starting = false; }
int add_service_by_format(const std::string& format);
int start_and_wait();
};
```
DAG
```C++
class Dag {
public:
EdgeMode parse_mode(std::string& mode); // NOLINT
int init(const char* path, const char* file, const std::string& name);
int init(const configure::Workflow& conf, const std::string& name);
int deinit();
uint32_t nodes_size();
const DagNode* node_by_id(uint32_t id);
const DagNode* node_by_id(uint32_t id) const;
const DagNode* node_by_name(std::string& name); // NOLINT
const DagNode* node_by_name(const std::string& name) const;
uint32_t stage_size();
const DagStage* stage_by_index(uint32_t index);
const std::string& name() const { return _dag_name; }
const std::string& full_name() const { return _dag_name; }
void regist_metric(const std::string& service_name);
};
```
Workflow
```C++
class Workflow {
public:
Workflow() {}
static const char* tag() { return "workflow"; }
// Each workflow object corresponds to an independent
// configure file, so you can share the object between
// different apps.
int init(const configure::Workflow& conf);
DagView* fetch_dag_view(const std::string& service_name);
int deinit() { return 0; }
void return_dag_view(DagView* view);
int reload();
const std::string& name() { return _name; }
const std::string& full_name() { return _name; }
};
```
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
# FAQ
## 1. 如何修改端口配置?
使用该框架搭建的服务需要申请一个端口,可以通过以下方式修改端口号:
- 如果在inferservice_file里指定了port:xxx,那么就去申请该端口号;
- 否则,如果在gflags.conf里指定了--port:xxx,那就去申请该端口号;
- 否则,使用程序里指定的默认端口号:8010。
## 2. GPU预测中为何请求的响应时间波动会非常大?
PaddleServing依托PaddlePaddle预测库执行预测计算;在GPU设备上,由于同一个进程内目前共用1个GPU stream,进程内的多个请求的预测计算会被严格串行。所以如果有2个请求同时到达某个Serving实例,不管该实例启动时创建了多少个worker线程,都不能起到加速作用,后到的请求会被排队,直到前面请求计算完成。
## 3. 如何充分利用GPU卡的计算能力?
如问题2所说,由于预测库的限制,单个Serving进程只能绑定单张GPU卡,且进程内共用1个GPU stream,所有请求必须串行计算。
为提高GPU卡使用率,目前可以想到的方法是:在单张GPU卡上启动多个Serving进程,每个进程绑定一个GPU stream,多个stream并行计算。这种方法是否能起到加速作用,受限于多个因素,主要有:
1. 单个stream占用GPU算力;假如单个stream已经将GPU算力占用超过50%,那么增加stream很可能会导致2个stream的job分别排队,拖慢各自的响应时间
2. GPU显存:Serving进程需要将模型参数加载到显存中,并且计算时要在GPU显存池分配临时变量;假如单个Serving进程已经用掉超过50%的显存,则增加Serving进程会造成显存不足,导致进程报错退出
为此,可采用如下步骤,进行测试:
1. 加载模型时,在model_toolkit.prototxt中,model type选择FLUID_GPU_ANALYSIS或FLUID_GPU_ANALYSIS_DIR;会对模型进行静态分析,进行一定程度显存优化
2. 在步骤1完成后,启动单个Serving进程,启动参数:`--gpuid=N --bthread_concurrency=4 --bthread_min_concurrency=4`;启动一个client,进行并发度为1的压力测试,batch size从小到大,记下平响;由于算力的限制,当batch size增大到一定程度,应该会出现响应时间明显变大;或虽然没有明显变大,但已经不满足系统需求
3. 再启动1个Serving进程,与步骤2启动时使用相同的参数略有不同: `--gpuid=N --bthread_concurrency=4 --bthread_min_concurrency=4 --port=8011` 其中--port=8011用来让新启动的进程使用一个新的服务端口;然后同时对这2个Serving进程进行压测,继续观察batch size从小到大时平均响应时间的变化,直到取得batch size和响应时间的折中
4. 重复步骤2-3
5. 以2-4步的测试,来决定:单张GPU卡可以由多少个Serving进程共用; 实际部署时,就在一张GPU卡上启动这么多个Serving进程同时提供服务
此差异已折叠。
## Image Classification
([简体中文](./README_CN.md)|English)
The example uses the ResNet50_vd model to perform the imagenet 1000 classification task.
### Get model config and sample dataset
```
sh get_model.sh
```
### Install preprocess module
```
pip3 install paddle_serving_app
```
### Inference Service(Support BRPC-Client/GRPC-Client/Http-Client)
launch server side
```
python3 -m paddle_serving_server.serve --model ResNet50_vd_model --port 9696 #cpu inference service
```
```
python3 -m paddle_serving_server.serve --model ResNet50_vd_model --port 9696 --gpu_ids 0 #gpu inference service
```
### BRPC-Client
client send inference request
```
python3 resnet50_rpc_client.py ResNet50_vd_client_config/serving_client_conf.prototxt
```
*the port of server side in this example is 9696
### GRPC-Client/Http-Client
client send inference request
```
python3 resnet50_http_client.py ResNet50_vd_client_config/serving_client_conf.prototxt
```
此差异已折叠。
此差异已折叠。
此差异已折叠。
wget --no-check-certificate https://paddle-serving.bj.bcebos.com/imagenet-example/ResNet50_vd.tar.gz
tar -xzvf ResNet50_vd.tar.gz
wget --no-check-certificate https://paddle-serving.bj.bcebos.com/imagenet-example/ResNet101_vd.tar.gz
tar -xzvf ResNet101_vd.tar.gz
wget --no-check-certificate https://paddle-serving.bj.bcebos.com/imagenet-example/image_data.tar.gz
tar -xzvf image_data.tar.gz
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册