未验证 提交 511ac38c 编写于 作者: T TeslaZhao 提交者: GitHub

Merge pull request #922 from TeslaZhao/v0.4.0

Merge pull request #921 from TeslaZhao/develop
...@@ -7,14 +7,46 @@ Paddle Serving is usually used for the deployment of single model, but the end-t ...@@ -7,14 +7,46 @@ Paddle Serving is usually used for the deployment of single model, but the end-t
Paddle Serving provides a user-friendly programming framework for multi-model composite services, Pipeline Serving, which aims to reduce the threshold of programming, improve resource utilization (especially GPU), and improve the prediction efficiency. Paddle Serving provides a user-friendly programming framework for multi-model composite services, Pipeline Serving, which aims to reduce the threshold of programming, improve resource utilization (especially GPU), and improve the prediction efficiency.
## Architecture Design ## Architecture Design
The Server side is built based on gRPC and graph execution engine. The relationship between them is shown in the following figure. The Server side is built based on <b>RPC Service</b> and <b>graph execution engine</b>. The relationship between them is shown in the following figure.
<center> <center>
<img src='pipeline_serving-image1.png' height = "250" align="middle"/> <img src='pipeline_serving-image1.png' height = "250" align="middle"/>
</center> </center>
### Graph Execution Engine
### 1. RPC Service
In order to meet the needs of different users, the RPC service starts one Web server and one RPC server at the same time, and can process 2 types of requests, RESTful API and gRPC.The gPRC gateway receives RESTful API requests and forwards requests to the gRPC server through the reverse proxy server; gRPC requests are received by the gRPC server, so the two types of requests are processed by the gRPC Service in a unified manner to ensure that the processing logic is consistent.
#### <b>1.1 Request and Respose of proto
gRPC service and gRPC gateway service are generated with service.proto.
```proto
message Request {
repeated string key = 1;
repeated string value = 2;
optional string name = 3;
optional string method = 4;
optional int64 logid = 5;
optional string clientip = 6;
};
message Response {
optional int32 err_no = 1;
optional string err_msg = 2;
repeated string key = 3;
repeated string value = 4;
};
```
The `key` and `value` in the Request are paired string arrays. The `name` and `method` correspond to the URL of the RESTful API://{ip}:{port}/{name}/{method}.The `logid` and `clientip` are convenient for users to connect service-level requests and customize strategies.
In Response, `err_no` and `err_msg` express the correctness and error information of the processing result, and `key` and `value` are the returned results.
### 2. Graph Execution Engine
The graph execution engine consists of OPs and Channels, and the connected OPs share one Channel. The graph execution engine consists of OPs and Channels, and the connected OPs share one Channel.
...@@ -28,7 +60,7 @@ The graph execution engine consists of OPs and Channels, and the connected OPs s ...@@ -28,7 +60,7 @@ The graph execution engine consists of OPs and Channels, and the connected OPs s
</center> </center>
### OP Design #### <b>2.1 OP Design</b>
- The default function of a single OP is to access a single Paddle Serving Service based on the input Channel data and put the result into the output Channel. - The default function of a single OP is to access a single Paddle Serving Service based on the input Channel data and put the result into the output Channel.
- OP supports user customization, including preprocess, process, postprocess functions that can be inherited and implemented by the user. - OP supports user customization, including preprocess, process, postprocess functions that can be inherited and implemented by the user.
...@@ -36,7 +68,7 @@ The graph execution engine consists of OPs and Channels, and the connected OPs s ...@@ -36,7 +68,7 @@ The graph execution engine consists of OPs and Channels, and the connected OPs s
- OP can obtain data from multiple different RPC requests for Auto-Batching. - OP can obtain data from multiple different RPC requests for Auto-Batching.
- OP can be started by a thread or process. - OP can be started by a thread or process.
### Channel Design #### <b>2.2 Channel Design</b>
- Channel is the data structure for sharing data between OPs, responsible for sharing data or sharing data status information. - Channel is the data structure for sharing data between OPs, responsible for sharing data or sharing data status information.
- Outputs from multiple OPs can be stored in the same Channel, and data from the same Channel can be used by multiple OPs. - Outputs from multiple OPs can be stored in the same Channel, and data from the same Channel can be used by multiple OPs.
...@@ -47,8 +79,17 @@ The graph execution engine consists of OPs and Channels, and the connected OPs s ...@@ -47,8 +79,17 @@ The graph execution engine consists of OPs and Channels, and the connected OPs s
</center> </center>
#### <b>2.3 client type design</b>
### Extreme Case Consideration - Prediction type (client_type) of Op has 3 types, brpc, grpc and local_predictor
- brpc: Using bRPC Client to interact with remote Serving by network, performance is better than grpc.
- grpc: Using gRPC Client to interact with remote Serving by network, cross-platform deployment supported.
- local_predictor: Load the model and predict in the local service without interacting with the network. Support multi-card deployment, and TensorRT prediction.
- Selection:
- Time cost(lower is better): local_predict < brpc <= grpc
- Microservice: Split the brpc or grpc model into independent services, simplify development and deployment complexity, and improve resource utilization
#### <b>2.4 Extreme Case Consideration</b>
- Request timeout - Request timeout
...@@ -65,9 +106,7 @@ The graph execution engine consists of OPs and Channels, and the connected OPs s ...@@ -65,9 +106,7 @@ The graph execution engine consists of OPs and Channels, and the connected OPs s
- For output buffer, you can use a similar process as input buffer, which adjusts the concurrency of OP3 and OP4 to control the buffer length of output buffer. (The length of the output buffer depends on the speed at which downstream OPs obtain data from the output buffer) - For output buffer, you can use a similar process as input buffer, which adjusts the concurrency of OP3 and OP4 to control the buffer length of output buffer. (The length of the output buffer depends on the speed at which downstream OPs obtain data from the output buffer)
- The amount of data in the Channel will not exceed `worker_num` of gRPC, that is, it will not exceed the thread pool size. - The amount of data in the Channel will not exceed `worker_num` of gRPC, that is, it will not exceed the thread pool size.
## Detailed Design ## ★ Detailed Design
### User Interface Design
#### 1. General OP Definition #### 1. General OP Definition
...@@ -79,11 +118,13 @@ def __init__(name=None, ...@@ -79,11 +118,13 @@ def __init__(name=None,
server_endpoints=[], server_endpoints=[],
fetch_list=[], fetch_list=[],
client_config=None, client_config=None,
client_type=None,
concurrency=1, concurrency=1,
timeout=-1, timeout=-1,
retry=1, retry=1,
batch_size=1, batch_size=1,
auto_batching_timeout=None) auto_batching_timeout=None,
local_service_handler=None)
``` ```
The meaning of each parameter is as follows: The meaning of each parameter is as follows:
...@@ -92,14 +133,16 @@ The meaning of each parameter is as follows: ...@@ -92,14 +133,16 @@ The meaning of each parameter is as follows:
| :-------------------: | :----------------------------------------------------------: | | :-------------------: | :----------------------------------------------------------: |
| name | (str) String used to identify the OP type, which must be globally unique. | | name | (str) String used to identify the OP type, which must be globally unique. |
| input_ops | (list) A list of all previous OPs of the current Op. | | input_ops | (list) A list of all previous OPs of the current Op. |
| server_endpoints | (list) List of endpoints for remote Paddle Serving Service. If this parameter is not set, the OP will not access the remote Paddle Serving Service, that is, the process operation will not be performed. | | server_endpoints | (list) List of endpoints for remote Paddle Serving Service. If this parameter is not set,it is considered as local_precditor mode, and the configuration is read from local_service_conf |
| fetch_list | (list) List of fetch variable names for remote Paddle Serving Service. | | fetch_list | (list) List of fetch variable names for remote Paddle Serving Service. |
| client_config | (str) The path of the client configuration file corresponding to the Paddle Serving Service. | | client_config | (str) The path of the client configuration file corresponding to the Paddle Serving Service. |
| client_type | (str)brpc, grpc or local_predictor. local_predictor does not start the Serving service, in-process prediction|
| concurrency | (int) The number of concurrent OPs. | | concurrency | (int) The number of concurrent OPs. |
| timeout | (int) The timeout time of the process operation, in ms. If the value is less than zero, no timeout is considered. | | timeout | (int) The timeout time of the process operation, in ms. If the value is less than zero, no timeout is considered. |
| retry | (int) Timeout number of retries. When the value is 1, no retries are made. | | retry | (int) Timeout number of retries. When the value is 1, no retries are made. |
| batch_size | (int) The expected batch_size of Auto-Batching, since building batches may time out, the actual batch_size may be less than the set value. | | batch_size | (int) The expected batch_size of Auto-Batching, since building batches may time out, the actual batch_size may be less than the set value. |
| auto_batching_timeout | (float) Timeout for building batches of Auto-Batching (the unit is ms). | | auto_batching_timeout | (float) Timeout for building batches of Auto-Batching (the unit is ms). When batch_size> 1, auto_batching_timeout should be set, otherwise the waiting will be blocked when the number of requests is insufficient for batch_size|
| local_service_handler | (object) local predictor handler,assigned by Op init() input parameters or created in Op init()|
#### 2. General OP Secondary Development Interface #### 2. General OP Secondary Development Interface
...@@ -156,7 +199,7 @@ def init_op(self): ...@@ -156,7 +199,7 @@ def init_op(self):
It should be **noted** that in the threaded version of OP, each OP will only call this function once, so the loaded resources must be thread safe. It should be **noted** that in the threaded version of OP, each OP will only call this function once, so the loaded resources must be thread safe.
#### 3. RequestOp Definition #### 3. RequestOp Definition and Secondary Development Interface
RequestOp is used to process RPC data received by Pipeline Server, and the processed data will be added to the graph execution engine. Its constructor is as follows: RequestOp is used to process RPC data received by Pipeline Server, and the processed data will be added to the graph execution engine. Its constructor is as follows:
...@@ -164,7 +207,7 @@ RequestOp is used to process RPC data received by Pipeline Server, and the proce ...@@ -164,7 +207,7 @@ RequestOp is used to process RPC data received by Pipeline Server, and the proce
def __init__(self) def __init__(self)
``` ```
#### 4. RequestOp Secondary Development Interface When the default RequestOp cannot meet the parameter parsing requirements, you can customize the request parameter parsing method by rewriting the following two interfaces.
| Interface or Variable | Explain | | Interface or Variable | Explain |
| :---------------------------------------: | :----------------------------------------------------------: | | :---------------------------------------: | :----------------------------------------------------------: |
...@@ -188,7 +231,7 @@ def unpack_request_package(self, request): ...@@ -188,7 +231,7 @@ def unpack_request_package(self, request):
The return value is required to be a dictionary type. The return value is required to be a dictionary type.
#### 5. ResponseOp Definition #### 4. ResponseOp Definition and Secondary Development Interface
ResponseOp is used to process the prediction results of the graph execution engine. The processed data will be used as the RPC return value of Pipeline Server. Its constructor is as follows: ResponseOp is used to process the prediction results of the graph execution engine. The processed data will be used as the RPC return value of Pipeline Server. Its constructor is as follows:
...@@ -198,7 +241,7 @@ def __init__(self, input_ops) ...@@ -198,7 +241,7 @@ def __init__(self, input_ops)
`input_ops` is the last OP of graph execution engine. Users can construct different DAGs by setting different `input_ops` without modifying the topology of OPs. `input_ops` is the last OP of graph execution engine. Users can construct different DAGs by setting different `input_ops` without modifying the topology of OPs.
#### 6. ResponseOp Secondary Development Interface When the default ResponseOp cannot meet the requirements of the result return format, you can customize the return package packaging method by rewriting the following two interfaces.
| Interface or Variable | Explain | | Interface or Variable | Explain |
| :------------------------------------------: | :----------------------------------------------------------: | | :------------------------------------------: | :----------------------------------------------------------: |
...@@ -237,7 +280,7 @@ def pack_response_package(self, channeldata): ...@@ -237,7 +280,7 @@ def pack_response_package(self, channeldata):
return resp return resp
``` ```
#### 7. PipelineServer Definition #### 5. PipelineServer Definition
The definition of PipelineServer is relatively simple, as follows: The definition of PipelineServer is relatively simple, as follows:
...@@ -251,22 +294,137 @@ server.run_server() ...@@ -251,22 +294,137 @@ server.run_server()
Where `response_op` is the responseop mentioned above, PipelineServer will initialize Channels according to the topology relationship of each OP and build the calculation graph. `config_yml_path` is the configuration file of PipelineServer. The example file is as follows: Where `response_op` is the responseop mentioned above, PipelineServer will initialize Channels according to the topology relationship of each OP and build the calculation graph. `config_yml_path` is the configuration file of PipelineServer. The example file is as follows:
```yaml ```yaml
rpc_port: 18080 # gRPC port # gRPC port
worker_num: 1 # gRPC thread pool size (the number of processes in the process version servicer). The default is 1 rpc_port: 18080
build_dag_each_worker: false # Whether to use process server or not. The default is false
http_port: 0 # HTTP service port. Do not start HTTP service when the value is less or equals 0. The default value is 0. # http port, do not start HTTP service when the value is less or equals 0. The default value is 0.
http_port: 18071
# gRPC thread pool size (the number of processes in the process version servicer). The default is 1
worker_num: 1
# Whether to use process server or not. The default is false
build_dag_each_worker: false
dag: dag:
is_thread_op: true # Whether to use the thread version of OP. The default is true # Whether to use the thread version of OP. The default is true
client_type: brpc # Use brpc or grpc client. The default is brpc is_thread_op: true
retry: 1 # The number of times DAG executor retries after failure. The default value is 1, that is, no retrying
use_profile: false # Whether to print the log on the server side. The default is false # The number of times DAG executor retries after failure. The default value is 1, that is, no retrying
retry: 1
# Whether to print the log on the server side. The default is false
use_profile: false
# Monitoring time interval of Tracer (in seconds). Do not start monitoring when the value is less than 1. The default value is -1
tracer: tracer:
interval_s: 600 # Monitoring time interval of Tracer (in seconds). Do not start monitoring when the value is less than 1. The default value is -1 interval_s: 600
op:
bow:
# Concurrency, when is_thread_op=True, it's thread concurrency; otherwise, it's process concurrency
concurrency: 1
# Client types, brpc, grpc and local_predictor
client_type: brpc
# Retry times, no retry by default
retry: 1
# Prediction timeout, ms
timeout: 3000
# Serving IPs
server_endpoints: ["127.0.0.1:9393"]
# Client config of bow model
client_config: "imdb_bow_client_conf/serving_client_conf.prototxt"
# Fetch list
fetch_list: ["prediction"]
# Batch size, default 1
batch_size: 1
# Batch query timeout
auto_batching_timeout: 2000
```
### 6. Special usages
#### 6.1 <b>Business custom error type</b>
Users can customize the error code according to the business, inherit ProductErrCode, and return it in the return list in Op's preprocess or postprocess. The next stage of processing will skip the post OP processing based on the custom error code.
```python
class ProductErrCode(enum.Enum):
"""
ProductErrCode is a base class for recording business error code.
product developers inherit this class and extend more error codes.
"""
pass
```
#### <b>6.2 Skip process stage</b>
The 2rd result of the result list returned by preprocess is `is_skip_process=True`, indicating whether to skip the process stage of the current OP and directly enter the postprocess processing
```python
def preprocess(self, input_dicts, data_id, log_id):
"""
In preprocess stage, assembling data for process stage. users can
override this function for model feed features.
Args:
input_dicts: input data to be preprocessed
data_id: inner unique id
log_id: global unique id for RTT
Return:
input_dict: data for process stage
is_skip_process: skip process stage or not, False default
prod_errcode: None default, otherwise, product errores occured.
It is handled in the same way as exception.
prod_errinfo: "" default
"""
# multiple previous Op
if len(input_dicts) != 1:
_LOGGER.critical(
self._log(
"Failed to run preprocess: this Op has multiple previous "
"inputs. Please override this func."))
os._exit(-1)
(_, input_dict), = input_dicts.items()
return input_dict, False, None, ""
``` ```
#### <b>6.3 Custom proto Request and Response</b>
When the default proto structure does not meet the business requirements, at the same time, the Request and Response message structures of the proto in the following two files remain the same.
> pipeline/gateway/proto/gateway.proto
> pipeline/proto/pipeline_service.proto
Recompile Serving Server again.
#### <b>6.4 Custom URL</b>
## Example The grpc gateway processes post requests. The default `method` is `prediction`, for example: 127.0.0.1:8080/ocr/prediction. Users can customize the name and method, and can seamlessly switch services with existing URLs.
```proto
service PipelineService {
rpc inference(Request) returns (Response) {
option (google.api.http) = {
post : "/{name=*}/{method=*}"
body : "*"
};
}
};
```
***
## ★ Classic examples
Here, we build a simple imdb model enable example to show how to use Pipeline Serving. The relevant code can be found in the `python/examples/pipeline/imdb_model_ensemble` folder. The Server-side structure in the example is shown in the following figure: Here, we build a simple imdb model enable example to show how to use Pipeline Serving. The relevant code can be found in the `python/examples/pipeline/imdb_model_ensemble` folder. The Server-side structure in the example is shown in the following figure:
...@@ -277,7 +435,7 @@ Here, we build a simple imdb model enable example to show how to use Pipeline Se ...@@ -277,7 +435,7 @@ Here, we build a simple imdb model enable example to show how to use Pipeline Se
</center> </center>
### Get the model file and start the Paddle Serving Service ### 1. Get the model file and start the Paddle Serving Service
```shell ```shell
cd python/examples/pipeline/imdb_model_ensemble cd python/examples/pipeline/imdb_model_ensemble
...@@ -288,7 +446,83 @@ python -m paddle_serving_server.serve --model imdb_bow_model --port 9393 &> bow. ...@@ -288,7 +446,83 @@ python -m paddle_serving_server.serve --model imdb_bow_model --port 9393 &> bow.
PipelineServing also supports local automatic startup of PaddleServingService. Please refer to the example `python/examples/pipeline/ocr`. PipelineServing also supports local automatic startup of PaddleServingService. Please refer to the example `python/examples/pipeline/ocr`.
### Start PipelineServer ### 2. Create config.yaml
Because there is a lot of configuration information in config.yaml,, only part of the OP configuration is shown here. For full information, please refer to `python/examples/pipeline/imdb_model_ensemble/config.yaml`
```yaml
op:
bow:
# Concurrency, when is_thread_op=True, it's thread concurrency; otherwise, it's process concurrency
concurrency: 1
# Client types, brpc, grpc and local_predictor
client_type: brpc
# Retry times, no retry by default
retry: 1
# Predcition timeout, ms
timeout: 3000
# Serving IPs
server_endpoints: ["127.0.0.1:9393"]
# Client config of bow model
client_config: "imdb_bow_client_conf/serving_client_conf.prototxt"
# Fetch list
fetch_list: ["prediction"]
# Batch request size, default 1
batch_size: 1
# Batch query timeout
auto_batching_timeout: 2000
cnn:
# Concurrency
concurrency: 1
# Client types, brpc, grpc and local_predictor
client_type: brpc
# Retry times, no retry by default
retry: 1
# Predcition timeout, ms
timeout: 3000
# Serving IPs
server_endpoints: ["127.0.0.1:9292"]
# Client config of cnn model
client_config: "imdb_cnn_client_conf/serving_client_conf.prototxt"
# Fetch list
fetch_list: ["prediction"]
# Batch request size, default 1
batch_size: 1
# Batch query timeout
auto_batching_timeout: 2000
combine:
# Concurrency
concurrency: 1
#R etry times, no retry by default
retry: 1
# Predcition timeout, ms
timeout: 3000
# Batch request size, default 1
batch_size: 1
# Batch query timeout, ms
auto_batching_timeout: 2000
### 3. Start PipelineServer
Run the following code Run the following code
...@@ -359,7 +593,7 @@ server.prepare_server('config.yml') ...@@ -359,7 +593,7 @@ server.prepare_server('config.yml')
server.run_server() server.run_server()
``` ```
### Perform prediction through PipelineClient ### 4. Perform prediction through PipelineClient
```python ```python
from paddle_serving_client.pipeline import PipelineClient from paddle_serving_client.pipeline import PipelineClient
...@@ -385,13 +619,16 @@ for f in futures: ...@@ -385,13 +619,16 @@ for f in futures:
exit(1) exit(1)
``` ```
***
## ★ Performance analysis
## How to optimize with the timeline tool ### 1. How to optimize with the timeline tool
In order to better optimize the performance, PipelineServing provides a timeline tool to monitor the time of each stage of the whole service. In order to better optimize the performance, PipelineServing provides a timeline tool to monitor the time of each stage of the whole service.
### Output profile information on server side ### 2. Output profile information on server side
The server is controlled by the `use_profile` field in yaml: The server is controlled by the `use_profile` field in yaml:
...@@ -418,8 +655,29 @@ if __name__ == "__main__": ...@@ -418,8 +655,29 @@ if __name__ == "__main__":
Specific operation: open Chrome browser, input in the address bar `chrome://tracing/` , jump to the tracing page, click the load button, open the saved `trace` file, and then visualize the time information of each stage of the prediction service. Specific operation: open Chrome browser, input in the address bar `chrome://tracing/` , jump to the tracing page, click the load button, open the saved `trace` file, and then visualize the time information of each stage of the prediction service.
### Output profile information on client side ### 3. Output profile information on client side
The profile function can be enabled by setting `profile=True` in the `predict` interface on the client side. The profile function can be enabled by setting `profile=True` in the `predict` interface on the client side.
After the function is enabled, the client will print the log information corresponding to the prediction to the standard output during the prediction process, and the subsequent analysis and processing are the same as that of the server. After the function is enabled, the client will print the log information corresponding to the prediction to the standard output during the prediction process, and the subsequent analysis and processing are the same as that of the server.
### 4. Analytical methods
```
cost of one single OP:
op_cost = process(pre + mid + post)
OP Concurrency:
op_concurrency = op_cost(s) * qps_expected
Service throughput:
service_throughput = 1 / slowest_op_cost * op_concurrency
Service average cost:
service_avg_cost = ∑op_concurrency in critical Path
Channel accumulations:
channel_acc_size = QPS(down - up) * time
Average cost of batch predictor:
avg_batch_cost = (N * pre + mid + post) / N
```
...@@ -7,15 +7,47 @@ Paddle Serving 通常用于单模型的一键部署,但端到端的深度学 ...@@ -7,15 +7,47 @@ Paddle Serving 通常用于单模型的一键部署,但端到端的深度学
Paddle Serving 提供了用户友好的多模型组合服务编程框架,Pipeline Serving,旨在降低编程门槛,提高资源使用率(尤其是GPU设备),提升整体的预估效率。 Paddle Serving 提供了用户友好的多模型组合服务编程框架,Pipeline Serving,旨在降低编程门槛,提高资源使用率(尤其是GPU设备),提升整体的预估效率。
## 整体架构设计 ## 整体架构设计
Server端基于 gRPC 和图执行引擎构建,两者的关系如下图所示。 Server端基于<b>RPC服务层</b><b>图执行引擎</b>构建,两者的关系如下图所示。
<center> <center>
<img src='pipeline_serving-image1.png' height = "250" align="middle"/> <img src='pipeline_serving-image1.png' height = "250" align="middle"/>
</center> </center>
### 图执行引擎 </n>
### 1. RPC服务层
为满足用户不同的使用需求,RPC服务层同时启动1个Web服务器和1个RPC服务器,可同时处理RESTful API、gRPC 2种类型请求。gPRC gateway接收RESTful API请求通过反向代理服务器将请求转发给gRPC Service;gRPC请求由gRPC service接收,所以,2种类型的请求统一由gRPC Service处理,确保处理逻辑一致。
#### <b>1.1 proto的输入输出结构</b>
gRPC服务和gRPC gateway服务统一用service.proto生成。
```proto
message Request {
repeated string key = 1;
repeated string value = 2;
optional string name = 3;
optional string method = 4;
optional int64 logid = 5;
optional string clientip = 6;
};
message Response {
optional int32 err_no = 1;
optional string err_msg = 2;
repeated string key = 3;
repeated string value = 4;
};
```
Request中`key``value`是配对的string数组。 `name``method`对应RESTful API的URL://{ip}:{port}/{name}/{method}。`logid``clientip`便于用户串联服务级请求和自定义策略。
Response中`err_no``err_msg`表达处理结果的正确性和错误信息,`key``value`为返回结果。
### 2. 图执行引擎
图执行引擎由 OP 和 Channel 构成,相连接的 OP 之间会共享一个 Channel。 图执行引擎由 OP 和 Channel 构成,相连接的 OP 之间会共享一个 Channel。
...@@ -29,7 +61,7 @@ Server端基于 gRPC 和图执行引擎构建,两者的关系如下图所示 ...@@ -29,7 +61,7 @@ Server端基于 gRPC 和图执行引擎构建,两者的关系如下图所示
</center> </center>
### OP的设计 #### <b>2.1 OP的设计</b>
- 单个 OP 默认的功能是根据输入的 Channel 数据,访问一个 Paddle Serving 的单模型服务,并将结果存在输出的 Channel - 单个 OP 默认的功能是根据输入的 Channel 数据,访问一个 Paddle Serving 的单模型服务,并将结果存在输出的 Channel
- 单个 OP 可以支持用户自定义,包括 preprocess,process,postprocess 三个函数都可以由用户继承和实现 - 单个 OP 可以支持用户自定义,包括 preprocess,process,postprocess 三个函数都可以由用户继承和实现
...@@ -37,7 +69,7 @@ Server端基于 gRPC 和图执行引擎构建,两者的关系如下图所示 ...@@ -37,7 +69,7 @@ Server端基于 gRPC 和图执行引擎构建,两者的关系如下图所示
- 单个 OP 可以获取多个不同 RPC 请求的数据,以实现 Auto-Batching - 单个 OP 可以获取多个不同 RPC 请求的数据,以实现 Auto-Batching
- OP 可以由线程或进程启动 - OP 可以由线程或进程启动
### Channel的设计 #### <b>2.2 Channel的设计</b>
- Channel 是 OP 之间共享数据的数据结构,负责共享数据或者共享数据状态信息 - Channel 是 OP 之间共享数据的数据结构,负责共享数据或者共享数据状态信息
- Channel 可以支持多个OP的输出存储在同一个 Channel,同一个 Channel 中的数据可以被多个 OP 使用 - Channel 可以支持多个OP的输出存储在同一个 Channel,同一个 Channel 中的数据可以被多个 OP 使用
...@@ -47,8 +79,18 @@ Server端基于 gRPC 和图执行引擎构建,两者的关系如下图所示 ...@@ -47,8 +79,18 @@ Server端基于 gRPC 和图执行引擎构建,两者的关系如下图所示
<img src='pipeline_serving-image3.png' height = "500" align="middle"/> <img src='pipeline_serving-image3.png' height = "500" align="middle"/>
</center> </center>
#### <b>2.3 预测类型的设计</b>
- OP的预测类型(client_type)有3种类型,brpc、grpc和local_predictor
- brpc: 使用bRPC Client与远端的Serving服务网络交互,性能优于grpc
- grpc: 使用gRPC Client与远端的Serving服务网络交互,支持跨平台部署
- local_predictor: 本地服务内加载模型并完成预测,不需要与网络交互。支持多卡部署,和TensorRT高性能预测。
- 选型:
- 延时(越少越好): local_predict < brpc <= grpc
- 微服务: brpc或grpc模型分拆成独立服务,简化开发和部署复杂度,提升资源利用率
### 极端情况的考虑
#### <b>2.4 极端情况的考虑</b>
- 请求超时的处理 - 请求超时的处理
...@@ -65,9 +107,11 @@ Server端基于 gRPC 和图执行引擎构建,两者的关系如下图所示 ...@@ -65,9 +107,11 @@ Server端基于 gRPC 和图执行引擎构建,两者的关系如下图所示
- 对于 output buffer,可以采用和 input buffer 类似的处理方法,即调整 OP3 和 OP4 的并发数,使得 output buffer 的缓冲长度得到控制(output buffer 的长度取决于下游 OP 从 output buffer 获取数据的速度) - 对于 output buffer,可以采用和 input buffer 类似的处理方法,即调整 OP3 和 OP4 的并发数,使得 output buffer 的缓冲长度得到控制(output buffer 的长度取决于下游 OP 从 output buffer 获取数据的速度)
- 同时 Channel 中数据量不会超过 gRPC 的 `worker_num`,即线程池大小 - 同时 Channel 中数据量不会超过 gRPC 的 `worker_num`,即线程池大小
### 用户接口设计 ***
## ★ 详细设计
#### 1. 普通 OP 定义 ### 1. 普通 OP 定义
普通 OP 作为图执行引擎中的基本单元,其构造函数如下: 普通 OP 作为图执行引擎中的基本单元,其构造函数如下:
...@@ -77,11 +121,13 @@ def __init__(name=None, ...@@ -77,11 +121,13 @@ def __init__(name=None,
server_endpoints=[], server_endpoints=[],
fetch_list=[], fetch_list=[],
client_config=None, client_config=None,
client_type=None,
concurrency=1, concurrency=1,
timeout=-1, timeout=-1,
retry=1, retry=1,
batch_size=1, batch_size=1,
auto_batching_timeout=None) auto_batching_timeout=None,
local_service_handler=None)
``` ```
各参数含义如下 各参数含义如下
...@@ -90,17 +136,21 @@ def __init__(name=None, ...@@ -90,17 +136,21 @@ def __init__(name=None,
| :-------------------: | :----------------------------------------------------------: | | :-------------------: | :----------------------------------------------------------: |
| name | (str)用于标识 OP 类型的字符串,该字段必须全局唯一。 | | name | (str)用于标识 OP 类型的字符串,该字段必须全局唯一。 |
| input_ops | (list)当前 OP 的所有前继 OP 的列表。 | | input_ops | (list)当前 OP 的所有前继 OP 的列表。 |
| server_endpoints | (list)远程 Paddle Serving Service 的 endpoints 列表。如果不设置该参数,则不访问远程 Paddle Serving Service,即 不会执行 process 操作。 | | server_endpoints | (list)远程 Paddle Serving Service 的 endpoints 列表。如果不设置该参数,认为是local_precditor模式,从local_service_conf中读取配置。 |
| fetch_list | (list)远程 Paddle Serving Service 的 fetch 列表。 | | fetch_list | (list)远程 Paddle Serving Service 的 fetch 列表。 |
| client_config | (str)Paddle Serving Service 对应的 Client 端配置文件路径。 | | client_config | (str)Paddle Serving Service 对应的 Client 端配置文件路径。 |
| client_type | (str) 可选择brpc、grpc或local_predictor。local_predictor不启动Serving服务,进程内预测。 |
| concurrency | (int)OP 的并发数。 | | concurrency | (int)OP 的并发数。 |
| timeout | (int)process 操作的超时时间,单位为毫秒。若该值小于零,则视作不超时。 | | timeout | (int)process 操作的超时时间,单位为毫秒。若该值小于零,则视作不超时。 |
| retry | (int)超时重试次数。当该值为 1 时,不进行重试。 | | retry | (int)超时重试次数。当该值为 1 时,不进行重试。 |
| batch_size | (int)进行 Auto-Batching 的期望 batch_size 大小,由于构建 batch 可能超时,实际 batch_size 可能小于设定值。 | | batch_size | (int)进行 Auto-Batching 的期望 batch_size 大小,由于构建 batch 可能超时,实际 batch_size 可能小于设定值,默认为 1。 |
| auto_batching_timeout | (float)进行 Auto-Batching 构建 batch 的超时时间,单位为毫秒。 | | auto_batching_timeout | (float)进行 Auto-Batching 构建 batch 的超时时间,单位为毫秒。batch_size > 1时,要设置auto_batching_timeout,否则请求数量不足batch_size时会阻塞等待。 |
| local_service_handler | (object) local predictor handler,Op init()入参赋值 或 在Op init()中创建|
#### 2. 普通 OP二次开发接口 ### 2. 普通 OP二次开发接口
OP 二次开发的目的是满足业务开发人员控制OP处理策略。
| 变量或接口 | 说明 | | 变量或接口 | 说明 |
| :----------------------------------------------: | :----------------------------------------------------------: | | :----------------------------------------------: | :----------------------------------------------------------: |
...@@ -154,7 +204,7 @@ def init_op(self): ...@@ -154,7 +204,7 @@ def init_op(self):
需要**注意**的是,在线程版 OP 中,每个 OP 只会调用一次该函数,故加载的资源必须要求是线程安全的。 需要**注意**的是,在线程版 OP 中,每个 OP 只会调用一次该函数,故加载的资源必须要求是线程安全的。
#### 3. RequestOp 定义 ### 3. RequestOp 定义 与 二次开发接口
RequestOp 用于处理 Pipeline Server 接收到的 RPC 数据,处理后的数据将会被加入到图执行引擎中。其构造函数如下: RequestOp 用于处理 Pipeline Server 接收到的 RPC 数据,处理后的数据将会被加入到图执行引擎中。其构造函数如下:
...@@ -162,7 +212,7 @@ RequestOp 用于处理 Pipeline Server 接收到的 RPC 数据,处理后的数 ...@@ -162,7 +212,7 @@ RequestOp 用于处理 Pipeline Server 接收到的 RPC 数据,处理后的数
def __init__(self) def __init__(self)
``` ```
#### 4. RequestOp 二次开发接口 当默认的RequestOp无法满足参数解析需求时,可通过重写下面2个接口自定义请求参数解析方法。
| 变量或接口 | 说明 | | 变量或接口 | 说明 |
| :---------------------------------------: | :----------------------------------------: | | :---------------------------------------: | :----------------------------------------: |
...@@ -186,7 +236,7 @@ def unpack_request_package(self, request): ...@@ -186,7 +236,7 @@ def unpack_request_package(self, request):
要求返回值是一个字典类型。 要求返回值是一个字典类型。
#### 5. ResponseOp 定义 #### 4. ResponseOp 定义 与 二次开发接口
ResponseOp 用于处理图执行引擎的预测结果,处理后的数据将会作为 Pipeline Server 的RPC 返回值,其构造函数如下: ResponseOp 用于处理图执行引擎的预测结果,处理后的数据将会作为 Pipeline Server 的RPC 返回值,其构造函数如下:
...@@ -196,7 +246,7 @@ def __init__(self, input_ops) ...@@ -196,7 +246,7 @@ def __init__(self, input_ops)
其中,`input_ops` 是图执行引擎的最后一个 OP,用户可以通过设置不同的 `input_ops` 以在不修改 OP 的拓扑关系下构造不同的 DAG。 其中,`input_ops` 是图执行引擎的最后一个 OP,用户可以通过设置不同的 `input_ops` 以在不修改 OP 的拓扑关系下构造不同的 DAG。
#### 6. ResponseOp 二次开发接口 当默认的 ResponseOp 无法满足结果返回格式要求时,可通过重写下面2个接口自定义返回包打包方法。
| 变量或接口 | 说明 | | 变量或接口 | 说明 |
| :------------------------------------------: | :-----------------------------------------: | | :------------------------------------------: | :-----------------------------------------: |
...@@ -235,7 +285,7 @@ def pack_response_package(self, channeldata): ...@@ -235,7 +285,7 @@ def pack_response_package(self, channeldata):
return resp return resp
``` ```
#### 7. PipelineServer定义 #### 5. PipelineServer定义
PipelineServer 的定义比较简单,如下所示: PipelineServer 的定义比较简单,如下所示:
...@@ -249,22 +299,134 @@ server.run_server() ...@@ -249,22 +299,134 @@ server.run_server()
其中,`response_op` 为上面提到的 ResponseOp,PipelineServer 将会根据各个 OP 的拓扑关系初始化 Channel 并构建计算图。`config_yml_path` 为 PipelineServer 的配置文件,示例文件如下: 其中,`response_op` 为上面提到的 ResponseOp,PipelineServer 将会根据各个 OP 的拓扑关系初始化 Channel 并构建计算图。`config_yml_path` 为 PipelineServer 的配置文件,示例文件如下:
```yaml ```yaml
rpc_port: 18080 # gRPC端口号 # gRPC端口号
worker_num: 1 # gRPC线程池大小(进程版 Servicer 中为进程数),默认为 1 rpc_port: 18080
build_dag_each_worker: false # 是否使用进程版 Servicer,默认为 false
http_port: 0 # HTTP 服务的端口号,若该值小于或等于 0 则不开启 HTTP 服务,默认为 0 # http端口号,若该值小于或等于 0 则不开启 HTTP 服务,默认为 0
http_port: 18071
# #worker_num, 最大并发数。当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG
worker_num: 1
# 是否使用进程版 Servicer,默认为 false
build_dag_each_worker: false
dag: dag:
is_thread_op: true # 是否使用线程版Op,默认为 true # op资源类型, True, 为线程模型;False,为进程模型,默认为 True
client_type: brpc # 使用 brpc 或 grpc client,默认为 brpc is_thread_op: true
retry: 1 # DAG Executor 在失败后重试次数,默认为 1,即不重试
use_profile: false # 是否在 Server 端打印日志,默认为 false # DAG Executor 在失败后重试次数,默认为 1,即不重试
retry: 1
# 是否在 Server 端打印日志,默认为 false
use_profile: false
# 跟踪框架吞吐,每个OP和channel的工作情况。无tracer时不生成数据
tracer: tracer:
interval_s: 600 # Tracer 监控的时间间隔,单位为秒。当该值小于 1 时不启动监控,默认为 -1 interval_s: 600 # 监控的时间间隔,单位为秒。当该值小于 1 时不启动监控,默认为 -1
op:
bow:
# 并发数,is_thread_op=True时,为线程并发;否则为进程并发
concurrency: 1
# client连接类型,brpc
client_type: brpc
# Serving交互重试次数,默认不重试
retry: 1
# Serving交互超时时间, 单位ms
timeout: 3000
# Serving IPs
server_endpoints: ["127.0.0.1:9393"]
# bow模型client端配置
client_config: "imdb_bow_client_conf/serving_client_conf.prototxt"
# Fetch结果列表,以client_config中fetch_var的alias_name为准
fetch_list: ["prediction"]
# 批量查询Serving的数量, 默认1。batch_size>1要设置 auto_batching_timeout,否则不足batch_size时会阻塞
batch_size: 1
# 批量查询超时,与batch_size配合使用
auto_batching_timeout: 2000
``` ```
### 6. 特殊用法
#### 6.1 <b>业务自定义错误类型</b>
用户可根据业务场景自定义错误码,继承ProductErrCode,在Op的preprocess或postprocess中返回列表中返回,下一阶段处理会根据自定义错误码跳过后置OP处理。
```python
class ProductErrCode(enum.Enum):
"""
ProductErrCode is a base class for recording business error code.
product developers inherit this class and extend more error codes.
"""
pass
```
#### <b>6.2 跳过OP process阶段</b>
preprocess返回结果列表的第二个结果是`is_skip_process=True`表示是否跳过当前OP的process阶段,直接进入postprocess处理
```python
def preprocess(self, input_dicts, data_id, log_id):
"""
In preprocess stage, assembling data for process stage. users can
override this function for model feed features.
Args:
input_dicts: input data to be preprocessed
data_id: inner unique id
log_id: global unique id for RTT
Return:
input_dict: data for process stage
is_skip_process: skip process stage or not, False default
prod_errcode: None default, otherwise, product errores occured.
It is handled in the same way as exception.
prod_errinfo: "" default
"""
# multiple previous Op
if len(input_dicts) != 1:
_LOGGER.critical(
self._log(
"Failed to run preprocess: this Op has multiple previous "
"inputs. Please override this func."))
os._exit(-1)
(_, input_dict), = input_dicts.items()
return input_dict, False, None, ""
```
#### <b>6.3 自定义proto Request 和 Response结构</b>
## 例子 当默认proto结构不满足业务需求时,同时下面2个文件的proto的Request和Response message结构,保持一致。
> pipeline/gateway/proto/gateway.proto
> pipeline/proto/pipeline_service.proto
再重新编译Serving Server。
#### <b>6.4 自定义URL</b>
grpc gateway处理post请求,默认`method``prediction`,例如:127.0.0.1:8080/ocr/prediction。用户可自定义name和method,对于已有url的服务可无缝切换
```proto
service PipelineService {
rpc inference(Request) returns (Response) {
option (google.api.http) = {
post : "/{name=*}/{method=*}"
body : "*"
};
}
};
```
***
## ★ 典型示例
这里通过搭建简单的 imdb model ensemble 例子来展示如何使用 Pipeline Serving,相关代码在 `python/examples/pipeline/imdb_model_ensemble` 文件夹下可以找到,例子中的 Server 端结构如下图所示: 这里通过搭建简单的 imdb model ensemble 例子来展示如何使用 Pipeline Serving,相关代码在 `python/examples/pipeline/imdb_model_ensemble` 文件夹下可以找到,例子中的 Server 端结构如下图所示:
...@@ -275,7 +437,7 @@ dag: ...@@ -275,7 +437,7 @@ dag:
</center> </center>
### 获取模型文件并启动 Paddle Serving Service ### 1. 获取模型文件并启动 Paddle Serving Service
```shell ```shell
cd python/examples/pipeline/imdb_model_ensemble cd python/examples/pipeline/imdb_model_ensemble
...@@ -286,9 +448,84 @@ python -m paddle_serving_server.serve --model imdb_bow_model --port 9393 &> bow. ...@@ -286,9 +448,84 @@ python -m paddle_serving_server.serve --model imdb_bow_model --port 9393 &> bow.
PipelineServing 也支持本地自动启动 PaddleServingService,请参考 `python/examples/pipeline/ocr` 下的例子。 PipelineServing 也支持本地自动启动 PaddleServingService,请参考 `python/examples/pipeline/ocr` 下的例子。
### 启动 PipelineServer ### 2. 创建config.yaml
由于config.yaml配置信息量很多,这里仅展示OP部分配置,全量信息参考`python/examples/pipeline/imdb_model_ensemble/config.yaml`
```yaml
op:
bow:
# 并发数,is_thread_op=True时,为线程并发;否则为进程并发
concurrency: 1
# client连接类型,brpc
client_type: brpc
# Serving交互重试次数,默认不重试
retry: 1
# Serving交互超时时间, 单位ms
timeout: 3000
# Serving IPs
server_endpoints: ["127.0.0.1:9393"]
# bow模型client端配置
client_config: "imdb_bow_client_conf/serving_client_conf.prototxt"
# Fetch结果列表,以client_config中fetch_var的alias_name为准
fetch_list: ["prediction"]
# 批量查询Serving的数量, 默认1。batch_size>1要设置auto_batching_timeout,否则不足batch_size时会阻塞
batch_size: 1
# 批量查询超时,与batch_size配合使用
auto_batching_timeout: 2000
cnn:
# 并发数,is_thread_op=True时,为线程并发;否则为进程并发
concurrency: 1
# client连接类型,brpc
client_type: brpc
# Serving交互重试次数,默认不重试
retry: 1
# 预测超时时间, 单位ms
timeout: 3000
运行下面代码 # Serving IPs
server_endpoints: ["127.0.0.1:9292"]
# cnn模型client端配置
client_config: "imdb_cnn_client_conf/serving_client_conf.prototxt"
# Fetch结果列表,以client_config中fetch_var的alias_name为准
fetch_list: ["prediction"]
# 批量查询Serving的数量, 默认1。
batch_size: 1
# 批量查询超时,与batch_size配合使用
auto_batching_timeout: 2000
combine:
# 并发数,is_thread_op=True时,为线程并发;否则为进程并发
concurrency: 1
# Serving交互重试次数,默认不重试
retry: 1
# 预测超时时间, 单位ms
timeout: 3000
# 批量查询Serving的数量, 默认1。
batch_size: 1
# 批量查询超时,与batch_size配合使用
auto_batching_timeout: 2000
```
### 3. 启动 PipelineServer
代码示例中,重点留意3个自定义Op的proprocess、postprocess处理,以及Combin Op初始化列表input_ops=[bow_op, cnn_op],设置Combin Op的前置OP列表。
```python ```python
from paddle_serving_server.pipeline import Op, RequestOp, ResponseOp from paddle_serving_server.pipeline import Op, RequestOp, ResponseOp
...@@ -356,7 +593,7 @@ server.prepare_server('config.yml') ...@@ -356,7 +593,7 @@ server.prepare_server('config.yml')
server.run_server() server.run_server()
``` ```
### 通过 PipelineClient 执行预测 ### 4. 通过 PipelineClient 执行预测
```python ```python
from paddle_serving_client.pipeline import PipelineClient from paddle_serving_client.pipeline import PipelineClient
...@@ -382,13 +619,16 @@ for f in futures: ...@@ -382,13 +619,16 @@ for f in futures:
exit(1) exit(1)
``` ```
***
## ★ 性能分析
## 如何通过 Timeline 工具进行优化 ### 1. 如何通过 Timeline 工具进行优化
为了更好地对性能进行优化,PipelineServing 提供了 Timeline 工具,对整个服务的各个阶段时间进行打点。 为了更好地对性能进行优化,PipelineServing 提供了 Timeline 工具,对整个服务的各个阶段时间进行打点。
### 在 Server 端输出 Profile 信息 ### 2. 在 Server 端输出 Profile 信息
Server 端用 yaml 中的 `use_profile` 字段进行控制: Server 端用 yaml 中的 `use_profile` 字段进行控制:
...@@ -415,8 +655,29 @@ if __name__ == "__main__": ...@@ -415,8 +655,29 @@ if __name__ == "__main__":
具体操作:打开 chrome 浏览器,在地址栏输入 `chrome://tracing/` ,跳转至 tracing 页面,点击 load 按钮,打开保存的 `trace` 文件,即可将预测服务的各阶段时间信息可视化。 具体操作:打开 chrome 浏览器,在地址栏输入 `chrome://tracing/` ,跳转至 tracing 页面,点击 load 按钮,打开保存的 `trace` 文件,即可将预测服务的各阶段时间信息可视化。
### 在 Client 端输出 Profile 信息 ### 3. 在 Client 端输出 Profile 信息
Client 端在 `predict` 接口设置 `profile=True`,即可开启 Profile 功能。 Client 端在 `predict` 接口设置 `profile=True`,即可开启 Profile 功能。
开启该功能后,Client 端在预测的过程中会将该次预测对应的日志信息打印到标准输出,后续分析处理同 Server。 开启该功能后,Client 端在预测的过程中会将该次预测对应的日志信息打印到标准输出,后续分析处理同 Server。
### 4. 分析方法
```
单OP耗时:
op_cost = process(pre + mid + post)
OP期望并发数:
op_concurrency = 单OP耗时(s) * 期望QPS
服务吞吐量:
service_throughput = 1 / 最慢OP的耗时 * 并发数
服务平响:
service_avg_cost = ∑op_concurrency 【关键路径】
Channel堆积:
channel_acc_size = QPS(down - up) * time
批量预测平均耗时:
avg_batch_cost = (N * pre + mid + post) / N
```
...@@ -2,6 +2,8 @@ ...@@ -2,6 +2,8 @@
([简体中文](RUN_IN_DOCKER_CN.md)|English) ([简体中文](RUN_IN_DOCKER_CN.md)|English)
One of the biggest benefits of Docker is portability, which can be deployed on multiple operating systems and mainstream cloud computing platforms. The Paddle Serving Docker image can be deployed on Linux, Mac and Windows platforms.
## Requirements ## Requirements
Docker (GPU version requires nvidia-docker to be installed on the GPU machine) Docker (GPU version requires nvidia-docker to be installed on the GPU machine)
......
...@@ -2,6 +2,8 @@ ...@@ -2,6 +2,8 @@
(简体中文|[English](RUN_IN_DOCKER.md)) (简体中文|[English](RUN_IN_DOCKER.md))
Docker最大的好处之一就是可移植性,可在多种操作系统和主流的云计算平台部署。使用Paddle Serving Docker镜像可在Linux、Mac和Windows平台部署。
## 环境要求 ## 环境要求
Docker(GPU版本需要在GPU机器上安装nvidia-docker) Docker(GPU版本需要在GPU机器上安装nvidia-docker)
......
numpy>=1.12, <=1.16.4 ; python_version<"3.5"
shapely==1.7.0
wheel>=0.34.0, <0.35.0
setuptools>=44.1.0
opencv-python==4.2.0.32
google>=2.0.3
opencv-python==4.2.0.32
protobuf>=3.12.2
grpcio-tools>=1.33.2
grpcio>=1.33.2
func-timeout>=4.3.5
pyyaml>=1.3.0
sentencepiece==0.1.83
flask>=1.1.2
ujson>=2.0.3
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册