未验证 提交 ff431cb8 编写于 作者: T TeslaZhao 提交者: GitHub

Merge branch 'PaddlePaddle:develop' into develop

# C++ Serving性能分析与优化
# 1.背景知识介绍
1) 首先,应确保您知道C++ Serving常用的一些[功能特点](Introduction_CN.md)[C++ Serving 参数配置和启动的详细说明](../SERVING_CONFIGURE_CN.md
1) 首先,应确保您知道C++ Serving常用的一些[功能特点](Introduction_CN.md)[C++ Serving 参数配置和启动的详细说明](../Serving_Configure_CN.md)
2) 关于C++ Serving框架本身的性能分析和介绍,请参考[C++ Serving框架性能测试](Frame_Performance_CN.md)
3) 您需要对您使用的模型、机器环境、需要部署上线的业务有一些了解,例如,您使用CPU还是GPU进行预测;是否可以开启TRT进行加速;你的机器CPU是多少core的;您的业务包含几个模型;每个模型的输入和输出需要做些什么处理;您业务的最大线上流量是多少;您的模型支持的最大输入batch是多少等等.
......
......@@ -57,3 +57,13 @@
- 更多Paddle Serving支持的部署模型请参考[wholechain](https://www.paddlepaddle.org.cn/wholechain)
- 最新模型可参考
- [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PaddleClas](https://github.com/PaddlePaddle/PaddleClas)
- [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP)
- [PaddleRec](https://github.com/PaddlePaddle/PaddleRec)
- [PaddleSeg](https://github.com/PaddlePaddle/PaddleSeg)
- [PaddleGAN](https://github.com/PaddlePaddle/PaddleGAN)
......@@ -57,3 +57,11 @@ Special thanks to the [Padddle wholechain](https://www.paddlepaddle.org.cn/whole
- Refer [wholechain](https://www.paddlepaddle.org.cn/wholechain) for more pre-trained models supported by PaddleServing
- Latest models refer
- [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PaddleClas](https://github.com/PaddlePaddle/PaddleClas)
- [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP)
- [PaddleRec](https://github.com/PaddlePaddle/PaddleRec)
- [PaddleSeg](https://github.com/PaddlePaddle/PaddleSeg)
- [PaddleGAN](https://github.com/PaddlePaddle/PaddleGAN)
# Pipeline Serving 性能优化
([English](./Performance_Tuning_EN.md)|简体中文)
## 1. 性能分析与优化
### 1.1 如何通过 Timeline 工具进行优化
为了更好地对性能进行优化,PipelineServing 提供了 Timeline 工具,对整个服务的各个阶段时间进行打点。
### 1.2 在 Server 端输出 Profile 信息
Server 端用 yaml 中的 `use_profile` 字段进行控制:
```yaml
dag:
use_profile: true
```
开启该功能后,Server 端在预测的过程中会将对应的日志信息打印到标准输出,为了更直观地展现各阶段的耗时,提供 Analyst 模块对日志文件做进一步的分析处理。
使用时先将 Server 的输出保存到文件,以 `profile.txt` 为例,脚本将日志中的时间打点信息转换成 json 格式保存到 `trace` 文件,`trace` 文件可以通过 chrome 浏览器的 tracing 功能进行可视化。
```python
from paddle_serving_server.pipeline import Analyst
import json
import sys
if __name__ == "__main__":
log_filename = "profile.txt"
trace_filename = "trace"
analyst = Analyst(log_filename)
analyst.save_trace(trace_filename)
```
具体操作:打开 chrome 浏览器,在地址栏输入 `chrome://tracing/` ,跳转至 tracing 页面,点击 load 按钮,打开保存的 `trace` 文件,即可将预测服务的各阶段时间信息可视化。
### 1.3 在 Client 端输出 Profile 信息
Client 端在 `predict` 接口设置 `profile=True`,即可开启 Profile 功能。
开启该功能后,Client 端在预测的过程中会将该次预测对应的日志信息打印到标准输出,后续分析处理同 Server。
### 1.4 分析方法
根据pipeline.tracer日志中的各个阶段耗时,按以下公式逐步分析出主要耗时在哪个阶段。
```
单OP耗时:
op_cost = process(pre + mid + post)
OP期望并发数:
op_concurrency = 单OP耗时(s) * 期望QPS
服务吞吐量:
service_throughput = 1 / 最慢OP的耗时 * 并发数
服务平响:
service_avg_cost = ∑op_concurrency 【关键路径】
Channel堆积:
channel_acc_size = QPS(down - up) * time
批量预测平均耗时:
avg_batch_cost = (N * pre + mid + post) / N
```
### 1.5 优化思路
根据长耗时在不同阶段,采用不同的优化方法.
- OP推理阶段(mid-process):
- 增加OP并发度
- 开启auto-batching(前提是多个请求的shape一致)
- 若批量数据中某条数据的shape很大,padding很大导致推理很慢,可使用mini-batch
- 开启TensorRT/MKL-DNN优化
- 开启低精度推理
- OP前处理阶段(pre-process):
- 增加OP并发度
- 优化前处理逻辑
- in/out耗时长(channel堆积>5)
- 检查channel传递的数据大小和延迟
- 优化传入数据,不传递数据或压缩后再传入
- 增加OP并发度
- 减少上游OP并发度
# Pipeline Serving Performance Optimization
(English|[简体中文](./Performance_Tuning_CN.md))
## 1. Performance analysis and optimization
### 1.1 How to optimize with the timeline tool
In order to better optimize the performance, PipelineServing provides a timeline tool to monitor the time of each stage of the whole service.
### 1.2 Output profile information on server side
The server is controlled by the `use_profile` field in yaml:
```yaml
dag:
use_profile: true
```
After the function is enabled, the server will print the corresponding log information to the standard output in the process of prediction. In order to show the time consumption of each stage more intuitively, Analyst module is provided for further analysis and processing of log files.
The output of the server is first saved to a file. Taking `profile.txt` as an example, the script converts the time monitoring information in the log into JSON format and saves it to the `trace` file. The `trace` file can be visualized through the tracing function of Chrome browser.
```shell
from paddle_serving_server.pipeline import Analyst
import json
import sys
if __name__ == "__main__":
log_filename = "profile.txt"
trace_filename = "trace"
analyst = Analyst(log_filename)
analyst.save_trace(trace_filename)
```
Specific operation: open Chrome browser, input in the address bar `chrome://tracing/` , jump to the tracing page, click the load button, open the saved `trace` file, and then visualize the time information of each stage of the prediction service.
### 1.3 Output profile information on client side
The profile function can be enabled by setting `profile=True` in the `predict` interface on the client side.
After the function is enabled, the client will print the log information corresponding to the prediction to the standard output during the prediction process, and the subsequent analysis and processing are the same as that of the server.
### 1.4 Analytical methods
According to the time consumption of each stage in the pipeline.tracer log, the following formula is used to gradually analyze which stage is the main time consumption.
```
cost of one single OP:
op_cost = process(pre + mid + post)
OP Concurrency:
op_concurrency = op_cost(s) * qps_expected
Service throughput:
service_throughput = 1 / slowest_op_cost * op_concurrency
Service average cost:
service_avg_cost = ∑op_concurrency in critical Path
Channel accumulations:
channel_acc_size = QPS(down - up) * time
Average cost of batch predictor:
avg_batch_cost = (N * pre + mid + post) / N
```
### 1.5 Optimization ideas
According to the long time consuming in stages below, different optimization methods are adopted.
- OP Inference stage(mid-process):
- Increase `concurrency`
- Turn on `auto-batching`(Ensure that the shapes of multiple requests are consistent)
- Use `mini-batch`, If the shape of data is very large.
- Turn on TensorRT for GPU
- Turn on MKLDNN for CPU
- Turn on low precison inference
- OP preprocess or postprocess stage:
- Increase `concurrency`
- Optimize processing logic
- In/Out stage(channel accumulation > 5):
- Check the size and delay of the data passed by the channel
- Optimize the channel to transmit data, do not transmit data or compress it before passing it in
- Increase `concurrency`
- Decrease `concurrency` upstreams.
......@@ -58,7 +58,7 @@ fetch_var {
## C++ Serving
### 1.快速启动
### 1.快速启动与关闭
可以通过配置模型及端口号快速启动服务,启动命令如下:
......@@ -106,6 +106,11 @@ python3 -m paddle_serving_server.serve --model serving_model --thread 10 --port
```BASH
python3 -m paddle_serving_server.serve --model serving_model_1 serving_model_2 --thread 10 --port 9292
```
#### 当您想要关闭Serving服务时.
```BASH
python3 -m paddle_serving_server.serve stop
```
stop参数发送SIGINT至C++ Serving,若改成kill则发送SIGKILL信号至C++ Serving
### 2.自定义配置启动
......@@ -312,6 +317,20 @@ fetch_var {
## Python Pipeline
### 快速启动与关闭
Python Pipeline启动命令如下:
```BASH
python3 web_service.py
```
当您想要关闭Serving服务时.
```BASH
python3 -m paddle_serving_server.serve stop
```
stop参数发送SIGINT至Pipeline Serving,若改成kill则发送SIGKILL信号至Pipeline Serving
### 配置文件
Python Pipeline提供了用户友好的多模型组合服务编程框架,适用于多模型组合应用的场景。
其配置文件为YAML格式,一般默认为config.yaml。示例如下:
```YAML
......@@ -453,4 +472,4 @@ Python Pipeline支持低精度推理,CPU、GPU和TensoRT支持的精度类型
#GPU 支持: "fp32"(default), "fp16(TensorRT)", "int8";
#CPU 支持: "fp32"(default), "fp16", "bf16"(mkldnn); 不支持: "int8"
precision: "fp32"
```
\ No newline at end of file
```
......@@ -58,7 +58,7 @@ fetch_var {
## C++ Serving
### 1. Quick start
### 1. Quick start and stop
The easiest way to start c++ serving is to provide the `--model` and `--port` flags.
......@@ -107,6 +107,11 @@ python3 -m paddle_serving_server.serve --model serving_model --thread 10 --port
```BASH
python3 -m paddle_serving_server.serve --model serving_model_1 serving_model_2 --thread 10 --port 9292
```
#### Stop Serving.
```BASH
python3 -m paddle_serving_server.serve stop
```
`stop` sends SIGINT to C++ Serving. When setting `kill`, SIGKILL will be sent to C++ Serving
### 2. Starting with user-defined Configuration
......@@ -316,6 +321,19 @@ fetch_var {
## Python Pipeline
### Quick start and stop
Example starting Pipeline Serving:
```BASH
python3 -m paddle_serving_server.serve --model serving_model --port 9393
```
### Stop Serving.
```BASH
python3 -m paddle_serving_server.serve stop
```
`stop` sends SIGINT to Pipeline Serving. When setting `kill`, SIGKILL will be sent to Pipeline Serving
### yml Configuration
Python Pipeline provides a user-friendly programming framework for multi-model composite services.
Example of config.yaml:
......@@ -460,4 +478,4 @@ Python Pipeline supports low-precision inference. The precision types supported
#GPU support: "fp32"(default), "fp16(TensorRT)", "int8";
#CPU support: "fp32"(default), "fp16", "bf16"(mkldnn); not support: "int8"
precision: "fp32"
```
\ No newline at end of file
```
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册