Commit 720bc286 authored by: TeslaZhao

update PIPELINE_SERVING.md PIPELINE_SERVING_CN.md

Parent 977beeec
......@@ -2,12 +2,12 @@
([简体中文](PIPELINE_SERVING_CN.md)|English)
-- [Architecture Design](PIPELINE_SERVING.md#1.Architecture_Design)
-- [Detailed Design](PIPELINE_SERVING.md#2.Detailed_Design)
-- [Classic Examples](PIPELINE_SERVING.md#3.Classic_Examples)
-- [Advanced Usages](PIPELINE_SERVING.md#4.Advanced_Usages)
-- [Log Tracing](PIPELINE_SERVING.md#5.Log_Tracing)
-- [Performance Analysis And Optimization](PIPELINE_SERVING.md#6.Performance_analysis_and_optimization)
+- [Architecture Design](PIPELINE_SERVING.md#1architecture-design)
+- [Detailed Design](PIPELINE_SERVING.md#2detailed-design)
+- [Classic Examples](PIPELINE_SERVING.md#3classic-examples)
+- [Advanced Usages](PIPELINE_SERVING.md#4advanced-usages)
+- [Log Tracing](PIPELINE_SERVING.md#5log-tracing)
+- [Performance Analysis And Optimization](PIPELINE_SERVING.md#6performance-analysis-and-optimization)
In many deep learning frameworks, Serving is usually used to deploy a single model. But in the context of industrial AI, end-to-end deep learning models cannot yet solve every problem; multiple deep learning models usually have to work together to solve practical problems. However, designing multi-model applications is complicated. To reduce development and maintenance difficulty while ensuring service availability, serial or simple parallel schemes are usually adopted; in general, throughput then only reaches a usable level and GPU utilization stays low.
......@@ -17,10 +17,9 @@ Paddle Serving provides a user-friendly programming framework for multi-model co
The Server side is built on the <b>RPC Service</b> and the <b>graph execution engine</b>; the relationship between them is shown in the following figure.
-<center>
+<div align=center>
<img src='pipeline_serving-image1.png' height = "250" align="middle"/>
-</center>
+</div>
### 1.1 RPC Service
......@@ -61,9 +60,9 @@ The graph execution engine consists of OPs and Channels, and the connected OPs s
- After request data enters the graph execution engine, the engine generates a Request ID, and the Response is matched and returned through the corresponding Request ID.
- For cases where large data needs to be transferred between OPs, consider an external RAM DB for global storage, and transfer the data by passing only its index key through the Channel (a toy sketch follows the figure below).
-<center>
+<div align=center>
<img src='pipeline_serving-image2.png' height = "300" align="middle"/>
-</center>
+</div>
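The index-key idea can be illustrated with a toy sketch. A plain dict stands in for the external RAM DB here; `ram_db`, `put_large`, and `get_large` are hypothetical helpers, not part of Paddle Serving:

```python
# Toy illustration of passing index keys instead of large payloads.
# A real deployment would back this with an actual external in-memory store.
ram_db = {}  # stand-in for the external RAM DB

def put_large(request_id, payload):
    key = f"feature:{request_id}"
    ram_db[key] = payload   # store the large data globally, once
    return key              # only this small key travels through the Channel

def get_large(key):
    return ram_db[key]      # a downstream OP resolves the key back to data
```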
#### <b>1.2.1 OP Design</b>
......@@ -80,9 +79,9 @@ The graph execution engine consists of OPs and Channels, and the connected OPs s
- Outputs from multiple OPs can be stored in the same Channel, and data from the same Channel can be used by multiple OPs.
- The following illustration shows the design of Channel in the graph execution engine: an input buffer and an output buffer align data across multiple OP inputs and multiple OP outputs, with a queue in the middle for buffering (a toy sketch follows the figure).
-<center>
+<div align=center>
<img src='pipeline_serving-image3.png' height = "500" align="middle"/>
-</center>
+</div>
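To make the buffer/queue alignment concrete, here is a toy, single-process sketch of the Channel idea. It is illustrative only and not Paddle Serving's actual implementation:

```python
import queue
from collections import defaultdict

class ToyChannel:
    """Aligns outputs from several producer OPs by request id."""

    def __init__(self, producer_names, maxsize=8):
        self.producers = set(producer_names)
        self.input_buffer = defaultdict(dict)  # request id -> {op name: data}
        self.q = queue.Queue(maxsize)          # the middle buffering queue

    def push(self, op_name, request_id, data):
        # Hold partial results until every producer OP has contributed.
        self.input_buffer[request_id][op_name] = data
        if set(self.input_buffer[request_id]) == self.producers:
            self.q.put((request_id, self.input_buffer.pop(request_id)))

    def front(self):
        # Consumer OPs take fully aligned items from the queue.
        return self.q.get()
```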
#### <b>1.2.3 client type design</b>
......@@ -111,6 +110,7 @@ The graph execution engine consists of OPs and Channels, and the connected OPs s
- For the input buffer, adjust the concurrency of OP1 and OP2 according to their amount of computation, so that the input buffer lengths from each upstream OP stay relatively balanced. (The input buffer length depends on how fast each item in the internal queue becomes ready.)
- For the output buffer, a similar approach applies: adjust the concurrency of OP3 and OP4 to control the output buffer length. (The output buffer length depends on how fast downstream OPs take data from the output buffer.)
- The amount of data in a Channel will not exceed the gRPC `worker_num`, i.e., the thread pool size. A concurrency sketch follows this list.
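For example, a compute-heavy OP can simply be given more concurrent workers than a lightweight one so that the shared buffers fill at similar rates. A minimal sketch, assuming the `Op` constructor from the public pipeline examples; the endpoints, config paths, and `read_op` are illustrative:

```python
from paddle_serving_server.pipeline import Op

# Two upstream OPs feeding the same Channel; the heavier model gets
# more concurrency so both sides of the input buffer stay balanced.
op1 = Op(name="op1", input_ops=[read_op],
         server_endpoints=["127.0.0.1:9393"],
         fetch_list=["prediction"],
         client_config="op1_client/serving_client_conf.prototxt",
         concurrency=4)  # compute-heavy: more workers
op2 = Op(name="op2", input_ops=[read_op],
         server_endpoints=["127.0.0.1:9292"],
         fetch_list=["prediction"],
         client_config="op2_client/serving_client_conf.prototxt",
         concurrency=1)  # lightweight: one worker is enough
```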
***
## 2.Detailed Design
......@@ -322,9 +322,9 @@ All examples of pipelines are in [examples/pipeline/](../python/examples/pipelin
Here we build a simple imdb model ensemble example to show how to use Pipeline Serving. The relevant code can be found in the `python/examples/pipeline/imdb_model_ensemble` folder. The Server-side structure of the example is shown in the following figure (a condensed server-side sketch follows the figure):
-<center>
+<div align=center>
<img src='pipeline_serving-image4.png' height = "200" align="middle"/>
-</center>
+</div>
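Below is a condensed sketch of the server side of this example, abridged from the example code; ports and config paths are illustrative, `CombineOp` is the user-defined Op discussed in section 2.3, and details may differ across versions:

```python
from paddle_serving_server.pipeline import Op, RequestOp, ResponseOp
from paddle_serving_server.pipeline import PipelineServer

read_op = RequestOp()  # the real example subclasses RequestOp to parse inputs
bow_op = Op(name="bow", input_ops=[read_op],
            server_endpoints=["127.0.0.1:9393"],
            fetch_list=["prediction"],
            client_config="imdb_bow_client_conf/serving_client_conf.prototxt",
            concurrency=1)
cnn_op = Op(name="cnn", input_ops=[read_op],
            server_endpoints=["127.0.0.1:9292"],
            fetch_list=["prediction"],
            client_config="imdb_cnn_client_conf/serving_client_conf.prototxt",
            concurrency=1)
combine_op = CombineOp(name="combine", input_ops=[bow_op, cnn_op],
                       concurrency=1)  # user-defined Op, see section 2.3
response_op = ResponseOp(input_ops=[combine_op])

server = PipelineServer()
server.set_response_op(response_op)  # the DAG is derived from input_ops links
server.prepare_server("config.yml")
server.run_server()
```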
### 3.1 Files required for pipeline deployment
......
......@@ -2,12 +2,12 @@
(Simplified Chinese|[English](PIPELINE_SERVING.md))
-- [Architecture Design](PIPELINE_SERVING_CN.md#1.架构设计)
-- [Detailed Design](PIPELINE_SERVING_CN.md#2.详细设计)
-- [Classic Examples](PIPELINE_SERVING_CN.md#3.典型示例)
-- [Advanced Usages](PIPELINE_SERVING_CN.md#4.高阶用法)
-- [Log Tracing](PIPELINE_SERVING_CN.md#5.日志追踪)
-- [Performance Analysis and Optimization](PIPELINE_SERVING_CN.md#6.性能优化)
+- [Architecture Design](PIPELINE_SERVING_CN.md#1架构设计)
+- [Detailed Design](PIPELINE_SERVING_CN.md#2详细设计)
+- [Classic Examples](PIPELINE_SERVING_CN.md#3典型示例)
+- [Advanced Usages](PIPELINE_SERVING_CN.md#4高阶用法)
+- [Log Tracing](PIPELINE_SERVING_CN.md#5日志追踪)
+- [Performance Analysis and Optimization](PIPELINE_SERVING_CN.md#6性能分析与优化)
In many deep learning frameworks, Serving is usually used for one-click deployment of a single model. In the context of industrial-scale AI, end-to-end deep learning models cannot yet solve every problem, and combining multiple deep learning models remains the usual way to tackle real-world tasks. But multi-model applications are complicated to design; to reduce development and maintenance difficulty while guaranteeing service availability, serial or simple parallel schemes are usually adopted, in which case throughput generally only reaches a usable level and GPU utilization stays low.
......@@ -19,9 +19,9 @@ Paddle Serving provides a user-friendly programming framework for multi-model composite services; Pipeli
The Server side is built on the <b>RPC service layer</b> and the <b>graph execution engine</b>; their relationship is shown in the figure below.
-<center>
+<div align=center>
<img src='pipeline_serving-image1.png' height = "250" align="middle"/>
-</center>
+</div>
......@@ -64,9 +64,9 @@ In the Response, `err_no` and `err_msg` express the correctness of the processing result and the error message, `
- After a Request enters the graph execution engine, a Request ID is generated, and the Response is matched and returned through that Request ID.
- When overly large data must be transferred between OPs, consider an external RAM DB for global storage, and transfer the data by passing only its index key through the Channel.
-<center>
+<div align=center>
<img src='pipeline_serving-image2.png' height = "300" align="middle"/>
-</center>
+</div>
#### <b>1.2.1 OP Design</b>
......@@ -83,9 +83,9 @@ In the Response, `err_no` and `err_msg` express the correctness of the processing result and the error message, `
- A Channel supports storing the outputs of multiple OPs, and the data in one Channel can be consumed by multiple OPs.
- The figure below shows the Channel design in the graph execution engine: an input buffer and an output buffer align the data of multiple OP inputs or multiple OP outputs, with a Queue in the middle for buffering.
-<center>
+<div align=center>
<img src='pipeline_serving-image3.png' height = "500" align="middle"/>
-</center>
+</div>
#### <b>1.2.3 Client Type Design</b>
......@@ -179,9 +179,7 @@ def __init__(name=None,
### 2.3 Overriding OP Pre/Post-processing
The purpose of secondary OP development is to let business developers control the OP's processing strategy (a hedged sketch follows the table below).
| Variable or interface | Description |
......@@ -319,6 +317,7 @@ class ResponseOp(Op):
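A hedged sketch of such an override, modeled on the imdb_model_ensemble example; the exact `preprocess`/`postprocess` signatures vary across Paddle Serving versions, and the forms below follow the examples of this period:

```python
from paddle_serving_server.pipeline import Op

class CombineOp(Op):
    def preprocess(self, input_dicts):
        # input_dicts maps each upstream OP name to its output dict;
        # here we average the predictions of the two model OPs.
        combined = 0
        for op_name, data in input_dicts.items():
            combined += data["prediction"]
        return {"prediction": combined / len(input_dicts)}

    def postprocess(self, input_dicts, fetch_dict):
        # Convert fetched results into the fields returned to the client.
        return {"prediction": str(fetch_dict["prediction"])}
```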
## 3.Classic Examples
All Pipeline examples are under [examples/pipeline/](../python/examples/pipeline); there are currently 7 types of model examples:
- [PaddleClas](../python/examples/pipeline/PaddleClas)
- [Detection](../python/examples/pipeline/PaddleDetection)
- [bert](../python/examples/pipeline/bert)
- [imagenet](../python/examples/pipeline/imagenet)
......@@ -327,9 +326,10 @@ class ResponseOp(Op):
- [simple_web_service](../python/examples/pipeline/simple_web_service)
Take imdb_model_ensemble as an example of how to use Pipeline Serving. The relevant code can be found under the `python/examples/pipeline/imdb_model_ensemble` folder; the Server-side structure of the example is shown in the figure below (a client-side sketch follows the figure):
-<center>
+<div align=center>
<img src='pipeline_serving-image4.png' height = "200" align="middle"/>
-</center>
+</div>
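On the client side, a minimal call looks like the sketch below, abridged from the example's test client; the port is assumed to match `rpc_port` in `config.yml`, and the feed value here is illustrative rather than the example's real preprocessed input:

```python
from paddle_serving_server.pipeline import PipelineClient

client = PipelineClient()
client.connect(["127.0.0.1:18080"])  # assumed rpc_port from config.yml

# The imdb example feeds tokenized review text under the "words" key.
ret = client.predict(feed_dict={"words": "this movie is great"},
                     fetch=["prediction"])
print(ret)  # the Response carries err_no/err_msg plus the fetched values
```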
### 3.1 Files Required for Pipeline Deployment
Five types of files are needed; among them the model files, the configuration file, and the server-side code are the three essentials for building a Pipeline service, while the test client and the test dataset are prepared for testing.
......@@ -642,7 +642,7 @@ Pipeline supports batch inference, and increasing the batch size can improve GPU utilization. Pi
  - Specify a block size to limit the scope of "extremely large" data
- Scenario 3: merge the data of multiple requests for batch inference (auto-batching), as in the sketch below
  - When inference takes noticeably longer than pre/post-processing, merging several requests into one inference pass improves throughput and GPU utilization
  - Requires the data of the merged requests to have the same shape
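A hedged sketch of enabling auto-batching on an OP; the parameter names follow the pipeline examples, the timeout unit is assumed to be milliseconds, and `read_op` plus the endpoint/config values are illustrative:

```python
# Merge up to 16 requests into one inference pass; if fewer arrive,
# stop waiting after the timeout. All merged requests must share the
# same input shapes, as noted above.
cnn_op = Op(name="cnn", input_ops=[read_op],
            server_endpoints=["127.0.0.1:9292"],
            fetch_list=["prediction"],
            client_config="imdb_cnn_client_conf/serving_client_conf.prototxt",
            batch_size=16,
            auto_batching_timeout=500)
```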
| Interface | Description |
| :------------------------------------------: | :-----------------------------------------: |
......