未验证 提交 724bdcbc 编写于 作者: T TeslaZhao 提交者: GitHub

Merge pull request #1478 from TeslaZhao/v0.7.0

cherry-pick
......@@ -2,7 +2,7 @@
<p align="center">
<br>
<img src='doc/serving_logo.png' width = "600" height = "130">
<img src='doc/images/serving_logo.png' width = "600" height = "130">
<br>
<p>
......@@ -47,7 +47,7 @@ We consider deploying deep learning inference service online to be a user-facing
[Serving Examples](./python/examples/).
<p align="center">
<img src="doc/demo.gif" width="700">
<img src="doc/images/demo.gif" width="700">
</p>
......
......@@ -2,7 +2,7 @@
<p align="center">
<br>
<img src='doc/serving_logo.png' width = "600" height = "130">
<img src='doc/images/serving_logo.png' width = "600" height = "130">
<br>
<p>
......@@ -48,7 +48,7 @@ Paddle Serving 旨在帮助深度学习开发者轻易部署在线预测服务
- 提供丰富多彩的前后处理,方便用户在训练、部署等各阶段复用相关代码,弥合AI开发者和应用开发者之间的鸿沟,详情参考[模型示例](./python/examples/)
<p align="center">
<img src="doc/demo.gif" width="700">
<img src="doc/images/demo.gif" width="700">
</p>
<h2 align="center">教程</h2>
......
......@@ -115,7 +115,7 @@ curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"words": "hello"}]
We tested the performance of Bert-As-Service based on Padde Serving based on V100 and compared it with the Bert-As-Service based on Tensorflow. From the perspective of user configuration, we used the same batch size and concurrent number for stress testing. The overall throughput performance data obtained under 4 V100s is as follows.
![4v100_bert_as_service_benchmark](4v100_bert_as_service_benchmark.png)
![4v100_bert_as_service_benchmark](images/4v100_bert_as_service_benchmark.png)
<!--
yum install -y libXext libSM libXrender
......
......@@ -111,4 +111,4 @@ curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"words": "hello"}]
我们基于V100对基于Padde Serving研发的Bert-As-Service的性能进行测试并与基于Tensorflow实现的Bert-As-Service进行对比,从用户配置的角度,采用相同的batch size和并发数进行压力测试,得到4块V100下的整体吞吐性能数据如下。
![4v100_bert_as_service_benchmark](4v100_bert_as_service_benchmark.png)
![4v100_bert_as_service_benchmark](images/4v100_bert_as_service_benchmark.png)
......@@ -88,7 +88,7 @@ this step is not necessary, but it can help you to verify if the model is ready.
```
if you succeed, you will see this
<p align="center">
<img src="cube-cli.png" width="700">
<img src="images/cube-cli.png" width="700">
</p>
If you see that each key has a corresponding value output, it means that the delivery was successful. This file can also be used by Serving to perform cube query in general kv infer op in Serving.
......
......@@ -91,7 +91,7 @@ cd cube
如果执行成功,会看到如下结果
<p align="center">
<img src="cube-cli.png" width="700">
<img src="images/cube-cli.png" width="700">
</p>
......
......@@ -39,7 +39,7 @@ Paddle Serving provides RPC and HTTP protocol for users. For HTTP service, we re
<p align="center">
<br>
<img src='user_groups.png' width = "700" height = "470">
<img src='images/user_groups.png' width = "700" height = "470">
<br>
<p>
......@@ -96,7 +96,7 @@ Distributed Sparse Parameter Indexing is commonly seen in advertising and recomm
<p align="center">
<br>
<img src='cube_eng.png' width = "450" height = "230">
<img src='images/cube_eng.png' width = "450" height = "230">
<br>
<p>
......@@ -116,7 +116,7 @@ The core execution engine of Paddle Serving is a Directed acyclic graph(DAG). In
<p align="center">
<br>
<img src='design_doc.png'">
<img src='images/design_doc.png'">
<br>
<p>
......@@ -132,7 +132,7 @@ After sufficient offline evaluation of the model, online A/B test is usually nee
<p align="center">
<br>
<img src='abtest.png' width = "345" height = "230">
<img src='images/abtest.png' width = "345" height = "230">
<br>
<p>
......@@ -188,7 +188,7 @@ the end-to-end deep learning model can not solve all the problems at present. Us
### 5.1 Network Communication Mechanism
The network framework of Pipeline Serving uses gRPC and gPRC gateway. The gRPC service receives the RPC request, and the gPRC gateway receives the RESTful API request and forwards the request to the gRPC Service through the reverse proxy server. Therefore, the network layer of Pipeline Serving receives both RPC and RESTful API.
<center>
<img src='pipeline_serving-image1.png' height = "250" align="middle"/>
<img src='images/pipeline_serving-image1.png' height = "250" align="middle"/>
</center>
### 5.2 Core Design And Use Cases
......@@ -196,7 +196,7 @@ The network framework of Pipeline Serving uses gRPC and gPRC gateway. The gRPC s
The core design of Pipeline Serving is a graph execution engine, and the basic processing units are OP and Channel. A set of directed acyclic graphs can be realized through combination. Reference for design and use documents《[Pipeline Serving](PIPELINE_SERVING.md)
<center>
<img src='pipeline_serving-image2.png' height = "300" align="middle"/>
<img src='images/pipeline_serving-image2.png' height = "300" align="middle"/>
</center>
----
......
......@@ -42,7 +42,7 @@ Paddle Serving面向的用户提供RPC和HTTP两种访问协议。对于HTTP协
<p align="center">
<br>
<img src='user_groups.png' width = "700" height = "470">
<img src='images/user_groups.png' width = "700" height = "470">
<br>
<p>
......@@ -99,7 +99,7 @@ fetch_var {
为什么要使用Paddle Serving提供的分布式稀疏参数索引服务?1)在一些推荐场景中,模型的输入特征规模通常可以达到上千亿,单台机器无法支撑T级别模型在内存的保存,因此需要进行分布式存储。2)Paddle Serving提供的分布式稀疏参数索引服务,具有并发请求多个节点的能力,从而以较低的延时完成预估服务。
<p align="center">
<br>
<img src='cube_eng.png' width = "450" height = "230">
<img src='images/cube.png' width = "450" height = "230">
<br>
<p>
分布式稀疏参数索引通常在广告推荐中出现,并与分布式训练配合形成完整的离线-在线一体化部署。下图解释了其中的流程,产品的在线服务接受用户请求后将请求发送给预估服务,同时系统会记录用户的请求以进行相应的训练日志处理和拼接。离线分布式训练系统会针对流式产出的训练日志进行模型增量训练,而增量产生的模型会配送至分布式稀疏参数索引服务,同时对应的稠密的模型参数也会配送至在线的预估服务。在线服务由两部分组成,一部分是针对用户的请求提取特征后,将需要进行模型的稀疏参数索引的特征发送请求给分布式稀疏参数索引服务,针对分布式稀疏参数索引服务返回的稀疏参数再进行后续深度学习模型的计算流程,从而完成预估。
......@@ -118,7 +118,7 @@ C++ Serving采用[better-rpc](https://github.com/apache/incubator-brpc)进行底
C++ Serving的核心执行引擎是一个有向无环图,图中的每个节点代表预估服务的一个环节,例如计算模型预测打分就是其中一个环节。有向无环图有利于可并发节点充分利用部署实例内的计算资源,缩短延时。一个例子,当同一份输入需要送入两个不同的模型进行预估,并将两个模型预估的打分进行加权求和时,两个模型的打分过程即可以通过有向无环图的拓扑关系并发。
<p align="center">
<br>
<img src='design_doc.png'">
<img src='images/design_doc.png'">
<br>
<p>
......@@ -136,7 +136,7 @@ Paddle Serving采用对称加密算法对模型进行加密,在服务加载模
<p align="center">
<br>
<img src='abtest.png' width = "345" height = "230">
<img src='images/abtest.png' width = "345" height = "230">
<br>
<p>
......@@ -189,13 +189,13 @@ imdb_service.run_server()
### 5.1 网络框架
Pipeline Serving的网络框架采用gRPC和gPRC gateway。gRPC service接收RPC请求,gPRC gateway接收RESTful API请求通过反向代理服务器将请求转发给gRPC Service。即,Pipeline Serving的网络层同时接收RPC和RESTful API。
<center>
<img src='pipeline_serving-image1.png' height = "250" align="middle"/>
<img src='images/pipeline_serving-image1.png' height = "250" align="middle"/>
</center>
### 5.2 核心设计与使用用例
Pipeline Serving核心设计是图执行引擎,基本处理单元是OP和Channel,通过组合实现一套有向无环图,设计与使用文档参考《[Pipeline Serving设计与实现](PIPELINE_SERVING_CN.md)
<center>
<img src='pipeline_serving-image2.png' height = "300" align="middle"/>
<img src='images/pipeline_serving-image2.png' height = "300" align="middle"/>
</center>
----
......
......@@ -18,7 +18,7 @@
使用gRPC接口,Client端可以在Win/Linux/MacOS平台上调用不同语言。gRPC 接口实现结构如下:
![](https://github.com/PaddlePaddle/Serving/blob/develop/doc/grpc_impl.png)
![](images/grpc_impl.png)
## 1.与bRPC接口对比
......
文件模式从 100755 更改为 100644
# 单卡多模型预测服务
当客户端发送的请求数并不频繁的情况下,会造成服务端机器计算资源尤其是GPU资源的浪费,这种情况下,可以在服务端启动多个预测服务来提高资源利用率。Paddle Serving支持在单张显卡上部署多个预测服务,使用时只需要在启动单个服务时通过--gpu_ids参数将服务与显卡进行绑定,这样就可以将多个服务都绑定到同一张卡上。
例如:
```shell
python -m paddle_serving_server.serve --model bert_seq128_model --port 9292 --gpu_ids 0
python -m paddle_serving_server.serve --model ResNet50_vd_model --port 9393 --gpu_ids 0
```
在卡0上,同时部署了bert示例和iamgenet示例。
**注意:** 单张显卡内部进行推理计算时仍然为串行计算,这种方式是为了减少server端显卡的空闲时间。
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
......@@ -9,7 +9,7 @@ This document shows the concept of computation graph on server. How to define co
Deep neural nets often have some preprocessing steps on input data, and postprocessing steps on model inference scores. Since deep learning frameworks are now very flexible, it is possible to do preprocessing and postprocessing outside the training computation graph. If we want to do input data preprocessing and inference result postprocess on server side, we have to add the corresponding computation logics on server. Moreover, if a user wants to do inference with the same inputs on more than one model, the best way is to do the inference concurrently on server side given only one client request so that we can save some network computation overhead. For the above two reasons, it is naturally to think of a Directed Acyclic Graph(DAG) as the main computation method for server inference. One example of DAG is as follows:
<center>
<img src='server_dag.png' width = "450" height = "500" align="middle"/>
<img src='images/server_dag.png' width = "450" height = "500" align="middle"/>
</center>
## How to define Node
......@@ -19,7 +19,7 @@ Deep neural nets often have some preprocessing steps on input data, and postproc
PaddleServing has some predefined Computation Node in the framework. A very commonly used Computation Graph is the simple reader-inference-response mode that can cover most of the single model inference scenarios. A example graph and the corresponding DAG definition code is as follows.
<center>
<img src='simple_dag.png' width = "260" height = "370" align="middle"/>
<img src='images/simple_dag.png' width = "260" height = "370" align="middle"/>
</center>
``` python
......@@ -51,7 +51,7 @@ python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --po
An example containing multiple input nodes is given in the [MODEL_ENSEMBLE_IN_PADDLE_SERVING](./deprecated/MODEL_ENSEMBLE_IN_PADDLE_SERVING.md). A example graph and the corresponding DAG definition code is as follows.
<center>
<img src='complex_dag.png' width = "480" height = "400" align="middle"/>
<img src='images/complex_dag.png' width = "480" height = "400" align="middle"/>
</center>
```python
......
......@@ -9,7 +9,7 @@
深度神经网络通常在输入数据上有一些预处理步骤,而在模型推断分数上有一些后处理步骤。 由于深度学习框架现在非常灵活,因此可以在训练计算图之外进行预处理和后处理。 如果要在服务器端进行输入数据预处理和推理结果后处理,则必须在服务器上添加相应的计算逻辑。 此外,如果用户想在多个模型上使用相同的输入进行推理,则最好的方法是在仅提供一个客户端请求的情况下在服务器端同时进行推理,这样我们可以节省一些网络计算开销。 由于以上两个原因,自然而然地将有向无环图(DAG)视为服务器推理的主要计算方法。 DAG的一个示例如下:
<center>
<img src='server_dag.png' width = "450" height = "500" align="middle"/>
<img src='images/server_dag.png' width = "450" height = "500" align="middle"/>
</center>
## 如何定义节点
......@@ -18,7 +18,7 @@
PaddleServing在框架中具有一些预定义的计算节点。 一种非常常用的计算图是简单的reader-infer-response模式,可以涵盖大多数单一模型推理方案。 示例图和相应的DAG定义代码如下。
<center>
<img src='simple_dag.png' width = "260" height = "370" align="middle"/>
<img src='images/simple_dag.png' width = "260" height = "370" align="middle"/>
</center>
``` python
......@@ -50,7 +50,7 @@ python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --po
[Paddle Serving中的集成预测](./deprecated/MODEL_ENSEMBLE_IN_PADDLE_SERVING_CN.md)文档中给出了一个包含多个输入节点的样例,示意图和代码如下。
<center>
<img src='complex_dag.png' width = "480" height = "400" align="middle"/>
<img src='images/complex_dag.png' width = "480" height = "400" align="middle"/>
</center>
```python
......
......@@ -32,17 +32,17 @@ ee59a3dd4806 registry.baidubce.com/serving_dev/serving-runtime:cpu-py36
其中我们之前serving容器 以 9393端口暴露,KONG网关的端口是8443, KONG的Web控制台的端口是8001。接下来我们在浏览器访问 `https://$IP_ADDR:8001`, 其中 IP_ADDR就是宿主机的IP。
<img src="kong-dashboard.png">
<img src="images/kong-dashboard.png">
可以看到在注册结束后,登陆,看到了 DASHBOARD,我们先看SERVICES,可以看到`serving_service`,这意味着我们端口在9393的Serving服务已经在KONG当中被注册。
<img src="kong-services.png">
<img src="kong-routes.png">
<img src="images/kong-services.png">
<img src="images/kong-routes.png">
然后在ROUTES中,我们可以看到 serving 被链接到了 `/serving-uci`
最后我们点击 CONSUMERS - default_user - Credentials - API KEYS ,我们可以看到 `Api Keys` 下看到很多key
<img src="kong-api_keys.png">
<img src="images/kong-api_keys.png">
接下来可以通过curl访问
......@@ -194,6 +194,3 @@ credentials:
curl -H "Content-Type:application/json" -H "apikey:ZGVmYXVsdC1hcGlrZXkK" -X POST -d '{"feed":[{"x": [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]}], "fetch":["price"]}' https://$IP:$PORT/foo/uci/prediction -k
```
我们可以看到 apikey 已经加入到了curl请求的header当中。
# Deploy HTTP service with uWSGI
([简体中文](./UWSGI_DEPLOY_CN.md)|English)
In fit_a_line example, after starting the HTTP prediction service, you will see the following information:
```shell
web service address:
http://10.127.3.150:9393/uci/prediction
* Serving Flask app "serve" (lazy loading)
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: off
* Running on http://0.0.0.0:9393/ (Press CTRL+C to quit)
```
Here you will be prompted that the HTTP service started is in development mode and cannot be used for production deployment.
The prediction service started by Flask is not stable enough to withstand the concurrency of a large number of requests. In the actual deployment process, WSGI (Web Server Gateway Interface) is used.
Next, we will show how to use the [uWSGI](https://github.com/unbit/uwsgi) module to deploy HTTP prediction services for production environments.
```python
#uwsgi_service.py
from paddle_serving_server.web_service import WebService
#Define prediction service
uci_service = WebService(name = "uci")
uci_service.load_model_config("./uci_housing_model")
uci_service.prepare_server(workdir="./workdir", port=int(9500), device="cpu")
uci_service.run_rpc_service()
#Get flask application
app_instance = uci_service.get_app_instance()
```
Start service with uWSGI
```bash
uwsgi --http :9393 --module uwsgi_service:app_instance
```
Use the --processes parameter to specify the number of service processes.
For more information about uWSGI, please refer to [uWSGI documentation](https://uwsgi-docs.readthedocs.io/en/latest/)
# 使用uwsgi启动HTTP预测服务
(简体中文|[English](./UWSGI_DEPLOY.md))
在提供的fit_a_line示例中,启动HTTP预测服务后会看到有以下信息:
```shell
web service address:
http://10.127.3.150:9393/uci/prediction
* Serving Flask app "serve" (lazy loading)
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: off
* Running on http://0.0.0.0:9393/ (Press CTRL+C to quit)
```
这里会提示启动的HTTP服务是开发模式,并不能用于生产环境的部署。Flask启动的服务环境不够稳定也无法承受大量请求的并发,实际部署过程中配合需要WSGI(Web Server Gateway Interface)使用。
下面我们展示一下如何使用[uWSGI](https://github.com/unbit/uwsgi)模块来部署HTTP预测服务用于生产环境。
编写HTTP服务脚本
```python
#uwsgi_service.py
from paddle_serving_server.web_service import WebService
#配置预测服务
uci_service = WebService(name = "uci")
uci_service.load_model_config("./uci_housing_model")
uci_service.prepare_server(workdir="./workdir", port=int(9500), device="cpu")
uci_service.run_rpc_service()
#获取flask服务
app_instance = uci_service.get_app_instance()
```
使用uwsgi启动HTTP服务
```bash
uwsgi --http :9393 --module uwsgi_service:app_instance
```
使用--processes参数可以指定服务的进程数。
更多uWSGI的信息请参考[uWSGI使用文档](https://uwsgi-docs.readthedocs.io/en/latest/)
......@@ -4,7 +4,7 @@
This document will use an example of text classification task based on IMDB dataset to show how to build a A/B Test framework using Paddle Serving. The structure relationship between the client and servers in the example is shown in the figure below.
<img src="abtest.png" style="zoom:25%;" />
<img src="images/abtest.png" style="zoom:25%;" />
Note that: A/B Test is only applicable to RPC mode, not web mode.
......
......@@ -4,7 +4,7 @@
该文档将会用一个基于IMDB数据集的文本分类任务的例子,介绍如何使用Paddle Serving搭建A/B Test框架,例中的Client端、Server端结构如下图所示。
<img src="abtest.png" style="zoom:33%;" />
<img src="images/abtest.png" style="zoom:33%;" />
需要注意的是:A/B Test只适用于RPC模式,不适用于WEB模式。
......
......@@ -45,11 +45,11 @@ Models that can be predicted using the Paddle Inference Library, models saved du
### 3.4 Server Inferface
![Server Interface](server_interface.png)
![Server Interface](images/server_interface.png)
### 3.5 Client Interface
<img src='client_inferface.png' width = "600" height = "200">
<img src='images/client_inferface.png' width = "600" height = "200">
### 3.6 Client io used during Training
......@@ -66,7 +66,7 @@ def save_model(server_model_folder,
## 4. Paddle Serving Underlying Framework
![Paddle-Serging Overall Architecture](framework.png)
![Paddle-Serging Overall Architecture](images/framework.png)
**Model Management Framework**: Connects model files of multiple machine learning platforms and provides a unified inference interface
**Business Scheduling Framework**: Abstracts the calculation logic of various different inference models, provides a general DAG scheduling framework, and connects different operators through DAG diagrams to complete a prediction service together. This abstract model allows users to conveniently implement their own calculation logic, and at the same time facilitates operator sharing. (Users build their own forecasting services. A large part of their work is to build DAGs and provide operators.)
......@@ -102,18 +102,18 @@ class FluidFamilyCore {
With reference to the abstract idea of model calculation of the TensorFlow framework, the business logic is abstracted into a DAG diagram, driven by configuration, generating a workflow, and skipping C ++ code compilation. Each specific step of the service corresponds to a specific OP. The OP can configure the upstream OP that it depends on. Unified message passing between OPs is achieved by the thread-level bus and channel mechanisms. For example, the service process of a simple prediction service can be abstracted into 3 steps including reading request data-> calling the prediction interface-> writing back the prediction result, and correspondingly implemented to 3 OP: ReaderOp-> ClassifyOp-> WriteOp
![Infer Service](predict-service.png)
![Infer Service](images/predict-service.png)
Regarding the dependencies between OPs, and the establishment of workflows through OPs, you can refer to [从零开始写一个预测服务](CREATING.md) (simplified Chinese Version)
Server instance perspective
![Server instance perspective](server-side.png)
![Server instance perspective](images/server-side.png)
#### 4.2.2 Paddle Serving Multi-Service Mechanism
![Paddle Serving multi-service](multi-service.png)
![Paddle Serving multi-service](images/multi-service.png)
Paddle Serving instances can load multiple models at the same time, and each model uses a Service (and its configured workflow) to undertake services. You can refer to [service configuration file in Demo example](../tools/cpp_examples/demo-serving/conf/service.prototxt) to learn how to configure multiple services for the serving instance
......@@ -121,12 +121,12 @@ Paddle Serving instances can load multiple models at the same time, and each mod
From the client's perspective, a Paddle Serving service can be divided into three levels: Service, Endpoint, and Variant from top to bottom.
![Call hierarchy relationship](multi-variants.png)
![Call hierarchy relationship](images/multi-variants.png)
One Service corresponds to one inference model, and there is one endpoint under the model. Different versions of the model are implemented through multiple variant concepts under endpoint:
The same model prediction service can configure multiple variants, and each variant has its own downstream IP list. The client code can configure relative weights for each variant to achieve the relationship of adjusting the traffic ratio (refer to the description of variant_weight_list in [Client Configuration](CLIENT_CONFIGURE.md) section 3.2).
![Client-side proxy function](client-side-proxy.png)
![Client-side proxy function](images/client-side-proxy.png)
## 5. User Interface
......
......@@ -47,11 +47,11 @@ PaddlePaddle是百度开源的机器学习框架,广泛支持各种深度学
### 3.4 Server Inferface
![Server Interface](server_interface.png)
![Server Interface](images/server_interface.png)
### 3.5 Client Interface
<img src='client_inferface.png' width = "600" height = "200">
<img src='images/client_inferface.png' width = "600" height = "200">
### 3.6 训练过程中使用的Client io
......@@ -68,7 +68,7 @@ def save_model(server_model_folder,
## 4. Paddle Serving底层框架
![Paddle-Serging总体框图](framework.png)
![Paddle-Serging总体框图](images/framework.png)
**模型管理框架**:对接多种机器学习平台的模型文件,向上提供统一的inference接口
**业务调度框架**:对各种不同预测模型的计算逻辑进行抽象,提供通用的DAG调度框架,通过DAG图串联不同的算子,共同完成一次预测服务。该抽象模型使用户可以方便的实现自己的计算逻辑,同时便于算子共用。(用户搭建自己的预测服务,很大一部分工作是搭建DAG和提供算子的实现)
......@@ -104,18 +104,18 @@ class FluidFamilyCore {
参考TF框架的模型计算的抽象思想,将业务逻辑抽象成DAG图,由配置驱动,生成workflow,跳过C++代码编译。业务的每个具体步骤,对应一个具体的OP,OP可配置自己依赖的上游OP。OP之间消息传递统一由线程级Bus和channel机制实现。例如,一个简单的预测服务的服务过程,可以抽象成读请求数据->调用预测接口->写回预测结果等3个步骤,相应的实现到3个OP: ReaderOp->ClassifyOp->WriteOp
![预测服务Service](predict-service.png)
![预测服务Service](images/predict-service.png)
关于OP之间的依赖关系,以及通过OP组建workflow,可以参考[从零开始写一个预测服务](CREATING.md)的相关章节
服务端实例透视图
![服务端实例透视图](server-side.png)
![服务端实例透视图](images/server-side.png)
#### 4.2.2 Paddle Serving的多服务机制
![Paddle Serving的多服务机制](multi-service.png)
![Paddle Serving的多服务机制](images/multi-service.png)
Paddle Serving实例可以同时加载多个模型,每个模型用一个Service(以及其所配置的workflow)承接服务。可以参考[Demo例子中的service配置文件](../tools/cpp_examples/demo-serving/conf/service.prototxt)了解如何为serving实例配置多个service
......@@ -123,12 +123,12 @@ Paddle Serving实例可以同时加载多个模型,每个模型用一个Servic
从客户端看,一个Paddle Serving service从顶向下可分为Service, Endpoint, Variant等3个层级
![调用层级关系](multi-variants.png)
![调用层级关系](images/multi-variants.png)
一个Service对应一个预测模型,模型下有1个endpoint。模型的不同版本,通过endpoint下多个variant概念实现:
同一个模型预测服务,可以配置多个variant,每个variant有自己的下游IP列表。客户端代码可以对各个variant配置相对权重,以达到调节流量比例的关系(参考[客户端配置](CLIENT_CONFIGURE.md)第3.2节中关于variant_weight_list的说明)。
![Client端proxy功能](client-side-proxy.png)
![Client端proxy功能](images/client-side-proxy.png)
## 5. 用户接口
......
此差异已折叠。
......@@ -26,7 +26,7 @@
第1) - 第5)步裁剪完毕后的模型网络配置如下:
![Pruned CTR prediction network](../pruned-ctr-network.png)
![Pruned CTR prediction network](../images/pruned-ctr-network.png)
整个裁剪过程具体说明如下:
......
......@@ -10,7 +10,7 @@ Next, we will take the text classification task as an example to show model ense
In this example (see the figure below), the server side predict the bow and CNN models with the same input in a service in parallel, The client side fetchs the prediction results of the two models, and processes the prediction results to get the final predict results.
![simple example](../model_ensemble_example.png)
![simple example](../images/model_ensemble_example.png)
It should be noted that at present, only multiple models with the same format input and output in the same service are supported. In this example, the input and output formats of CNN and BOW model are the same.
......
......@@ -10,7 +10,7 @@
该样例中(见下图),Server端在一项服务中并行预测相同输入的BOW和CNN模型,Client端获取两个模型的预测结果并进行后处理,得到最终的预测结果。
![simple example](../model_ensemble_example.png)
![simple example](../images/model_ensemble_example.png)
需要注意的是,目前只支持在同一个服务中使用多个相同格式输入输出的模型。在该例子中,CNN模型和BOW模型的输入输出格式是相同的。
......
# Multiple Serving Instances over Single GPU Card
Paddle Serving依托PaddlePaddle预测库执行实际的预测计算。由于当前GPU预测库的限制,单个Serving实例只可以绑定1张GPU卡,且进程内所有worker线程共用1个GPU stream。也就是说,不管Serving启动多少个worker线程,所有的请求在GPU是严格串行计算的,起不到加速作用。这会带来一个问题,就是如果模型计算量不大,那么Serving进程实际上不会用满GPU的算力。
为了充分利用GPU卡的算力,考虑在单张卡上启动多个Serving实例,通过多个GPU stream,力争用满GPU的算力。启动命令可以如下所示:
```
bin/serving --gpuid=0 --bthread_concurrency=4 --bthread_min_concurrency=4 --port=8010&
bin/serving --gpuid=0 --bthread_concurrency=4 --bthread_min_concurrency=4 --port=8011&
```
上述2条命令,启动2个Serving实例,分别监听8010端口和8011端口。但他们都绑定同一张卡 (gpuid = 0)。
命令行参数含义:
```
-gpuid=N:用于指定所绑定的GPU卡ID
-bthread_concurrency和bthread_min_concurrency共同限制该进程启动的worker数:由于在GPU预测模式下,增加worker线程数并不能提高并发能力,为了节省部分资源,干脆将他们限制掉;均设为4,是因为这是bthread允许的最小值。
-port xxx:Serving实例监听的端口
```
但是,上述方式究竟是否能在不影响响应时间等其他指标的前提下,起到提高GPU使用率作用,受到多个限制因素的制约,具体的:
1. 单个stream占用GPU算力;假如单个stream已经将GPU算力占用超过50%,那么增加stream很可能会导致2个stream的job分别排队,拖慢各自的响应时间
2. GPU显存:Serving进程需要将模型参数加载到显存中,并且计算时要在GPU显存池分配临时变量;假如单个Serving进程已经用掉超过50%的显存,则增加Serving进程会造成显存不足,导致进程报错退出
为此,可采用如下步骤,进行测试:
1. 加载模型时,在model_toolkit.prototxt中,model type选择FLUID_GPU_ANALYSIS或FLUID_GPU_ANALYSIS_DIR;会对模型进行静态分析,进行一定程度显存优化
2. 在步骤1完成后,启动单个Serving进程,启动参数:`--gpuid=N --bthread_concurrency=4 --bthread_min_concurrency=4`;启动一个client,进行并发度为1的压力测试,batch size从小到大,记下平响;由于算力的限制,当batch size增大到一定程度,应该会出现响应时间明显变大;或虽然没有明显变大,但已经不满足系统需求
3. 再启动1个Serving进程,与步骤2启动时使用相同的参数略有不同: `--gpuid=N --bthread_concurrency=4 --bthread_min_concurrency=4 --port=8011` 其中--port=8011用来让新启动的进程使用一个新的服务端口;然后同时对这2个Serving进程进行压测,继续观察batch size从小到大时平均响应时间的变化,直到取得batch size和响应时间的折中
4. 重复步骤2-3
5. 以2-4步的测试,来决定:单张GPU卡可以由多少个Serving进程共用; 实际部署时,就在一张GPU卡上启动这么多个Serving进程同时提供服务
# How to develop a new Web service?
([简体中文](NEW_WEB_SERVICE_CN.md)|English)
This document will take the image classification service based on the Imagenet data set as an example to introduce how to develop a new web service. The complete code can be visited at [here](../../python/examples/imagenet/resnet50_web_service.py).
## WebService base class
Paddle Serving implements the [WebService](https://github.com/PaddlePaddle/Serving/blob/develop/python/paddle_serving_server/web_service.py#L23) base class. You need to override its `preprocess` and `postprocess` method. The default implementation is as follows:
```python
class WebService(object):
def preprocess(self, feed={}, fetch=[]):
return feed, fetch
def postprocess(self, feed={}, fetch=[], fetch_map=None):
return fetch_map
```
### preprocess
The preprocess method has two input parameters, `feed` and `fetch`. For an HTTP request `request`:
- The value of `feed` is the feed part `request.json["feed"]` in the request data
- The value of `fetch` is the fetch part `request.json["fetch"]` in the request data
The return values are the feed and fetch values used in the prediction.
### postprocess
The postprocess method has three input parameters, `feed`, `fetch` and `fetch_map`:
- The value of `feed` is the feed part `request.json["feed"]` in the request data
- The value of `fetch` is the fetch part `request.json["fetch"]` in the request data
- The value of `fetch_map` is the model output value.
The return value will be processed as `{"reslut": fetch_map}` as the return of the HTTP request.
## Develop ImageService class
```python
class ImageService(WebService):
def preprocess(self, feed={}, fetch=[]):
reader = ImageReader()
feed_batch = []
for ins in feed:
if "image" not in ins:
raise ("feed data error!")
sample = base64.b64decode(ins["image"])
img = reader.process_image(sample)
feed_batch.append({"image": img})
return feed_batch, fetch
```
For the above `ImageService`, only the `preprocess` method is rewritten to process the image data in Base64 format into the data format required by prediction.
# 如何开发一个新的Web Service?
(简体中文|[English](NEW_WEB_SERVICE.md))
本文档将以Imagenet图像分类服务为例,来介绍如何开发一个新的Web Service。您可以在[这里](../../python/examples/imagenet/resnet50_web_service.py)查阅完整的代码。
## WebService基类
Paddle Serving实现了[WebService](https://github.com/PaddlePaddle/Serving/blob/develop/python/paddle_serving_server/web_service.py#L23)基类,您需要重写它的`preprocess`方法和`postprocess`方法,默认实现如下:
```python
class WebService(object):
def preprocess(self, feed={}, fetch=[]):
return feed, fetch
def postprocess(self, feed={}, fetch=[], fetch_map=None):
return fetch_map
```
### preprocess方法
preprocess方法有两个输入参数,`feed``fetch`。对于一个HTTP请求`request`
- `feed`的值为请求数据中的feed部分`request.json["feed"]`
- `fetch`的值为请求数据中的fetch部分`request.json["fetch"]`
返回值分别是预测过程中用到的feed和fetch值。
### postprocess方法
postprocess方法有三个输入参数,`feed``fetch``fetch_map`
- `feed`的值为请求数据中的feed部分`request.json["feed"]`
- `fetch`的值为请求数据中的fetch部分`request.json["fetch"]`
- `fetch_map`的值为fetch到的模型输出值
返回值将会被处理成`{"reslut": fetch_map}`作为HTTP请求的返回。
## 开发ImageService类
```python
class ImageService(WebService):
def preprocess(self, feed={}, fetch=[]):
reader = ImageReader()
feed_batch = []
for ins in feed:
if "image" not in ins:
raise ("feed data error!")
sample = base64.b64decode(ins["image"])
img = reader.process_image(sample)
feed_batch.append({"image": img})
return feed_batch, fetch
```
对于上述的`ImageService`,只重写了前处理方法,将base64格式的图片数据处理成模型预测需要的数据格式。
BERT_10_MINS.md
ABTEST_IN_PADDLE_SERVING.md
......@@ -18,7 +18,7 @@ Paddle Serving provides a user-friendly programming framework for multi-model co
The Server side is built based on <b>RPC Service</b> and <b>graph execution engine</b>. The relationship between them is shown in the following figure.
<div align=center>
<img src='pipeline_serving-image1.png' height = "250" align="middle"/>
<img src='images/pipeline_serving-image1.png' height = "250" align="middle"/>
</div>
### 1.1 RPC Service
......@@ -61,7 +61,7 @@ The graph execution engine consists of OPs and Channels, and the connected OPs s
- For cases where large data needs to be transferred between OPs, consider RAM DB external memory for global storage and data transfer by passing index keys in Channel.
<div align=center>
<img src='pipeline_serving-image2.png' height = "300" align="middle"/>
<img src='images/pipeline_serving-image2.png' height = "300" align="middle"/>
</div>
......@@ -80,7 +80,7 @@ The graph execution engine consists of OPs and Channels, and the connected OPs s
- The following illustration shows the design of Channel in the graph execution engine, using input buffer and output buffer to align data between multiple OP inputs and multiple OP outputs, with a queue in the middle to buffer.
<div align=center>
<img src='pipeline_serving-image3.png' height = "500" align="middle"/>
<img src='images/pipeline_serving-image3.png' height = "500" align="middle"/>
</div>
......@@ -323,7 +323,7 @@ All examples of pipelines are in [examples/pipeline/](../python/examples/pipelin
Here, we build a simple imdb model enable example to show how to use Pipeline Serving. The relevant code can be found in the `python/examples/pipeline/imdb_model_ensemble` folder. The Server-side structure in the example is shown in the following figure:
<div align=center>
<img src='pipeline_serving-image4.png' height = "200" align="middle"/>
<img src='images/pipeline_serving-image4.png' height = "200" align="middle"/>
</div>
### 3.1 Files required for pipeline deployment
......
......@@ -20,7 +20,7 @@ Paddle Serving提供了用户友好的多模型组合服务编程框架,Pipeli
Server端基于<b>RPC服务层</b><b>图执行引擎</b>构建,两者的关系如下图所示。
<div align=center>
<img src='pipeline_serving-image1.png' height = "250" align="middle"/>
<img src='images/pipeline_serving-image1.png' height = "250" align="middle"/>
</div>
</n>
......@@ -65,7 +65,7 @@ Response中`err_no`和`err_msg`表达处理结果的正确性和错误信息,`
- 对于 OP 之间需要传输过大数据的情况,可以考虑 RAM DB 外存进行全局存储,通过在 Channel 中传递索引的 Key 来进行数据传输
<div align=center>
<img src='pipeline_serving-image2.png' height = "300" align="middle"/>
<img src='images/pipeline_serving-image2.png' height = "300" align="middle"/>
</div>
......@@ -84,7 +84,7 @@ Response中`err_no`和`err_msg`表达处理结果的正确性和错误信息,`
- 下图为图执行引擎中 Channel 的设计,采用 input buffer 和 output buffer 进行多 OP 输入或多 OP 输出的数据对齐,中间采用一个 Queue 进行缓冲
<div align=center>
<img src='pipeline_serving-image3.png' height = "500" align="middle"/>
<img src='images/pipeline_serving-image3.png' height = "500" align="middle"/>
</div>
#### <b>1.2.3 预测类型的设计</b>
......@@ -328,7 +328,7 @@ class ResponseOp(Op):
以 imdb_model_ensemble 为例来展示如何使用 Pipeline Serving,相关代码在 `python/examples/pipeline/imdb_model_ensemble` 文件夹下可以找到,例子中的 Server 端结构如下图所示:
<div align=center>
<img src='pipeline_serving-image4.png' height = "200" align="middle"/>
<img src='images/pipeline_serving-image4.png' height = "200" align="middle"/>
</div>
### 3.1 Pipeline部署需要的文件
......
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册