未验证 提交 1410fea6 编写于 作者: J Jiawei Wang 提交者: GitHub

Merge pull request #1511 from bjjwwang/v0.7.0

cherry-pick 23 PR
此差异已折叠。
此差异已折叠。
...@@ -54,7 +54,7 @@ Hwvideoframe provides a variety of data preprocessing methods for photo preproce ...@@ -54,7 +54,7 @@ Hwvideoframe provides a variety of data preprocessing methods for photo preproce
## Quick start ## Quick start
[After compiling from code](https://github.com/PaddlePaddle/Serving/blob/develop/doc/COMPILE.md),this project will be stored in reader。 [After compiling from code](../../../doc/Compile_EN.md),this project will be stored in reader。
## How to Test ## How to Test
......
## Build Bert-As-Service in 10 minutes
([简体中文](./BERT_10_MINS_CN.md)|English)
The goal of Bert-As-Service is to give a sentence, and the service can represent the sentence as a semantic vector and return it to the user. [Bert model](https://arxiv.org/abs/1810.04805) is a popular model in the current NLP field. It has achieved good results on a variety of public NLP tasks. The semantic vector calculated by the Bert model is used as input to other NLP models, which will also greatly improve the performance of the model. Bert-As-Service allows users to easily obtain the semantic vector representation of text and apply it to their own tasks. In order to achieve this goal, we have shown in five steps that using Paddle Serving can build such a service in ten minutes. All the code and files in the example can be found in [Example](https://github.com/PaddlePaddle/Serving/tree/develop/python/examples/bert) of Paddle Serving.
If your python version is 3.X, replace the 'pip' field in the following command with 'pip3',replace 'python' with 'python3'.
### Step1: Getting Model
#### method 1:
This example use model [BERT Chinese Model](https://www.paddlepaddle.org.cn/hubdetail?name=bert_chinese_L-12_H-768_A-12&en_category=SemanticModel) from [Paddlehub](https://github.com/PaddlePaddle/PaddleHub).
Install paddlehub first
```
pip install paddlehub
```
run
```
python prepare_model.py 128
```
**PaddleHub only support Python 3.5+**
the 128 in the command above means max_seq_len in BERT model, which is the length of sample after preprocessing.
the config file and model file for server side are saved in the folder bert_seq128_model.
the config file generated for client side is saved in the folder bert_seq128_client.
#### method 2:
You can also download the above model from BOS(max_seq_len=128). After decompression, the config file and model file for server side are stored in the bert_chinese_L-12_H-768_A-12_model folder, and the config file generated for client side is stored in the bert_chinese_L-12_H-768_A-12_client folder:
```shell
wget https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SemanticModel/bert_chinese_L-12_H-768_A-12.tar.gz
tar -xzf bert_chinese_L-12_H-768_A-12.tar.gz
mv bert_chinese_L-12_H-768_A-12_model bert_seq128_model
mv bert_chinese_L-12_H-768_A-12_client bert_seq128_client
```
### Step2: Getting Dict and Sample Dataset
```
sh get_data.sh
```
this script will download Chinese Dictionary File vocab.txt and Chinese Sample Data data-c.txt
### Step3: Launch Service
start cpu inference service,Run
```
python -m paddle_serving_server.serve --model bert_seq128_model/ --port 9292 #cpu inference service
```
Or,start gpu inference service,Run
```
python -m paddle_serving_server.serve --model bert_seq128_model/ --port 9292 --gpu_ids 0 #launch gpu inference service at GPU 0
```
| Parameters | Meaning |
| ---------- | ---------------------------------------- |
| model | server configuration and model file path |
| thread | server-side threads |
| port | server port number |
| gpu_ids | GPU index number |
### Step4: data preprocessing logic on Client Side
Paddle Serving has many built-in corresponding data preprocessing logics. For the calculation of Chinese Bert semantic representation, we use the ChineseBertReader class under paddle_serving_app for data preprocessing. Model input fields of multiple models corresponding to a raw Chinese sentence can be easily fetched by developers
Install paddle_serving_app
```shell
pip install paddle_serving_app
```
### Step5: Client Visit Serving
#### method 1: RPC Inference
Run
```
head data-c.txt | python bert_client.py --model bert_seq128_client/serving_client_conf.prototxt
```
the client reads data from data-c.txt and send prediction request, the prediction is given by word vector. (Due to massive data in the word vector, we do not print it).
#### method 2: HTTP Inference
This method is divided into two steps:
1. Start an HTTP prediction server.
start cpu HTTP inference service,Run
```
python bert_web_service.py bert_seq128_model/ 9292 #launch cpu inference service
```
Or,start gpu HTTP inference service,Run
```
export CUDA_VISIBLE_DEVICES=0,1
```
set environmental variable to specify which gpus are used, the command above means gpu 0 and gpu 1 is used.
```
python bert_web_service_gpu.py bert_seq128_model/ 9292 #launch gpu inference service
```
2. Prediction via HTTP request
```
curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"words": "hello"}], "fetch":["pooled_output"]}' http://127.0.0.1:9292/bert/prediction
```
### Benchmark
We tested the performance of Bert-As-Service based on Padde Serving based on V100 and compared it with the Bert-As-Service based on Tensorflow. From the perspective of user configuration, we used the same batch size and concurrent number for stress testing. The overall throughput performance data obtained under 4 V100s is as follows.
![4v100_bert_as_service_benchmark](images/4v100_bert_as_service_benchmark.png)
<!--
yum install -y libXext libSM libXrender
pip install paddlehub paddle_serving_server paddle_serving_client
sh pip_app.sh
python bert_10.py
sh server.sh &
wget https://paddle-serving.bj.bcebos.com/bert_example/data-c.txt --no-check-certificate
head -n 500 data-c.txt > data.txt
cat data.txt | python bert_client.py
if [[ $? -eq 0 ]]; then
echo "test success"
else
echo "test fail"
fi
ps -ef | grep "paddle_serving_server" | grep -v grep | awk '{print $2}' | xargs kill
-->
## 十分钟构建Bert-As-Service
(简体中文|[English](./BERT_10_MINS.md))
Bert-As-Service的目标是给定一个句子,服务可以将句子表示成一个语义向量返回给用户。[Bert模型](https://arxiv.org/abs/1810.04805)是目前NLP领域的热门模型,在多种公开的NLP任务上都取得了很好的效果,使用Bert模型计算出的语义向量来做其他NLP模型的输入对提升模型的表现也有很大的帮助。Bert-As-Service可以让用户很方便地获取文本的语义向量表示并应用到自己的任务中。为了实现这个目标,我们通过以下几个步骤说明使用Paddle Serving在十分钟内就可以搭建一个这样的服务。示例中所有的代码和文件均可以在Paddle Serving的[示例](https://github.com/PaddlePaddle/Serving/tree/develop/python/examples/bert)中找到。
若使用python的版本为3.X, 将以下命令中的pip 替换为pip3, python替换为python3.
### Step1:获取模型
#### 方法1:
示例中采用[Paddlehub](https://github.com/PaddlePaddle/PaddleHub)中的[BERT中文模型](https://www.paddlepaddle.org.cn/hubdetail?name=bert_chinese_L-12_H-768_A-12&en_category=SemanticModel)
请先安装paddlehub
```
pip install paddlehub
```
执行
```
python prepare_model.py 128
```
参数128表示BERT模型中的max_seq_len,即预处理后的样本长度。
生成server端配置文件与模型文件,存放在bert_seq128_model文件夹。
生成client端配置文件,存放在bert_seq128_client文件夹。
#### 方法2:
您也可以从bos上直接下载上述模型(max_seq_len=128),解压后server端配置文件与模型文件存放在bert_chinese_L-12_H-768_A-12_model文件夹,client端配置文件存放在bert_chinese_L-12_H-768_A-12_client文件夹:
```shell
wget https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SemanticModel/bert_chinese_L-12_H-768_A-12.tar.gz
tar -xzf bert_chinese_L-12_H-768_A-12.tar.gz
mv bert_chinese_L-12_H-768_A-12_model bert_seq128_model
mv bert_chinese_L-12_H-768_A-12_client bert_seq128_client
```
### Step2:获取词典和样例数据
```
sh get_data.sh
```
脚本将下载中文词典vocab.txt和中文样例数据data-c.txt
### Step3:启动服务
启动cpu预测服务,执行
```
python -m paddle_serving_server.serve --model bert_seq128_model/ --port 9292 #启动cpu预测服务
```
或者,启动gpu预测服务,执行
```
python -m paddle_serving_server.serve --model bert_seq128_model/ --port 9292 --gpu_ids 0 #在gpu 0上启动gpu预测服务
```
| 参数 | 含义 |
| ------- | -------------------------- |
| model | server端配置与模型文件路径 |
| thread | server端线程数 |
| port | server端端口号 |
| gpu_ids | GPU索引号 |
### Step4:客户端数据预处理逻辑
Paddle Serving内建了很多经典典型对应的数据预处理逻辑,对于中文Bert语义表示的计算,我们采用paddle_serving_app下的ChineseBertReader类进行数据预处理,开发者可以很容易获得一个原始的中文句子对应的多个模型输入字段。
安装paddle_serving_app
```shell
pip install paddle_serving_app
```
### Step5:客户端访问
#### 方法1:通过RPC方式执行预测
执行
```
head data-c.txt | python bert_client.py --model bert_seq128_client/serving_client_conf.prototxt
```
启动client读取data-c.txt中的数据进行预测,预测结果为文本的向量表示(由于数据较多,脚本中没有将输出进行打印),server端的地址在脚本中修改。
#### 方法2:通过HTTP方式执行预测
该方式分为两步
1、启动一个HTTP预测服务端。
启动cpu HTTP预测服务,执行
```
python bert_web_service.py bert_seq128_model/ 9292 #启动CPU预测服务
```
或者,启动gpu HTTP预测服务,执行
```
export CUDA_VISIBLE_DEVICES=0,1
```
通过环境变量指定gpu预测服务使用的gpu,示例中指定索引为0和1的两块gpu
```
python bert_web_service_gpu.py bert_seq128_model/ 9292 #启动gpu预测服务
```
2、通过HTTP请求执行预测。
```
curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"words": "hello"}], "fetch":["pooled_output"]}' http://127.0.0.1:9292/bert/prediction
```
### 性能测试
我们基于V100对基于Padde Serving研发的Bert-As-Service的性能进行测试并与基于Tensorflow实现的Bert-As-Service进行对比,从用户配置的角度,采用相同的batch size和并发数进行压力测试,得到4块V100下的整体吞吐性能数据如下。
![4v100_bert_as_service_benchmark](images/4v100_bert_as_service_benchmark.png)
# 如何使用Paddle Serving做ABTEST # 如何使用Paddle Serving做ABTEST
(简体中文|[English](./ABTEST_IN_PADDLE_SERVING.md)) (简体中文|[English](./ABTest_EN.md))
该文档将会用一个基于IMDB数据集的文本分类任务的例子,介绍如何使用Paddle Serving搭建A/B Test框架,例中的Client端、Server端结构如下图所示。 该文档将会用一个基于IMDB数据集的文本分类任务的例子,介绍如何使用Paddle Serving搭建A/B Test框架,例中的Client端、Server端结构如下图所示。
<img src="images/abtest.png" style="zoom:33%;" /> <img src="../images/abtest.png" style="zoom:33%;" />
需要注意的是:A/B Test只适用于RPC模式,不适用于WEB模式。 需要注意的是:A/B Test只适用于RPC模式,不适用于WEB模式。
### 下载数据以及模型 ### 下载数据以及模型
``` shell ``` shell
cd Serving/python/examples/imdb cd Serving/examples/C++/imdb
sh get_data.sh sh get_data.sh
``` ```
...@@ -24,13 +24,13 @@ pip install Shapely ...@@ -24,13 +24,13 @@ pip install Shapely
```` ````
您可以直接运行下面的命令来处理数据。 您可以直接运行下面的命令来处理数据。
[python abtest_get_data.py](../python/examples/imdb/abtest_get_data.py) [python abtest_get_data.py](../../examples/C++/imdb/abtest_get_data.py)
文件中的Python代码将处理`test_data/part-0`的数据,并将处理后的数据生成并写入`processed.data`文件中。 文件中的Python代码将处理`test_data/part-0`的数据,并将处理后的数据生成并写入`processed.data`文件中。
### 启动Server端 ### 启动Server端
这里采用[Docker方式](RUN_IN_DOCKER_CN.md)启动Server端服务。 这里采用[Docker方式](../Run_In_Docker_CN.md)启动Server端服务。
首先启动BOW Server,该服务启用`8000`端口: 首先启动BOW Server,该服务启用`8000`端口:
...@@ -62,7 +62,7 @@ exit ...@@ -62,7 +62,7 @@ exit
您可以直接使用下面的命令,进行ABTEST预测。 您可以直接使用下面的命令,进行ABTEST预测。
[python abtest_client.py](../python/examples/imdb/abtest_client.py) [python abtest_client.py](../../examples/C++/imdb/abtest_client.py)
```python ```python
from paddle_serving_client import Client from paddle_serving_client import Client
......
# ABTEST in Paddle Serving # ABTEST in Paddle Serving
([简体中文](./ABTEST_IN_PADDLE_SERVING_CN.md)|English) ([简体中文](./ABTest_CN.md)|English)
This document will use an example of text classification task based on IMDB dataset to show how to build a A/B Test framework using Paddle Serving. The structure relationship between the client and servers in the example is shown in the figure below. This document will use an example of text classification task based on IMDB dataset to show how to build a A/B Test framework using Paddle Serving. The structure relationship between the client and servers in the example is shown in the figure below.
<img src="images/abtest.png" style="zoom:25%;" /> <img src="../images/abtest.png" style="zoom:25%;" />
Note that: A/B Test is only applicable to RPC mode, not web mode. Note that: A/B Test is only applicable to RPC mode, not web mode.
### Download Data and Models ### Download Data and Models
```shell ```shell
cd Serving/python/examples/imdb cd Serving/examples/C++/imdb
sh get_data.sh sh get_data.sh
``` ```
...@@ -25,13 +25,13 @@ pip install Shapely ...@@ -25,13 +25,13 @@ pip install Shapely
You can directly run the following command to process the data. You can directly run the following command to process the data.
[python abtest_get_data.py](../python/examples/imdb/abtest_get_data.py) [python abtest_get_data.py](../../examples/C++/imdb/abtest_get_data.py)
The Python code in the file will process the data `test_data/part-0` and write to the `processed.data` file. The Python code in the file will process the data `test_data/part-0` and write to the `processed.data` file.
### Start Server ### Start Server
Here, we [use docker](RUN_IN_DOCKER.md) to start the server-side service. Here, we [use docker](../Run_In_Docker_EN.md) to start the server-side service.
First, start the BOW server, which enables the `8000` port: First, start the BOW server, which enables the `8000` port:
...@@ -63,7 +63,7 @@ Before running, use `pip install paddle-serving-client` to install the paddle-se ...@@ -63,7 +63,7 @@ Before running, use `pip install paddle-serving-client` to install the paddle-se
You can directly use the following command to make abtest prediction. You can directly use the following command to make abtest prediction.
[python abtest_client.py](../python/examples/imdb/abtest_client.py) [python abtest_client.py](../../examples/C++/imdb/abtest_client.py)
[//file]:#abtest_client.py [//file]:#abtest_client.py
``` python ``` python
...@@ -103,8 +103,8 @@ Due to different network conditions, the results of each prediction may be sligh ...@@ -103,8 +103,8 @@ Due to different network conditions, the results of each prediction may be sligh
``` ```
<!-- <!--
cp ../Serving/python/examples/imdb/get_data.sh . cp ../../examples/C++/imdb/get_data.sh .
cp ../Serving/python/examples/imdb/imdb_reader.py . cp ../../examples/C++/imdb/imdb_reader.py .
pip install -U paddle_serving_server pip install -U paddle_serving_server
pip install -U paddle_serving_client pip install -U paddle_serving_client
pip install -U paddlepaddle pip install -U paddlepaddle
......
# C++ Serving vs TensorFlow Serving 性能对比
# 1. 测试环境和说明
1) GPU型号:Tesla P4(7611 Mib)
2) Cuda版本:11.0
3) 模型:ResNet_v2_50
4) 为了测试异步合并batch的效果,测试数据中batch=1
5) [使用的测试代码和使用的数据集](../../examples/C++/PaddleClas/resnet_v2_50)
6) 下图中蓝色是C++ Serving,灰色为TF-Serving。
7) 折线图为QPS,数值越大表示每秒钟处理的请求数量越大,性能就越好。
8) 柱状图为平均处理时延,数值越大表示单个请求处理时间越长,性能就越差。
# 2. 同步模式
均使用同步模式,默认参数配置。
可以看出同步模型默认参数配置情况下,C++Serving QPS和平均时延指标均优于TF-Serving。
<p align="center">
<br>
<img src='../images/syn_benchmark.png'">
<br>
<p>
|client_num | model_name | qps(samples/s) | mean(ms) | model_name | qps(samples/s) | mean(ms) |
| --- | --- | --- | --- | --- | --- | --- |
| 10 | pd-serving | 111.336 | 89.787| tf-serving| 84.632| 118.13|
|30 |pd-serving |165.928 |180.761 |tf-serving |106.572 |281.473|
|50| pd-serving| 207.244| 241.211| tf-serving| 80.002 |624.959|
|70 |pd-serving |214.769 |325.894 |tf-serving |105.17 |665.561|
|100| pd-serving| 235.405| 424.759| tf-serving| 93.664 |1067.619|
|150 |pd-serving |239.114 |627.279 |tf-serving |86.312 |1737.848|
# 3. 异步模式
均使用异步模式,最大batch=32,异步线程数=2。
可以看出异步模式情况下,两者性能接近,但当Client端并发数达到70的时候,TF-Serving服务直接超时,而C++Serving能够正常返回结果。
同时,对比同步和异步模式可以看出,异步模式在请求batch数较小时,通过合并batch能够有效提高QPS和平均处理时延。
<p align="center">
<br>
<img src='../images/asyn_benchmark.png'">
<br>
<p>
|client_num | model_name | qps(samples/s) | mean(ms) | model_name | qps(samples/s) | mean(ms) |
| --- | --- | --- | --- | --- | --- | --- |
|10| pd-serving| 130.631| 76.502| tf-serving |172.64 |57.916|
|30| pd-serving| 201.062| 149.168| tf-serving| 241.669| 124.128|
|50| pd-serving| 286.01| 174.764| tf-serving |278.744 |179.367|
|70| pd-serving| 313.58| 223.187| tf-serving| 298.241| 234.7|
|100| pd-serving| 323.369| 309.208| tf-serving| 0| ∞|
|150| pd-serving| 328.248| 456.933| tf-serving| 0| ∞|
...@@ -75,15 +75,16 @@ service ImageClassifyService { ...@@ -75,15 +75,16 @@ service ImageClassifyService {
#### 2.2.2 示例配置 #### 2.2.2 示例配置
关于Serving端的配置的详细信息,可以参考[Serving端配置](SERVING_CONFIGURE.md) 关于Serving端的配置的详细信息,可以参考[Serving端配置](../Serving_Configure_CN.md)
以下配置文件将ReaderOP, ClassifyOP和WriteJsonOP串联成一个workflow (关于OP/workflow等概念,可参考[设计文档](C++DESIGN_CN.md)) 以下配置文件将ReaderOP, ClassifyOP和WriteJsonOP串联成一个workflow (关于OP/workflow等概念,可参考[OP介绍](./OP_CN.md)[DAG介绍](./DAG_CN.md))
- 配置文件示例: - 配置文件示例:
**添加文件 serving/conf/service.prototxt** **添加文件 serving/conf/service.prototxt**
```shell ```shell![image](https://user-images.githubusercontent.com/16222477/141761999-e5b5016e-ca36-4479-82bf-a83fdb95f3c0.png)
services { services {
name: "ImageClassifyService" name: "ImageClassifyService"
workflows: "workflow1" workflows: "workflow1"
...@@ -310,7 +311,7 @@ api.thrd_finalize(); ...@@ -310,7 +311,7 @@ api.thrd_finalize();
api.destroy(); api.destroy();
``` ```
具体实现可参考paddle Serving提供的例子sdk-cpp/demo/ximage.cpp 具体实现可参考C++Serving提供的例子。sdk-cpp/demo/ximage.cpp
### 3.3 链接 ### 3.3 链接
...@@ -392,4 +393,4 @@ predictors { ...@@ -392,4 +393,4 @@ predictors {
} }
} }
``` ```
关于客户端的详细配置选项,可参考[CLIENT CONFIGURATION](CLIENT_CONFIGURE.md) 关于客户端的详细配置选项,可参考[CLIENT CONFIGURATION](./Client_Configure_CN.md)
# Server端的计算图 # Server端的计算图
(简体中文|[English](./SERVER_DAG.md)) (简体中文|[English](./DAG_EN.md))
本文档显示了Server端上计算图的概念。 如何使用PaddleServing内置运算符定义计算图。 还显示了一些顺序执行逻辑的示例。 本文档显示了Server端上计算图的概念。 如何使用PaddleServing内置运算符定义计算图。 还显示了一些顺序执行逻辑的示例。
...@@ -9,7 +9,7 @@ ...@@ -9,7 +9,7 @@
深度神经网络通常在输入数据上有一些预处理步骤,而在模型推断分数上有一些后处理步骤。 由于深度学习框架现在非常灵活,因此可以在训练计算图之外进行预处理和后处理。 如果要在服务器端进行输入数据预处理和推理结果后处理,则必须在服务器上添加相应的计算逻辑。 此外,如果用户想在多个模型上使用相同的输入进行推理,则最好的方法是在仅提供一个客户端请求的情况下在服务器端同时进行推理,这样我们可以节省一些网络计算开销。 由于以上两个原因,自然而然地将有向无环图(DAG)视为服务器推理的主要计算方法。 DAG的一个示例如下: 深度神经网络通常在输入数据上有一些预处理步骤,而在模型推断分数上有一些后处理步骤。 由于深度学习框架现在非常灵活,因此可以在训练计算图之外进行预处理和后处理。 如果要在服务器端进行输入数据预处理和推理结果后处理,则必须在服务器上添加相应的计算逻辑。 此外,如果用户想在多个模型上使用相同的输入进行推理,则最好的方法是在仅提供一个客户端请求的情况下在服务器端同时进行推理,这样我们可以节省一些网络计算开销。 由于以上两个原因,自然而然地将有向无环图(DAG)视为服务器推理的主要计算方法。 DAG的一个示例如下:
<center> <center>
<img src='images/server_dag.png' width = "450" height = "500" align="middle"/> <img src='../images/server_dag.png' width = "450" height = "500" align="middle"/>
</center> </center>
## 如何定义节点 ## 如何定义节点
...@@ -18,7 +18,7 @@ ...@@ -18,7 +18,7 @@
PaddleServing在框架中具有一些预定义的计算节点。 一种非常常用的计算图是简单的reader-infer-response模式,可以涵盖大多数单一模型推理方案。 示例图和相应的DAG定义代码如下。 PaddleServing在框架中具有一些预定义的计算节点。 一种非常常用的计算图是简单的reader-infer-response模式,可以涵盖大多数单一模型推理方案。 示例图和相应的DAG定义代码如下。
<center> <center>
<img src='images/simple_dag.png' width = "260" height = "370" align="middle"/> <img src='../images/simple_dag.png' width = "260" height = "370" align="middle"/>
</center> </center>
``` python ``` python
...@@ -47,10 +47,10 @@ python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --po ...@@ -47,10 +47,10 @@ python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --po
### 包含多个输入的节点 ### 包含多个输入的节点
[Paddle Serving中的集成预测](./deprecated/MODEL_ENSEMBLE_IN_PADDLE_SERVING_CN.md)文档中给出了一个包含多个输入节点的样例,示意图和代码如下。 [Paddle Serving中的集成预测](./Model_Ensemble_CN.md)文档中给出了一个包含多个输入节点的样例,示意图和代码如下。
<center> <center>
<img src='images/complex_dag.png' width = "480" height = "400" align="middle"/> <img src='../images/complex_dag.png' width = "480" height = "400" align="middle"/>
</center> </center>
```python ```python
......
# Computation Graph On Server # Computation Graph On Server
([简体中文](./SERVER_DAG_CN.md)|English) ([简体中文](./DAG_CN.md)|English)
This document shows the concept of computation graph on server. How to define computation graph with PaddleServing built-in operators. Examples for some sequential execution logics are shown as well. This document shows the concept of computation graph on server. How to define computation graph with PaddleServing built-in operators. Examples for some sequential execution logics are shown as well.
...@@ -9,7 +9,7 @@ This document shows the concept of computation graph on server. How to define co ...@@ -9,7 +9,7 @@ This document shows the concept of computation graph on server. How to define co
Deep neural nets often have some preprocessing steps on input data, and postprocessing steps on model inference scores. Since deep learning frameworks are now very flexible, it is possible to do preprocessing and postprocessing outside the training computation graph. If we want to do input data preprocessing and inference result postprocess on server side, we have to add the corresponding computation logics on server. Moreover, if a user wants to do inference with the same inputs on more than one model, the best way is to do the inference concurrently on server side given only one client request so that we can save some network computation overhead. For the above two reasons, it is naturally to think of a Directed Acyclic Graph(DAG) as the main computation method for server inference. One example of DAG is as follows: Deep neural nets often have some preprocessing steps on input data, and postprocessing steps on model inference scores. Since deep learning frameworks are now very flexible, it is possible to do preprocessing and postprocessing outside the training computation graph. If we want to do input data preprocessing and inference result postprocess on server side, we have to add the corresponding computation logics on server. Moreover, if a user wants to do inference with the same inputs on more than one model, the best way is to do the inference concurrently on server side given only one client request so that we can save some network computation overhead. For the above two reasons, it is naturally to think of a Directed Acyclic Graph(DAG) as the main computation method for server inference. One example of DAG is as follows:
<center> <center>
<img src='images/server_dag.png' width = "450" height = "500" align="middle"/> <img src='../images/server_dag.png' width = "450" height = "500" align="middle"/>
</center> </center>
## How to define Node ## How to define Node
...@@ -19,7 +19,7 @@ Deep neural nets often have some preprocessing steps on input data, and postproc ...@@ -19,7 +19,7 @@ Deep neural nets often have some preprocessing steps on input data, and postproc
PaddleServing has some predefined Computation Node in the framework. A very commonly used Computation Graph is the simple reader-inference-response mode that can cover most of the single model inference scenarios. A example graph and the corresponding DAG definition code is as follows. PaddleServing has some predefined Computation Node in the framework. A very commonly used Computation Graph is the simple reader-inference-response mode that can cover most of the single model inference scenarios. A example graph and the corresponding DAG definition code is as follows.
<center> <center>
<img src='images/simple_dag.png' width = "260" height = "370" align="middle"/> <img src='../images/simple_dag.png' width = "260" height = "370" align="middle"/>
</center> </center>
``` python ``` python
...@@ -48,10 +48,10 @@ python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --po ...@@ -48,10 +48,10 @@ python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --po
### Nodes with multiple inputs ### Nodes with multiple inputs
An example containing multiple input nodes is given in the [MODEL_ENSEMBLE_IN_PADDLE_SERVING](./deprecated/MODEL_ENSEMBLE_IN_PADDLE_SERVING.md). A example graph and the corresponding DAG definition code is as follows. An example containing multiple input nodes is given in the [Model_Ensemble](./Model_Ensemble_EN.md). A example graph and the corresponding DAG definition code is as follows.
<center> <center>
<img src='images/complex_dag.png' width = "480" height = "400" align="middle"/> <img src='../images/complex_dag.png' width = "480" height = "400" align="middle"/>
</center> </center>
```python ```python
......
# 加密模型预测 # 加密模型预测
(简体中文|[English](ENCRYPTION.md)) (简体中文|[English](./Encryption_EN.md))
Padle Serving提供了模型加密预测功能,本文档显示了详细信息。 Padle Serving提供了模型加密预测功能,本文档显示了详细信息。
...@@ -12,7 +12,7 @@ Padle Serving提供了模型加密预测功能,本文档显示了详细信息 ...@@ -12,7 +12,7 @@ Padle Serving提供了模型加密预测功能,本文档显示了详细信息
普通的模型和参数可以理解为一个字符串,通过对其使用加密算法(参数是您的密钥),普通模型和参数就变成了一个加密的模型和参数。 普通的模型和参数可以理解为一个字符串,通过对其使用加密算法(参数是您的密钥),普通模型和参数就变成了一个加密的模型和参数。
我们提供了一个简单的演示来加密模型。请参阅[`python/examples/encryption/encrypt.py`](../python/examples/encryption/encrypt.py) 我们提供了一个简单的演示来加密模型。请参阅[examples/C++/encryption/encrypt.py](../../examples/C++/encryption/encrypt.py)
### 启动加密服务 ### 启动加密服务
...@@ -40,5 +40,4 @@ python -m paddle_serving_server.serve --model encrypt_server/ --port 9300 --use_ ...@@ -40,5 +40,4 @@ python -m paddle_serving_server.serve --model encrypt_server/ --port 9300 --use_
### 模型加密推理示例 ### 模型加密推理示例
模型加密推理示例, 请参见[`/python/examples/encryption/`](../python/examples/encryption/) 模型加密推理示例, 请参见[examples/C++/encryption/](../../examples/C++/encryption/)
# MOEDL ENCRYPTION INFERENCE # MOEDL ENCRYPTION INFERENCE
([简体中文](ENCRYPTION_CN.md)|English) ([简体中文](./Encryption_CN.md)|English)
Paddle Serving provides model encryption inference, This document shows the details. Paddle Serving provides model encryption inference, This document shows the details.
...@@ -12,7 +12,7 @@ We use symmetric encryption algorithm to encrypt the model. Symmetric encryption ...@@ -12,7 +12,7 @@ We use symmetric encryption algorithm to encrypt the model. Symmetric encryption
Normal model and parameters can be understood as a string, by using the encryption algorithm (parameter is your key) on them, the normal model and parameters become an encrypted one. Normal model and parameters can be understood as a string, by using the encryption algorithm (parameter is your key) on them, the normal model and parameters become an encrypted one.
We provide a simple demo to encrypt the model. See the [python/examples/encryption/encrypt.py](../python/examples/encryption/encrypt.py) We provide a simple demo to encrypt the model. See the [examples/C++/encryption/encrypt.py](../../examples/C++/encryption/encrypt.py)
### Start Encryption Service ### Start Encryption Service
...@@ -40,5 +40,4 @@ Once the server gets the key, it uses the key to parse the model and starts the ...@@ -40,5 +40,4 @@ Once the server gets the key, it uses the key to parse the model and starts the
### Example of Model Encryption Inference ### Example of Model Encryption Inference
Example of model encryption inference, See the [`/python/examples/encryption/`](../python/examples/encryption/) Example of model encryption inference, See the [examples/C++/encryption/](../../examples/C++/encryption/)
此差异已折叠。
# Paddle Serving中的模型热加载 # Paddle Serving中的模型热加载
(简体中文|[English](HOT_LOADING_IN_SERVING.md)) (简体中文|[English](./Hot_Loading_EN.md))
## 背景 ## 背景
......
# Hot Loading in Paddle Serving # Hot Loading in Paddle Serving
([简体中文](HOT_LOADING_IN_SERVING_CN.md)|English) ([简体中文](./Hot_Loading_CN.md)|English)
## Background ## Background
......
...@@ -7,19 +7,17 @@ Paddle Serving服务端目前提供了支持Http直接访问的功能,本文 ...@@ -7,19 +7,17 @@ Paddle Serving服务端目前提供了支持Http直接访问的功能,本文
BRPC-Server端支持通过Http的方式被访问,各种语言都有实现Http请求的一些库,所以Java/Python/Go等BRPC支持不太完善的语言,可以通过Http的方式直接访问服务端进行预测。 BRPC-Server端支持通过Http的方式被访问,各种语言都有实现Http请求的一些库,所以Java/Python/Go等BRPC支持不太完善的语言,可以通过Http的方式直接访问服务端进行预测。
### Http方式 ### Http方式
基本流程和原理:客户端需要将数据按照Proto约定的格式(请参阅[`core/general-server/proto/general_model_service.proto`](../core/general-server/proto/general_model_service.proto))封装在Http请求的请求体中。 基本流程和原理:客户端需要将数据按照Proto约定的格式(请参阅[`core/general-server/proto/general_model_service.proto`](../../core/general-server/proto/general_model_service.proto))封装在Http请求的请求体中。
BRPC-Server会尝试去JSON字符串中再去反序列化出Proto格式的数据,从而进行后续的处理。 BRPC-Server会尝试去JSON字符串中再去反序列化出Proto格式的数据,从而进行后续的处理。
### Http+protobuf方式 ### Http+protobuf方式
各种语言都提供了对ProtoBuf的支持,如果您对此比较熟悉,您也可以先将数据使用ProtoBuf序列化,再将序列化后的数据放入Http请求数据体中,然后指定Content-Type: application/proto,从而使用http/h2+protobuf二进制串访问服务。 各种语言都提供了对ProtoBuf的支持,如果您对此比较熟悉,您也可以先将数据使用ProtoBuf序列化,再将序列化后的数据放入Http请求数据体中,然后指定Content-Type: application/proto,从而使用http/h2+protobuf二进制串访问服务。
实测随着数据量的增大,使用JSON方式的Http的数据量和反序列化的耗时会大幅度增加,推荐当您的数据量较大时,使用Http+protobuf方式,目前已经在Java和Python的Client端提供了支持。 实测随着数据量的增大,使用JSON方式的Http的数据量和反序列化的耗时会大幅度增加,推荐当您的数据量较大时,使用Http+protobuf方式,目前已经在Java和Python的Client端提供了支持。
**理论上讲,序列化/反序列化的性能从高到底排序为:protobuf > http/h2+protobuf > http**
## 示例 ## 示例
我们将以python/examples/fit_a_line为例,讲解如何通过Http访问Server端。 我们将以examples/C++/fit_a_line为例,讲解如何通过Http访问Server端。
### 获取模型 ### 获取模型
...@@ -42,13 +40,13 @@ python3.6 -m paddle_serving_server.serve --model uci_housing_model --thread 10 - ...@@ -42,13 +40,13 @@ python3.6 -m paddle_serving_server.serve --model uci_housing_model --thread 10 -
为了方便用户快速的使用Http方式请求Server端预测服务,我们已经将常用的Http请求的数据体封装、压缩、请求加密等功能封装为一个HttpClient类提供给用户,方便用户使用。 为了方便用户快速的使用Http方式请求Server端预测服务,我们已经将常用的Http请求的数据体封装、压缩、请求加密等功能封装为一个HttpClient类提供给用户,方便用户使用。
使用HttpClient最简单只需要四步,1、创建一个HttpClient对象。2、加载Client端的prototxt配置文件(本例中为python/examples/fit_a_line/目录下的uci_housing_client/serving_client_conf.prototxt)。3、调用connect函数。4、调用Predict函数,通过Http方式请求预测服务。 使用HttpClient最简单只需要四步,1、创建一个HttpClient对象。2、加载Client端的prototxt配置文件(本例中为examples/C++/fit_a_line目录下的uci_housing_client/serving_client_conf.prototxt)。3、调用connect函数。4、调用Predict函数,通过Http方式请求预测服务。
此外,您可以根据自己的需要配置Server端IP、Port、服务名称(此服务名称需要与[`core/general-server/proto/general_model_service.proto`](../core/general-server/proto/general_model_service.proto)文件中的Service服务名和rpc方法名对应,即`GeneralModelService`字段和`inference`字段),设置Request数据体压缩,设置Response支持压缩传输,模型加密预测(需要配置Server端使用模型加密)、设置响应超时时间等功能。 此外,您可以根据自己的需要配置Server端IP、Port、服务名称(此服务名称需要与[`core/general-server/proto/general_model_service.proto`](../../core/general-server/proto/general_model_service.proto)文件中的Service服务名和rpc方法名对应,即`GeneralModelService`字段和`inference`字段),设置Request数据体压缩,设置Response支持压缩传输,模型加密预测(需要配置Server端使用模型加密)、设置响应超时时间等功能。
Python的HttpClient使用示例见[`python/examples/fit_a_line/test_httpclient.py`](../python/examples/fit_a_line/test_httpclient.py),接口详见[`python/paddle_serving_client/httpclient.py`](../python/paddle_serving_client/httpclient.py) Python的HttpClient使用示例见[`examples/C++/fit_a_line/test_httpclient.py`](../../examples/C++/fit_a_line/test_httpclient.py),接口详见[`python/paddle_serving_client/httpclient.py`](../../python/paddle_serving_client/httpclient.py)
Java的HttpClient使用示例见[`java/examples/src/main/java/PaddleServingClientExample.java`](../java/examples/src/main/java/PaddleServingClientExample.java)接口详见[`java/src/main/java/io/paddle/serving/client/HttpClient.java`](../java/src/main/java/io/paddle/serving/client/HttpClient.java) Java的HttpClient使用示例见[`java/examples/src/main/java/PaddleServingClientExample.java`](../../java/examples/src/main/java/PaddleServingClientExample.java)接口详见[`java/src/main/java/io/paddle/serving/client/Client.java`](../../java/src/main/java/io/paddle/serving/client/Client.java)
如果不能满足您的需求,您也可以在此基础上添加一些功能。 如果不能满足您的需求,您也可以在此基础上添加一些功能。
...@@ -64,7 +62,7 @@ curl -XPOST http://0.0.0.0:9393/GeneralModelService/inference -d ' {"tensor":[{" ...@@ -64,7 +62,7 @@ curl -XPOST http://0.0.0.0:9393/GeneralModelService/inference -d ' {"tensor":[{"
``` ```
其中`127.0.0.1:9393`为IP和Port,根据您服务端启动的IP和Port自行设定。 其中`127.0.0.1:9393`为IP和Port,根据您服务端启动的IP和Port自行设定。
`GeneralModelService`字段和`inference`字段分别为Proto文件中的Service服务名和rpc方法名,详见[`core/general-server/proto/general_model_service.proto`](../core/general-server/proto/general_model_service.proto) `GeneralModelService`字段和`inference`字段分别为Proto文件中的Service服务名和rpc方法名,详见[`core/general-server/proto/general_model_service.proto`](../../core/general-server/proto/general_model_service.proto)
-d后面的是请求的数据体,json中一定要包含上述proto中的required字段,否则转化会失败,对应请求会被拒绝。 -d后面的是请求的数据体,json中一定要包含上述proto中的required字段,否则转化会失败,对应请求会被拒绝。
......
# Inference Protocols
C++ Serving基于BRPC进行服务构建,支持BRPC、GRPC、RESTful请求。请求数据为protobuf格式,详见`core/general-server/proto/general_model_service.proto`。本文介绍构建请求以及解析结果的方法。
## Tensor
Tensor可以装载多种类型的数据,是Request和Response的基础单元。Tensor的具体定义如下:
```protobuf
message Tensor {
// VarType: INT64
repeated int64 int64_data = 1;
// VarType: FP32
repeated float float_data = 2;
// VarType: INT32
repeated int32 int_data = 3;
// VarType: FP64
repeated double float64_data = 4;
// VarType: UINT32
repeated uint32 uint32_data = 5;
// VarType: BOOL
repeated bool bool_data = 6;
// (No support)VarType: COMPLEX64, 2x represents the real part, 2x+1
// represents the imaginary part
repeated float complex64_data = 7;
// (No support)VarType: COMPLEX128, 2x represents the real part, 2x+1
// represents the imaginary part
repeated double complex128_data = 8;
// VarType: STRING
repeated string data = 9;
// Element types:
// 0 => INT64
// 1 => FP32
// 2 => INT32
// 3 => FP64
// 4 => INT16
// 5 => FP16
// 6 => BF16
// 7 => UINT8
// 8 => INT8
// 9 => BOOL
// 10 => COMPLEX64
// 11 => COMPLEX128
// 20 => STRING
int32 elem_type = 10;
// Shape of the tensor, including batch dimensions.
repeated int32 shape = 11;
// Level of data(LOD), support variable length data, only for fetch tensor
// currently.
repeated int32 lod = 12;
// Correspond to the variable 'name' in the model description prototxt.
string name = 13;
// Correspond to the variable 'alias_name' in the model description prototxt.
string alias_name = 14; // get from the Model prototxt
// VarType: FP16, INT16, INT8, BF16, UINT8
bytes tensor_content = 15;
};
```
- elem_type:数据类型,当前支持FLOAT32, INT64, INT32, UINT8, INT8, FLOAT16
|elem_type|类型|
|---------|----|
|0|INT64|
|1|FLOAT32|
|2|INT32|
|3|FP64|
|4|INT16|
|5|FP16|
|6|BF16|
|7|UINT8|
|8|INT8|
- shape:数据维度
- lod:lod信息,LoD(Level-of-Detail) Tensor是Paddle的高级特性,是对Tensor的一种扩充,用于支持更自由的数据输入。详见[LOD](../LOD_CN.md)
- name/alias_name: 名称及别名,与模型配置对应
### 构建FLOAT32数据Tensor
```C
// 原始数据
std::vector<float> float_data;
Tensor *tensor = new Tensor;
// 设置维度,可以设置多维
for (uint32_t j = 0; j < float_shape.size(); ++j) {
tensor->add_shape(float_shape[j]);
}
// 设置LOD信息
for (uint32_t j = 0; j < float_lod.size(); ++j) {
tensor->add_lod(float_lod[j]);
}
// 设置类型、名称及别名
tensor->set_elem_type(1);
tensor->set_name(name);
tensor->set_alias_name(alias_name);
// 拷贝数据
int total_number = float_data.size();
tensor->mutable_float_data()->Resize(total_number, 0);
memcpy(tensor->mutable_float_data()->mutable_data(), float_datadata(), total_number * sizeof(float));
```
### 构建INT8数据Tensor
```C
// 原始数据
std::string string_data;
Tensor *tensor = new Tensor;
for (uint32_t j = 0; j < string_shape.size(); ++j) {
tensor->add_shape(string_shape[j]);
}
for (uint32_t j = 0; j < string_lod.size(); ++j) {
tensor->add_lod(string_lod[j]);
}
tensor->set_elem_type(8);
tensor->set_name(name);
tensor->set_alias_name(alias_name);
tensor->set_tensor_content(string_data);
```
## Request
Request为客户端需要发送的请求数据,其以Tensor为基础数据单元,并包含了额外的请求信息。定义如下:
```protobuf
message Request {
repeated Tensor tensor = 1;
repeated string fetch_var_names = 2;
bool profile_server = 3;
uint64 log_id = 4;
};
```
- fetch_vat_names: 需要获取的输出数据名称,在GeneralResponseOP会根据该列表进行过滤.请参考模型文件serving_client_conf.prototxt中的`fetch_var`字段下的`alias_name`
- profile_server: 调试参数,打开时会输出性能信息
- log_id: 请求ID
### 构建Request
当使用BRPC或GRPC进行请求时,使用protobuf形式数据,构建方式如下:
```C
Request req;
req.set_log_id(log_id);
for (auto &name : fetch_name) {
req.add_fetch_var_names(name);
}
// 添加Tensor
Tensor *tensor = req.add_tensor();
...
```
当使用RESTful请求时,可以使用JSON形式数据,具体格式如下:
```JSON
{"tensor":[{"float_data":[0.0137,-0.1136,0.2553,-0.0692,0.0582,-0.0727,-0.1583,-0.0584,0.6283,0.4919,0.1856,0.0795,-0.0332],"elem_type":1,"name":"x","alias_name":"x","shape":[1,13]}],"fetch_var_names":["price"],"log_id":0}
```
## Response
Response为服务端返回给客户端的结果,包含了Tensor数据、错误码、错误信息等。定义如下:
```protobuf
message Response {
repeated ModelOutput outputs = 1;
repeated int64 profile_time = 2;
// Error code
int32 err_no = 3;
// Error messages
string err_msg = 4;
};
message ModelOutput {
repeated Tensor tensor = 1;
string engine_name = 2;
}
```
- profile_time:当设置request->set_profile_server(true)时,会返回性能信息
- err_no:错误码,详见`core/predictor/common/constant.h`
- err_msg:错误信息,详见`core/predictor/common/constant.h`
- engine_name:输出节点名称
|err_no|err_msg|
|---------|----|
|0|OK|
|-5000|"Paddle Serving Framework Internal Error."|
|-5001|"Paddle Serving Memory Alloc Error."|
|-5002|"Paddle Serving Array Overflow Error."|
|-5100|"Paddle Serving Op Inference Error."|
### 读取Response数据
```C
uint32_t model_num = res.outputs_size();
for (uint32_t m_idx = 0; m_idx < model_num; ++m_idx) {
std::string engine_name = output.engine_name();
int idx = 0;
// 读取tensor维度
int shape_size = output.tensor(idx).shape_size();
for (int i = 0; i < shape_size; ++i) {
shape[i] = output.tensor(idx).shape(i);
}
// 读取LOD信息
int lod_size = output.tensor(idx).lod_size();
if (lod_size > 0) {
lod.resize(lod_size);
for (int i = 0; i < lod_size; ++i) {
lod[i] = output.tensor(idx).lod(i);
}
}
// 读取float数据
int size = output.tensor(idx).float_data_size();
float_data = std::vector<float>(
output.tensor(idx).float_data().begin(),
output.tensor(idx).float_data().begin() + size);
// 读取int8数据
string_data = output.tensor(idx).tensor_content();
}
```
\ No newline at end of file
# C++ Serving 简要介绍
## 适用场景
C++ Serving主打性能,如果您想搭建企业级的高性能线上推理服务,对高并发、低延时有一定的要求。C++ Serving框架可能会更适合您。目前无论是使用同步/异步模型,[C++ Serving与TensorFlow Serving性能对比](./Benchmark_CN.md)均有优势。
C++ Serving网络框架使用brpc,核心执行引擎是基于C/C++编写,并且提供强大的工业级应用能力,包括模型热加载、模型加密部署、A/B Test、多模型组合、同步/异步模式、支持多语言多协议Client等功能。
## 1.网络框架(BRPC)
C++ Serving采用[brpc框架](https://github.com/apache/incubator-brpc)进行Client/Server端的通信。brpc是百度开源的一款PRC网络框架,具有高并发、低延时等特点,已经支持了包括百度在内上百万在线预估实例、上千个在线预估服务,稳定可靠。与gRPC网络框架相比,具有更低的延时,更高的并发性能,且底层支持<mark>**brpc/grpc/http+json/http+proto**</mark>等多种协议;缺点是跨操作系统平台能力不足。详细的框架性能开销见[C++ Serving框架性能测试](./Frame_Performance_CN.md)
## 2.核心执行引擎
C++ Serving的核心执行引擎是一个有向无环图(也称作[DAG图](./DAG_CN.md)),DAG图中的每个节点(在PaddleServing中,借用模型中operator算子的概念,将DAG图中的节点也称为[OP](./OP_CN.md))代表预估服务的一个环节,DAG图支持多个OP按照串并联的方式进行组合,从而实现在一个服务中完成多个模型的预测整合最终产出结果。整个框架原理如下图所示,可分为Client Side 和 Server Side。
<p align="center">
<br>
<img src='../images/design_doc.png'">
<br>
<p>
### 2.1 Client Side
如图所示,Client端通过Pybind API接口将Request请求,按照ProtoBuf协议进行序列化后,经由BRPC网络框架Client端发送给Server端。此时,Client端等待Server端的返回数据并反序列化为正常的数据,之后将结果返给Client调用方。
### 2.2 Server Side
Server端接收到序列化的Request请求后,反序列化正常数据,进入图执行引擎,按照定义好的DAG图结构,执行每个OP环节的操作(每个OP环节的处理由用户定义,即可以只是单纯的数据处理,也可以是调用预测引擎用不同的模型对输入数据进行预测),当DAG图中所有OP环节均执行完成后,将结果数据序列化后返回给Client端。
### 2.3 通信数据格式ProtoBuf
Protocol Buffers(简称Protobuf) ,是Google出品的序列化框架,与开发语言无关,和平台无关,具有良好的可扩展性。Protobuf和所有的序列化框架一样,都可以用于数据存储、通讯协议。Protobuf支持生成代码的语言包括Java、Python、C++、Go、JavaNano、Ruby、C#。Portobuf的序列化的结果体积要比XML、JSON小很多,速度比XML、JSON快很多。
在C++ Serving中定义了Client Side 和 Server Side之间通信的ProtoBuf,详细的字段的介绍见《[C++ Serving ProtoBuf简介](./Inference_Protocols_CN.md)》。
## 3.Server端特性
### 3.1 启动Server端
Server端的核心是一个由项目代码编译产生的名称为serving的二进制可执行文件,启动serving时需要用户指定一些参数(<mark>**例如,网络IP和Port端口、brpc线程数、使用哪个显卡、模型文件路径、模型是否开启trt、XPU推理、模型精度设置等等**</mark>),有些参数是通过命令行直接传入的,还有一些是写在指定的配置文件中配置文件中。
为了方便用户快速的启动C++ Serving的Server端,除了用户自行修改配置文件并通过命令行传参运行serving二进制可执行文件以外,我们也提供了另外一种通过python脚本启动的方式。python脚本启动本质上仍是运行serving二进制可执行文件,但python脚本中会自动完成两件事:1、配置文件的生成;2、根据需要配置的参数,生成命令行,通过命令行的方式,传入参数信息并运行serving二进制可执行文件。
更多详细说明和示例,请参考[C++ Serving 参数配置和启动的详细说明](../Serving_Configure_CN.md)
### 3.2 同步/异步模式
同步模式比较简单直接,适用于模型预测时间短,单个Request请求的batch已经比较大的情况。
同步模型下,Server端线程数N = 模型预测引擎数N = 同时处理Request请求数N,超发的Request请求需要等待当前线程处理结束后才能得到响应和处理。
<p align="center">
<img src='../images/syn_mode.png' width = "350" height = "300">
<p>
异步模型主要适用于模型支持多batch(最大batch数M可通过配置选项指定),单个Request请求的batch较小(batch << M),单次预测时间较长的情况。
异步模型下,Server端N个线程只负责接收Request请求,实际调用预测引擎是在异步框架的线程池中,异步框架的线程数可以由配置选项来指定。为了方便理解,我们假设每个Request请求的batch均为1,此时异步框架会尽可能多得从请求池中取n(n≤M)个Request并将其拼装为1个Request(batch=n),调用1次预测引擎,得到1个Response(batch = n),再将其对应拆分为n个Response作为返回结果。
<p align="center">
<img src='../images/asyn_mode.png'">
<p>
更多关于模式参数配置以及性能调优的介绍见《[C++ Serving性能调优](./Performance_Tuning_CN.md)》。
### 3.3 多模型组合
当用户需要多个模型组合处理结果来作为一个服务接口对外暴露时,通常的解决办法是搭建内外两层服务,内层服务负责跑模型预测,外层服务负责串联和前后处理。当传输的数据量不大时,这样做的性能开销并不大,但当输出的数据量较大时,因为网络传输而带来的性能开销不容忽视(实测单次传输40MB数据时,RPC耗时为160-170ms)。
<p align="center">
<br>
<img src='../images/multi_model.png'>
<br>
<p>
C++ Serving框架支持[自定义DAG图](./Model_Ensemble_CN.md)的方式来表示多模型之间串并联组合关系,也支持用户[使用C++开发自定义OP节点](./OP_CN.md)。相比于使用内外两层服务来提供多模型组合处理的方式,由于节省了一次RPC网络传输的开销,把多模型在一个服务中处理性能上会有一定的提升,尤其当RPC通信传输的数据量较大时。
### 3.4 模型管理与热加载
C++ Serving的引擎支持模型管理功能,支持多种模型和模型不同版本的管理。为了保证在模型更换期间推理服务的可用性,需要在服务不中断的情况下对模型进行热加载。C++ Serving对该特性进行了支持,并提供了一个监控产出模型更新本地模型的工具,具体例子请参考《[C++ Serving中的模型热加载](./Hot_Loading_CN.md)》。
### 3.5 模型加解密
C++ Serving采用对称加密算法对模型进行加密,在服务加载模型过程中在内存中解密。目前,提供基础的模型安全能力,并不保证模型绝对安全性,用户可根据我们的设计加以完善,实现更高级别的安全性。说明文档参考《[C++ Serving加密模型预测](./Encryption_CN.md)》。
## 4.Client端特性
### 4.1 A/B Test
在对模型进行充分的离线评估后,通常需要进行在线A/B测试,来决定是否大规模上线服务。下图为使用Paddle Serving做A/B测试的基本结构,Client端做好相应的配置后,自动将流量分发给不同的Server,从而完成A/B测试。具体例子请参考《[如何使用Paddle Serving做ABTEST](./ABTest_CN.md)》。
<p align="center">
<br>
<img src='../images/abtest.png' width = "345" height = "230">
<br>
<p>
### 4.2 多语言多协议Client
BRPC网络框架支持[多种底层通信协议](#1.网络框架(BRPC)),即使用目前的C++ Serving框架的Server端,各种语言的Client端,甚至使用curl的方式,只要按照上述协议(具体支持的协议见[brpc官网](https://github.com/apache/incubator-brpc))封装数据并发送,Server端就能够接收、处理和返回结果。
对于支持的各种协议我们提供了部分的Client SDK示例供用户参考和使用,用户也可以根据自己的需求去开发新的Client SDK,也欢迎用户添加其他语言/协议(例如GRPC-Go、GRPC-C++ HTTP2-Go、HTTP2-Java等)Client SDK到我们的仓库供其他开发者借鉴和参考。
| 通信协议 | 速度 | 是否支持 | 是否提供Client SDK |
|-------------|-----|---------|-------------------|
| BRPC | 最快 | 支持 | [C++](../../core/general-client/README_CN.md)[Python(Pybind方式)](../../examples/C++/fit_a_line/README_CN.md) |
| HTTP2+Proto | 快 | 支持 | coming soon |
| GRPC | 快 | 支持 | [Java](../../java/README_CN.md)[Python](../../examples/C++/fit_a_line/README_CN.md) |
| HTTP1+Proto | 一般 | 支持 | [Java](../../java/README_CN.md)[Python](../../examples/C++/fit_a_line/README_CN.md) |
| HTTP1+Json | 慢 | 支持 | [Java](../../java/README_CN.md)[Python](../../examples/C++/fit_a_line/README_CN.md)[Curl](Http_Service_CN.md) |
# Paddle Serving中的集成预测 # Paddle Serving中的集成预测
(简体中文|[English](MODEL_ENSEMBLE_IN_PADDLE_SERVING.md)) (简体中文|[English](./Model_Ensemble_EN.md))
在一些场景中,可能使用多个相同输入的模型并行集成预测以获得更好的预测效果,Paddle Serving提供了这项功能。 在一些场景中,可能使用多个相同输入的模型并行集成预测以获得更好的预测效果,Paddle Serving提供了这项功能。
...@@ -14,7 +14,7 @@ ...@@ -14,7 +14,7 @@
需要注意的是,目前只支持在同一个服务中使用多个相同格式输入输出的模型。在该例子中,CNN模型和BOW模型的输入输出格式是相同的。 需要注意的是,目前只支持在同一个服务中使用多个相同格式输入输出的模型。在该例子中,CNN模型和BOW模型的输入输出格式是相同的。
样例中用到的代码保存在`python/examples/imdb`路径下: 样例中用到的代码保存在`examples/C++/imdb`路径下:
```shell ```shell
. .
......
# Model Ensemble in Paddle Serving # Model Ensemble in Paddle Serving
([简体中文](MODEL_ENSEMBLE_IN_PADDLE_SERVING_CN.md)|English) ([简体中文](Model_Ensemble_CN.md)|English)
In some scenarios, multiple models with the same input may be used to predict in parallel and integrate predicted results for better prediction effect. Paddle Serving also supports this feature. In some scenarios, multiple models with the same input may be used to predict in parallel and integrate predicted results for better prediction effect. Paddle Serving also supports this feature.
...@@ -14,7 +14,7 @@ In this example (see the figure below), the server side predict the bow and CNN ...@@ -14,7 +14,7 @@ In this example (see the figure below), the server side predict the bow and CNN
It should be noted that at present, only multiple models with the same format input and output in the same service are supported. In this example, the input and output formats of CNN and BOW model are the same. It should be noted that at present, only multiple models with the same format input and output in the same service are supported. In this example, the input and output formats of CNN and BOW model are the same.
The code used in the example is saved in the `python/examples/imdb` path: The code used in the example is saved in the `examples/C++/imdb` path:
```shell ```shell
. .
......
# 如何开发一个新的General Op? # 如何开发一个新的General Op?
(简体中文|[English](./NEW_OPERATOR.md)) (简体中文|[English](./OP_EN.md))
在本文档中,我们主要集中于如何为Paddle Serving开发新的服务器端运算符。 在开始编写新运算符之前,让我们看一些示例代码以获得为服务器编写新运算符的基本思想。 我们假设您已经知道Paddle Serving服务器端的基本计算逻辑。 下面的代码您可以在 Serving代码库下的 `core/general-server/op` 目录查阅。 在本文档中,我们主要集中于如何为Paddle Serving开发新的服务器端运算符。 在开始编写新运算符之前,让我们看一些示例代码以获得为服务器编写新运算符的基本思想。 我们假设您已经知道Paddle Serving服务器端的基本计算逻辑。 下面的代码您可以在 Serving代码库下的 `core/general-server/op` 目录查阅。
......
# How to write an general operator? # How to write an general operator?
([简体中文](./NEW_OPERATOR_CN.md)|English) ([简体中文](./OP_CN.md)|English)
In this document, we mainly focus on how to develop a new server side operator for PaddleServing. Before we start to write a new operator, let's look at some sample code to get the basic idea of writing a new operator for server. We assume you have known the basic computation logic on server side of PaddleServing, please reference to []() if you do not know much about it. The following code can be visited at `core/general-server/op` of Serving repo. In this document, we mainly focus on how to develop a new server side operator for PaddleServing. Before we start to write a new operator, let's look at some sample code to get the basic idea of writing a new operator for server. We assume you have known the basic computation logic on server side of PaddleServing, please reference to []() if you do not know much about it. The following code can be visited at `core/general-server/op` of Serving repo.
......
# C++ Serving性能分析与优化
# 1.背景知识介绍
1) 首先,应确保您知道C++ Serving常用的一些[功能特点](./Introduction_CN.md)[C++ Serving 参数配置和启动的详细说明](../Serving_Configure_CN.md)
2) 关于C++ Serving框架本身的性能分析和介绍,请参考[C++ Serving框架性能测试](./Frame_Performance_CN.md)
3) 您需要对您使用的模型、机器环境、需要部署上线的业务有一些了解,例如,您使用CPU还是GPU进行预测;是否可以开启TRT进行加速;你的机器CPU是多少core的;您的业务包含几个模型;每个模型的输入和输出需要做些什么处理;您业务的最大线上流量是多少;您的模型支持的最大输入batch是多少等等.
# 2.Server线程数
首先,Server端线程数N并不是越大越好。众所周知,线程的切换涉及到用户空间和内核空间的切换,有一定的开销,当您的core数=1,而线程数为100000时,线程的频繁切换将带来不可忽视的性能开销。
在BRPC框架中,用户态协程worker数M >> 线程数N,用户态协程worker会工作在任意一个线程中,当RPC网络传输IO操作让出CPU资源时,BRPC会进行用户态协程worker的切换从而提高RPC框架的并发性。所以,极端情况下,若您的代码中除RPC通信外,没有阻塞线程的任何IO或网络操作,您的线程数完全可以 == 机器core数量,您不必担心N个线程都在进行RPC网络IO,而导致CPU利用率不高的问题。
Server端<mark>**线程数N**</mark>的设置需要结合三个因素来综合考虑:
## 2.1 最大并发请求量M
根据最大并发请求量来设置Server端线程数N,根据[C++ Serving框架性能测试](./Frame_Performance_CN.md)中的数据来看,此时<mark>**线程数N应等于或略小于最大并发请求量M**</mark>,此时平均处理时延最小。
这也很容易理解,举个极端的例子,如果您每次只有1个请求,那此时Server端线程数设置1是最合理的,因为此时没有任何线程切换的开销。如果您设置线程数为任何大于1的数,必然就带来了线程切换的开销。
## 2.2 机器core数量C
根据机器core数量来设置Server端线程数N,众所周知,线程是CPU core调度执行的最小单元,若要在一个进程内充分使用所有的core,<mark>**线程数至少应该>=机器core数量C**</mark>,但具体线程数N/机器core数量C = ?需要您根据您的代码中网络、IO、内存和计算所占用的比例来决定,一般用户可以通过设置不同的线程数来测试CPU占用率来不断调整。
## 2.3 模型预测时间长短T
当您使用CPU进行预测时,预测阶段的计算是使用CPU完成的,此时,请参考前两者来进行设置线程数。
当您使用GPU进行预测时,情况有些不同,此时预测阶段的计算是由GPU完成的,此时CPU资源是空闲的,而预测操作是阻塞该线程的,类似于Sleep操作,此时若您的线程数==机器core数量,将没有其他可切换的线程从而导致必然有部分core是空闲的状态。具体来说,当模型预测时间较短时(<10ms),Server端线程数不宜过多(线程数=1~10倍core数量),否则线程切换带来的开销不可忽视。当模型预测时间较长时,Server端线程数应稍大一些(线程数=4~200倍core数量)。
# 3.异步模式
<mark>**大部分用户的Request请求batch数<<模型最大支持的Batch数**</mark>时,采用异步模式的收益是明显的。
异步模型的原理是将模型预测阶段与RPC线程脱离,模型单独开辟一个线程数可指定的线程池,RPC收到Request后将请求数据放入模型的线程池中的Task队列中,线程池中的线程从Task中取出数据合并Batch后进行预测,从而提升QPS,更多详细的介绍见[C++Serving功能简介](./Introduction_CN.md),同步模式与异步模式的数据对比见[C++ Serving vs TensorFlow Serving 性能对比](./Benchmark_CN.md),在上述测试的条件下,异步模型比同步模式快百分50%。
异步模式的开启有以下两种方式。
## 3.1 Python命令辅助启动C++Server
`python3 -m paddle_serving_server.serve`通过添加`--runtime_thread_num 2`指定该模型开启异步模式,其中2表示的是该模型异步线程池中的线程数为2,该数值默认值为0,此时表示不使用异步模式。`--runtime_thread_num`的具体数值设置根据模型、数据和显卡的可用显存来设置。
通过添加`--batch_infer_size 32`来设置模型最大允许Batch == 32 的输入,此参数只有在异步模型开启的状态下,才有效。
## 3.2 命令行+配置文件启动C++Server
此时通过修改`model_toolkit.prototxt`中的`runtime_thread_num`字段和`batch_infer_size`字段同样能达到上述效果。
# 4.多模型组合
<mark>**您的业务中需要调用多个模型进行预测**</mark>时,如果您追求极致的性能,您可以考虑使用C++Serving[自定义OP](./OP_CN.md)[自定义DAG图](./DAG_CN.md)的方式来实现上述需求。
## 4.1 优点
由于在一个服务中做模型的组合,节省了网络IO的时间和序列化反序列化的时间,尤其当数据量比较大时,收益十分明显(实测单次传输40MB数据时,RPC耗时为160-170ms)。
## 4.2 缺点
1) 需要使用C++去自定义OP和自定义DAG图去定义模型之间的组合关系。
2) 若多个模型之间需要前后处理,您也需要使用C++在OP之间去编写这部分代码。
3) 需要重新编译Server端代码。
## 4.3 示例
请参考[examples/C++/PaddleOCR/ocr/README_CN.md](../../examples/C++/PaddleOCR/ocr/README_CN.md)`C++ OCR Service服务章节`[Paddle Serving中的集成预测](./Model_Ensemble_CN.md)中的例子。
# How to compile PaddleServing
([简体中文](./COMPILE_CN.md)|English)
## Compilation environment requirements
| module | version |
| :--------------------------: | :-------------------------------: |
| OS | Ubuntu16 and 18/CentOS 7 |
| gcc | 5.4.0(Cuda 10.1) and 8.2.0 |
| gcc-c++ | 5.4.0(Cuda 10.1) and 8.2.0 |
| cmake | 3.2.0 and later |
| Python | 3.6.0 and later |
| Go | 1.9.2 and later |
| git | 2.17.1 and later |
| glibc-static | 2.17 |
| openssl-devel | 1.0.2k |
| bzip2-devel | 1.0.6 and later |
| python3-devel | 3.6.0 and later |
| sqlite-devel | 3.7.17 and later |
| patchelf | 0.9 |
| libXext | 1.3.3 |
| libSM | 1.2.2 |
| libXrender | 0.9.10 |
It is recommended to use Docker for compilation. We have prepared the Paddle Serving compilation environment for you, see [this document](DOCKER_IMAGES.md).
## Get Code
``` python
git clone https://github.com/PaddlePaddle/Serving
cd Serving && git submodule update --init --recursive
```
## PYTHONROOT settings
```shell
# For example, the path of python is /usr/bin/python, you can set PYTHONROOT
export PYTHONROOT=/usr
```
If you are using a Docker development image, please follow the following to determine the Python version to be compiled, and set the corresponding environment variables
```
#Python3.6
export PYTHONROOT=/usr/local/
export PYTHON_INCLUDE_DIR=$PYTHONROOT/include/python3.6m
export PYTHON_LIBRARIES=$PYTHONROOT/lib/libpython3.6m.so
export PYTHON_EXECUTABLE=$PYTHONROOT/bin/python3.6
#Python3.7
export PYTHONROOT=/usr/local/
export PYTHON_INCLUDE_DIR=$PYTHONROOT/include/python3.7m
export PYTHON_LIBRARIES=$PYTHONROOT/lib/libpython3.7m.so
export PYTHON_EXECUTABLE=$PYTHONROOT/bin/python3.7
#Python3.8
export PYTHONROOT=/usr/local/
export PYTHON_INCLUDE_DIR=$PYTHONROOT/include/python3.8
export PYTHON_LIBRARIES=$PYTHONROOT/lib/libpython3.8.so
export PYTHON_EXECUTABLE=$PYTHONROOT/bin/python3.8
```
## Install Python dependencies
```shell
pip install -r python/requirements.txt -i https://mirror.baidu.com/pypi/simple
```
If you use other Python version, please use the right `pip` accordingly.
## GOPATH Setting
The default GOPATH is set to `$HOME/go`, you can also set it to other values. **If it is the Docker environment provided by Serving, you do not need to set up.**
```shell
export GOPATH=$HOME/go
export PATH=$PATH:$GOPATH/bin
```
## Get go packages
```shell
go env -w GO111MODULE=on
go env -w GOPROXY=https://goproxy.cn,direct
go get -u github.com/grpc-ecosystem/grpc-gateway/protoc-gen-grpc-gateway@v1.15.2
go get -u github.com/grpc-ecosystem/grpc-gateway/protoc-gen-swagger@v1.15.2
go get -u github.com/golang/protobuf/protoc-gen-go@v1.4.3
go get -u google.golang.org/grpc@v1.33.0
go env -w GO111MODULE=auto
```
## Compile Server
### Integrated CPU version paddle inference library
``` shell
mkdir server-build-cpu && cd server-build-cpu
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
-DSERVER=ON ..
make -j10
```
you can execute `make install` to put targets under directory `./output`, you need to add`-DCMAKE_INSTALL_PREFIX=./output`to specify output path to cmake command shown above.
### Integrated GPU version paddle inference library
Compared with CPU environment, GPU environment needs to refer to the following table,
**It should be noted that the following table is used as a reference for non-Docker compilation environment. The Docker compilation environment has been configured with relevant parameters and does not need to be specified in cmake process. **
| cmake environment variable | meaning | GPU environment considerations | whether Docker environment is needed |
|-----------------------|-------------------------------------|-------------------------------|--------------------|
| CUDA_TOOLKIT_ROOT_DIR | cuda installation path, usually /usr/local/cuda | Required for all environments | No (/usr/local/cuda) |
| CUDNN_LIBRARY | The directory where libcudnn.so.* is located, usually /usr/local/cuda/lib64/ | Required for all environments | No (/usr/local/cuda/lib64/) |
| CUDA_CUDART_LIBRARY | The directory where libcudart.so.* is located, usually /usr/local/cuda/lib64/ | Required for all environments | No (/usr/local/cuda/lib64/) |
| TENSORRT_ROOT | The upper level directory of the directory where libnvinfer.so.* is located, depends on the TensorRT installation directory | Cuda 9.0/10.0 does not need, other needs | No (/usr) |
If not in Docker environment, users can refer to the following execution methods. The specific path is subject to the current environment, and the code is only for reference.TENSORRT_LIBRARY_PATH is related to the TensorRT version and should be set according to the actual situation。For example, in the cuda10.1 environment, the TensorRT version is 6.0 (/usr/local/TensorRT6-cuda10.1-cudnn7/targets/x86_64-linux-gnu/),In the cuda10.2 and cuda11.0 environment, the TensorRT version is 7.1 (/usr/local/TensorRT-7.1.3.4/targets/x86_64-linux-gnu/).
``` shell
export CUDA_PATH='/usr/local/cuda'
export CUDNN_LIBRARY='/usr/local/cuda/lib64/'
export CUDA_CUDART_LIBRARY="/usr/local/cuda/lib64/"
export TENSORRT_LIBRARY_PATH="/usr/local/TensorRT6-cuda10.1-cudnn7/targets/x86_64-linux-gnu/"
mkdir server-build-gpu && cd server-build-gpu
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
-DCUDA_TOOLKIT_ROOT_DIR=${CUDA_PATH} \
-DCUDNN_LIBRARY=${CUDNN_LIBRARY} \
-DCUDA_CUDART_LIBRARY=${CUDA_CUDART_LIBRARY} \
-DTENSORRT_ROOT=${TENSORRT_LIBRARY_PATH} \
-DSERVER=ON \
-DWITH_GPU=ON ..
make -j10
```
Execute `make install` to put the target output in the `./output` directory.
### Compile C++ Server under the condition of WITH_OPENCV=ON
**Note:** Only when you need to redevelop the paddle serving C + + part, and the new code depends on the OpenCV library, you need to do so.
First of all , OpenCV library should be installed, if not, please refer to the `Compile and install OpenCV` section later in this article.
In the compile command, add `DOPENCV_DIR=${OPENCV_DIR}` and `DWITH_OPENCV=ON`,for example:
``` shell
OPENCV_DIR=your_opencv_dir #`your_opencv_dir` is the installation path of OpenCV library。
mkdir server-build-cpu && cd server-build-cpu
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR/ \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
-DOPENCV_DIR=${OPENCV_DIR} \
-DWITH_OPENCV=ON \
-DSERVER=ON ..
make -j10
```
**Note:** After the compilation is successful, you need to set the `SERVING_BIN` path, see the following [Notes](https://github.com/PaddlePaddle/Serving/blob/develop/doc/COMPILE.md#Notes).
## Compile Client
``` shell
mkdir client-build && cd client-build
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
-DCLIENT=ON ..
make -j10
```
execute `make install` to put targets under directory `./output`
## Compile the App
```bash
mkdir app-build && cd app-build
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
-DAPP=ON ..
make
```
## Install wheel package
Regardless of the client, server or App part, after compiling, install the whl package in `python/dist/` in the temporary directory(`server-build-cpu`, `server-build-gpu`, `client-build`,`app-build`) of the compilation process.
for example:cd server-build-cpu/python/dist && pip install -U xxxxx.whl
## Note
When running the python server, it will check the `SERVING_BIN` environment variable. If you want to use your own compiled binary file, set the environment variable to the path of the corresponding binary file, usually`export SERVING_BIN=${BUILD_DIR}/core/general-server/serving`.
BUILD_DIR is the absolute path of server build CPU or server build GPU。
for example: cd server-build-cpu && export SERVING_BIN=${PWD}/core/general-server/serving
## Verify
Please use the example under `python/examples` to verify.
## CMake Option Description
| Compile Options | Description | Default |
| :--------------: | :----------------------------------------: | :--: |
| WITH_AVX | Compile Paddle Serving with AVX intrinsics | OFF |
| WITH_MKL | Compile Paddle Serving with MKL support | OFF |
| WITH_GPU | Compile Paddle Serving with NVIDIA GPU | OFF |
| WITH_OPENCV | Compile Paddle Serving with OPENCV | OFF |
| CUDNN_LIBRARY | Define CuDNN library and header path | |
| CUDA_TOOLKIT_ROOT_DIR | Define CUDA PATH | |
| TENSORRT_ROOT | Define TensorRT PATH | |
| CLIENT | Compile Paddle Serving Client | OFF |
| SERVER | Compile Paddle Serving Server | OFF |
| APP | Compile Paddle Serving App package | OFF |
| PACK | Compile for whl | OFF |
### WITH_GPU Option
Paddle Serving supports prediction on the GPU through the PaddlePaddle inference library. The WITH_GPU option is used to detect basic libraries such as CUDA/CUDNN on the system. If an appropriate version is detected, the GPU Kernel will be compiled when PaddlePaddle is compiled.
To compile the Paddle Serving GPU version on bare metal, you need to install these basic libraries:
- CUDA
- CuDNN
To compile the TensorRT version, you need to install the TensorRT library.
Note here:
1. The basic library versions such as CUDA/CUDNN installed on the system where Serving is compiled, needs to be compatible with the actual GPU device. For example, the Tesla V100 card requires at least CUDA 9.0. If the version of the basic library such as CUDA used during compilation is too low, the generated GPU code is not compatible with the actual hardware device, which will cause the Serving process to fail to start or serious problems such as coredump.
2. Install the CUDA driver compatible with the actual GPU device on the system running Paddle Serving, and install the basic library compatible with the CUDA/CuDNN version used during compilation. If the version of CUDA/CuDNN installed on the system running Paddle Serving is lower than the version used at compile time, it may cause some cuda function call failures and other problems.
The following is the base library version matching relationship used by the PaddlePaddle release version for reference:
| | CUDA | CuDNN | TensorRT |
| :----: | :-----: | :----------: | :----: |
| post101 | 10.1 | CuDNN 7.6.5 | 6.0.1 |
| post102 | 10.2 | CuDNN 8.0.5 | 7.1.3 |
| post11 | 11.0 | CuDNN 8.0.4 | 7.1.3 |
### How to make the compiler detect the CuDNN library
Download the corresponding CUDNN version from NVIDIA developer official website and decompressing it, add `-DCUDNN_ROOT` to cmake command, to specify the path of CUDNN.
## Compile and install OpenCV
**Note:** You need to do this only if you need to import the opencv library into your C + + code.
* First of all, you need to download the source code compiled package in the Linux environment from the OpenCV official website. Taking OpenCV3.4.7 as an example, the download command is as follows.
```
wget https://github.com/opencv/opencv/archive/3.4.7.tar.gz
tar -xf 3.4.7.tar.gz
```
Finally, you can see the folder of `opencv-3.4.7/` in the current directory.
* Compile OpenCV, the OpenCV source path (`root_path`) and installation path (`install_path`) should be set by yourself. Enter the OpenCV source code path and compile it in the following way.
```shell
root_path=your_opencv_root_path
install_path=${root_path}/opencv3
rm -rf build
mkdir build
cd build
cmake .. \
-DCMAKE_INSTALL_PREFIX=${install_path} \
-DCMAKE_BUILD_TYPE=Release \
-DBUILD_SHARED_LIBS=OFF \
-DWITH_IPP=OFF \
-DBUILD_IPP_IW=OFF \
-DWITH_LAPACK=OFF \
-DWITH_EIGEN=OFF \
-DCMAKE_INSTALL_LIBDIR=lib64 \
-DWITH_ZLIB=ON \
-DBUILD_ZLIB=ON \
-DWITH_JPEG=ON \
-DBUILD_JPEG=ON \
-DWITH_PNG=ON \
-DBUILD_PNG=ON \
-DWITH_TIFF=ON \
-DBUILD_TIFF=ON
make -j
make install
```
Among them, `root_path` is the downloaded OpenCV source code path, and `install_path` is the installation path of OpenCV. After `make install` is completed, the OpenCV header file and library file will be generated in this folder for later source code compilation.
The final file structure under the OpenCV installation path is as follows.
```
opencv3/
|-- bin
|-- include
|-- lib
|-- lib64
|-- share
```
# 如何编译PaddleServing # 如何编译PaddleServing
(简体中文|[English](./COMPILE.md)) (简体中文|[English](./Compile_EN.md))
## 编译环境设置 ## 总体概述
编译Paddle Serving一共分以下几步
- 编译环境准备:根据模型和运行环境的需要,选择最合适的镜像
- 下载代码库:下载Serving代码库,按需要执行初始化操作
- 环境变量准备:根据运行环境的需要,确定Python各个环境变量,如GPU环境还需要确定Cuda,Cudnn,TensorRT等环境变量。
- 正式编译: 编译`paddle-serving-server`, `paddle-serving-client`, `paddle-serving-app`相关whl包
- 安装相关whl包:安装编译出的三个whl包,并设置SERVING_BIN环境变量
此外,针对某些C++二次开发场景,我们也提供了OPENCV的联编方案。
## 编译环境准备
| 组件 | 版本要求 | | 组件 | 版本要求 |
| :--------------------------: | :-------------------------------: | | :--------------------------: | :-------------------------------: |
...@@ -11,7 +26,7 @@ ...@@ -11,7 +26,7 @@
| gcc-c++ | 5.4.0(Cuda 10.1) and 8.2.0 | | gcc-c++ | 5.4.0(Cuda 10.1) and 8.2.0 |
| cmake | 3.2.0 and later | | cmake | 3.2.0 and later |
| Python | 3.6.0 and later | | Python | 3.6.0 and later |
| Go | 1.9.2 and later | | Go | 1.17.2 and later |
| git | 2.17.1 and later | | git | 2.17.1 and later |
| glibc-static | 2.17 | | glibc-static | 2.17 |
| openssl-devel | 1.0.2k | | openssl-devel | 1.0.2k |
...@@ -23,109 +38,151 @@ ...@@ -23,109 +38,151 @@
| libSM | 1.2.2 | | libSM | 1.2.2 |
| libXrender | 0.9.10 | | libXrender | 0.9.10 |
推荐使用Docker编译,我们已经为您准备好了Paddle Serving编译环境并配置好了上述编译依赖,详见[该文档](DOCKER_IMAGES_CN.md) 推荐使用Docker编译,我们已经为您准备好了Paddle Serving编译环境并配置好了上述编译依赖,详见[该文档](Docker_Images_CN.md)
## 获取代码 我们提供了五个环境的开发镜像,分别是CPU, Cuda10.1+Cudnn7, Cuda10.2+Cudnn7,Cuda10.2+Cudnn8, Cuda11.2+Cudnn8。我们提供了Serving开发镜像涵盖以上环境。与此同时,我们也支持Paddle开发镜像。
``` python 其中Serving镜像名是 **paddlepaddle/serving:${Serving开发镜像Tag}**(如果网络不佳可以访问**registry.baidubce.com/paddlepaddle/serving:${Serving开发镜像Tag}**), Paddle开发镜像名是 **paddlepaddle/paddle:${Paddle开发镜像Tag}**。为了防止用户对两套镜像出现混淆,我们分别解释一下两套镜像的由来。
git clone https://github.com/PaddlePaddle/Serving
cd Serving && git submodule update --init --recursive Serving开发镜像是Serving套件为了支持各个预测环境提供的用于编译、调试预测服务的镜像,Paddle开发镜像是Paddle在官网发布的用于编译、开发、训练模型使用镜像。为了让Paddle开发者能够在同一个容器内直接使用Serving。对于上个版本就已经使用Serving用户的开发者来说,Serving开发镜像应该不会感到陌生。但对于熟悉Paddle训练框架生态的开发者,目前应该更熟悉已有的Paddle开发镜像。为了适应所有用户的不同习惯,我们对这两套镜像都做了充分的支持。
| 环境 | Serving开发镜像Tag | 操作系统 | Paddle开发镜像Tag | 操作系统 |
| :--------------------------: | :-------------------------------: | :-------------: | :-------------------: | :----------------: |
| CPU | 0.7.0-devel | Ubuntu 16.04 | 2.2.0 | Ubuntu 18.04. |
| Cuda10.1+Cudnn7 | 0.7.0-cuda10.1-cudnn7-devel | Ubuntu 16.04 | 无 | 无 |
| Cuda10.2+Cudnn7 | 0.7.0-cuda10.2-cudnn7-devel | Ubuntu 16.04 | 2.2.0-cuda10.2-cudnn7 | Ubuntu 16.04 |
| Cuda10.2+Cudnn8 | 0.7.0-cuda10.2-cudnn8-devel | Ubuntu 16.04 | 无 | 无 |
| Cuda11.2+Cudnn8 | 0.7.0-cuda11.2-cudnn8-devel | Ubuntu 16.04 | 2.2.0-cuda11.2-cudnn8 | Ubuntu 18.04 |
我们首先要针对自己所需的环境拉取相关镜像。上表**环境**一列下,除了CPU,其余(Cuda**+Cudnn**)都属于GPU环境。
您可以使用Serving开发镜像。
``` ```
docker pull paddlepaddle/serving:${Serving开发镜像Tag}
## PYTHONROOT设置 # 如果是GPU镜像
nvidia-docker run --rm -it paddlepaddle/serving:${Serving开发镜像Tag} bash
```shell # 如果是CPU镜像
# 例如python的路径为/usr/bin/python,可以设置PYTHONROOT docker run --rm -it paddlepaddle/serving:${Serving开发镜像Tag} bash
export PYTHONROOT=/usr
``` ```
如果您使用的是Docker开发镜像,请按照如下,确定好需要编译的Python版本,设置对应的环境变量 也可以使用Paddle开发镜像。
``` ```
#Python3.6 docker pull paddlepaddle/paddle:${Paddle开发镜像Tag}
export PYTHONROOT=/usr/local/
export PYTHON_INCLUDE_DIR=$PYTHONROOT/include/python3.6m # 如果是GPU镜像,需要使用nvidia-docker
export PYTHON_LIBRARIES=$PYTHONROOT/lib/libpython3.6m.so nvidia-docker run --rm -it paddlepaddle/paddle:${Paddle开发镜像Tag} bash
export PYTHON_EXECUTABLE=$PYTHONROOT/bin/python3.6
#Python3.7
export PYTHONROOT=/usr/local/
export PYTHON_INCLUDE_DIR=$PYTHONROOT/include/python3.7m
export PYTHON_LIBRARIES=$PYTHONROOT/lib/libpython3.7m.so
export PYTHON_EXECUTABLE=$PYTHONROOT/bin/python3.7
#Python3.8
export PYTHONROOT=/usr/local/
export PYTHON_INCLUDE_DIR=$PYTHONROOT/include/python3.8
export PYTHON_LIBRARIES=$PYTHONROOT/lib/libpython3.8.so
export PYTHON_EXECUTABLE=$PYTHONROOT/bin/python3.8
# 如果是CPU镜像
docker run --rm -it paddlepaddle/paddle:${Paddle开发镜像Tag} bash
``` ```
## 安装Python依赖
```shell ## 下载代码库
pip install -r python/requirements.txt -i https://mirror.baidu.com/pypi/simple **注明: 如果您正在使用Paddle开发镜像,需要在下载代码库后手动运行`bash env_install.sh`(如代码框的第三行所示)**
``` ```
git clone https://github.com/PaddlePaddle/Serving
cd Serving && git submodule update --init --recursive
如果使用其他Python版本,请使用对应版本的`pip` # Paddle开发镜像需要运行如下命令,Serving开发镜像不需要运行
bash tools/paddle_env_install.sh
```
## GOPATH 设置 ## 环境变量准备
默认 GOPATH 设置为 `$HOME/go`,您也可以设置为其他值。** 如果是Serving提供的Docker环境,可以不需要设置。** **设置PYTHON环境变量**
```shell
export GOPATH=$HOME/go 如果您使用的是Serving开发镜像,请按照如下,确定好需要编译的Python版本,设置对应的环境变量,一共需要设置三个环境变量,分别是`PYTHON_INCLUDE_DIR`, `PYTHON_LIBRARIES`, `PYTHON_EXECUTABLE`。以下我们以python 3.7为例,介绍如何设置这三个环境变量。
export PATH=$PATH:$GOPATH/bin
1) 设置`PYTHON_INCLUDE_DIR`
搜索Python.h 所在的目录
```
find / -name Python.h
``` ```
通常会有类似于`**/include/python3.7/Python.h`出现,我们只需要取它的文件夹目录就好,比如找到`/usr/include/python3.7/Python.h`,那么我们只需要`export PYTHON_INCLUDE_DIR=/usr/include/python3.7/`就好。
如果没有找到。说明 1)没有安装开发版本的Python,需重新安装 2)权限不足无法查看相关系统目录。
## 获取 Go packages 2) 设置`PYTHON_LIBRARIES`
```shell 搜索 libpython3.7.so
go env -w GO111MODULE=on ```
go env -w GOPROXY=https://goproxy.cn,direct find / -name libpython3.7.so
go get -u github.com/grpc-ecosystem/grpc-gateway/protoc-gen-grpc-gateway@v1.15.2
go get -u github.com/grpc-ecosystem/grpc-gateway/protoc-gen-swagger@v1.15.2
go get -u github.com/golang/protobuf/protoc-gen-go@v1.4.3
go get -u google.golang.org/grpc@v1.33.0
go env -w GO111MODULE=auto
``` ```
通常会有类似于`**/lib/libpython3.7.so`或者`**/lib/x86_64-linux-gnu/libpython3.7.so`出现,我们只需要取它的文件夹目录就好,比如找到`/usr/local/lib/libpython3.7.so`,那么我们只需要`export PYTHON_LIBRARIES=/usr/local/lib`就好。
如果没有找到,说明 1)静态编译Python,需要重新安装动态编译的Python 2)全县不足无法查看相关系统目录。
3) 设置`PYTHON_EXECUTABLE`
## 编译Server部分 直接查看python3.7路径
```
which python3.7
```
假如结果是`/usr/local/bin/python3.7`,那么直接设置`export PYTHON_EXECUTABLE=/usr/local/bin/python3.7`
### 集成CPU版本Paddle Inference Library 设置好这三个环境变量至关重要,设置完成后,我们便可以执行下列操作(以下是Paddle Cuda 11.2的开发镜像的PYTHON环境,如果是其他镜像,请更改相应的`PYTHON_INCLUDE_DIR`, `PYTHON_LIBRARIES`, `PYTHON_EXECUTABLE`)。
``` shell
mkdir server-build-cpu && cd server-build-cpu
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR/ \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
-DSERVER=ON ..
make -j10
``` ```
# 以下三个环境变量是Paddle开发镜像Cuda11.2的环境,如其他镜像可能需要修改
export PYTHON_INCLUDE_DIR=/usr/include/python3.7m/
export PYTHON_LIBRARIES=/usr/lib/x86_64-linux-gnu/libpython3.7m.so
export PYTHON_EXECUTABLE=/usr/bin/python3.7
可以执行`make install`把目标产出放在`./output`目录下,cmake阶段需添加`-DCMAKE_INSTALL_PREFIX=./output`选项来指定存放路径。 export GOPATH=$HOME/go
export PATH=$PATH:$GOPATH/bin
### 集成GPU版本Paddle Inference Library python -m install -r python/requirements.txt
go env -w GO111MODULE=on
go env -w GOPROXY=https://goproxy.cn,direct
go install github.com/grpc-ecosystem/grpc-gateway/protoc-gen-grpc-gateway@v1.15.2
go install github.com/grpc-ecosystem/grpc-gateway/protoc-gen-swagger@v1.15.2
go install github.com/golang/protobuf/protoc-gen-go@v1.4.3
go install google.golang.org/grpc@v1.33.0
go env -w GO111MODULE=auto
```
相比CPU环境,GPU环境需要参考以下表格, 如果您是GPU用户需要额外设置`CUDA_PATH`, `CUDNN_LIBRARY`, `CUDA_CUDART_LIBRARY``TENSORRT_LIBRARY_PATH`
**需要说明的是,以下表格对非Docker编译环境作为参考,Docker编译环境已经配置好相关参数,无需在cmake过程指定。** ```
export CUDA_PATH='/usr/local/cuda'
export CUDNN_LIBRARY='/usr/local/cuda/lib64/'
export CUDA_CUDART_LIBRARY="/usr/local/cuda/lib64/"
export TENSORRT_LIBRARY_PATH="/usr/"
```
环境变量的含义如下表所示。
| cmake环境变量 | 含义 | GPU环境注意事项 | Docker环境是否需要 | | cmake环境变量 | 含义 | GPU环境注意事项 | Docker环境是否需要 |
|-----------------------|-------------------------------------|-------------------------------|--------------------| |-----------------------|-------------------------------------|-------------------------------|--------------------|
| CUDA_TOOLKIT_ROOT_DIR | cuda安装路径,通常为/usr/local/cuda | 全部环境都需要 | 否(/usr/local/cuda) | | CUDA_TOOLKIT_ROOT_DIR | cuda安装路径,通常为/usr/local/cuda | 全部GPU环境都需要 | 否(/usr/local/cuda) |
| CUDNN_LIBRARY | libcudnn.so.*所在目录,通常为/usr/local/cuda/lib64/ | 全部环境都需要 | 否(/usr/local/cuda/lib64/) | | CUDNN_LIBRARY | libcudnn.so.*所在目录,通常为/usr/local/cuda/lib64/ | 全部GPU环境都需要 | 否(/usr/local/cuda/lib64/) |
| CUDA_CUDART_LIBRARY | libcudart.so.*所在目录,通常为/usr/local/cuda/lib64/ | 全部环境都需要 | 否(/usr/local/cuda/lib64/) | | CUDA_CUDART_LIBRARY | libcudart.so.*所在目录,通常为/usr/local/cuda/lib64/ | 全部GPU环境都需要 | 否(/usr/local/cuda/lib64/) |
| TENSORRT_ROOT | libnvinfer.so.*所在目录的上一级目录,取决于TensorRT安装目录 | Cuda 9.0/10.0不需要,其他需要 | 否(/usr) | | TENSORRT_ROOT | libnvinfer.so.*所在目录的上一级目录,取决于TensorRT安装目录 | 全部GPU环境都需要 | 否(/usr) |
非Docker环境下,用户可以参考如下执行方式,具体的路径以当时环境为准,代码仅作为参考。TENSORRT_LIBRARY_PATH和TensorRT版本有关,要根据实际情况设置。例如在cuda10.1环境下TensorRT版本是6.0(/usr/local/TensorRT6-cuda10.1-cudnn7/targets/x86_64-linux-gnu/),在cuda10.2和cuda11.0环境下TensorRT版本是7.1(/usr/local/TensorRT-7.1.3.4/targets/x86_64-linux-gnu/)。
``` shell
export CUDA_PATH='/usr/local/cuda'
export CUDNN_LIBRARY='/usr/local/cuda/lib64/'
export CUDA_CUDART_LIBRARY="/usr/local/cuda/lib64/"
export TENSORRT_LIBRARY_PATH="/usr/local/TensorRT6-cuda10.1-cudnn7/targets/x86_64-linux-gnu/"
mkdir server-build-gpu && cd server-build-gpu ## 正式编译
我们一共需要编译三个目标,分别是`paddle-serving-server`, `paddle-serving-client`, `paddle-serving-app`,其中`paddle-serving-server`需要区分CPU或者GPU版本。如果是CPU版本请运行,
### 编译paddle-serving-server
```
mkdir build_server
cd build_server
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
-DSERVER=ON \
-DWITH_GPU=OFF ..
make -j20
cd ..
```
如果是GPU版本,请运行,
```
mkdir build_server
cd build_server
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \ cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \ -DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \ -DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
...@@ -135,83 +192,76 @@ cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \ ...@@ -135,83 +192,76 @@ cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
-DTENSORRT_ROOT=${TENSORRT_LIBRARY_PATH} \ -DTENSORRT_ROOT=${TENSORRT_LIBRARY_PATH} \
-DSERVER=ON \ -DSERVER=ON \
-DWITH_GPU=ON .. -DWITH_GPU=ON ..
make -j10 make -j20
``` cd ..
```
执行`make install`可以把目标产出放在`./output`目录下。
### 开启WITH_OPENCV选项编译C++ Server ### 编译paddle-serving-client 和 paddle-serving-app
**注意:** 只有当您需要对Paddle Serving C++部分进行二次开发,且新增的代码依赖于OpenCV库时,您才需要这样做。
编译Serving C++ Server部分,开启WITH_OPENCV选项时,需要已安装的OpenCV库,若尚未安装,可参考本文档后面的说明编译安装OpenCV库。
以开启WITH_OPENCV选项,编译CPU版本Paddle Inference Library为例,在上述编译命令基础上,加入`DOPENCV_DIR=${OPENCV_DIR}``DWITH_OPENCV=ON`选项。 接下来,我们继续编译client和app就可以了,这两个包的编译命令在所有平台通用,不区分CPU和GPU的版本。
``` shell
OPENCV_DIR=your_opencv_dir #`your_opencv_dir`为opencv库的安装路径。
mkdir server-build-cpu && cd server-build-cpu
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR/ \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
-DOPENCV_DIR=${OPENCV_DIR} \
-DWITH_OPENCV=ON \
-DSERVER=ON ..
make -j10
``` ```
# 编译paddle-serving-client
**注意:** 编译成功后,需要设置`SERVING_BIN`路径,详见后面的[注意事项](https://github.com/PaddlePaddle/Serving/blob/develop/doc/COMPILE_CN.md#注意事项) mkdir build_client
cd build_client
## 编译Client部分
``` shell
mkdir client-build && cd client-build
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \ cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \ -DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \ -DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
-DCLIENT=ON .. -DCLIENT=ON ..
make -j10 make -j10
``` cd ..
执行`make install`可以把目标产出放在`./output`目录下。
## 编译App部分
```bash # 编译paddle-serving-app
mkdir app-build && cd app-build mkdir build_app
cd build_app
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \ cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \ -DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \ -DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
-DAPP=ON .. -DAPP=ON ..
make make -j10
cd ..
``` ```
## 安装相关whl包
```
pip3.7 install -r build_server/python/dist/*.whl
pip3.7 install -r build_client/python/dist/*.whl
pip3.7 install -r build_app/python/dist/*.whl
export SERVING_BIN=${PWD}/build_server/core/general-server/serving
```
## 注意事项
## 安装wheel包 注意到上一小节的最后一行`export SERVING_BIN`,运行python端Server时,会检查`SERVING_BIN`环境变量,如果想使用自己编译的二进制文件,请将设置该环境变量为对应二进制文件的路径,通常是`export SERVING_BIN=${BUILD_DIR}/core/general-server/serving`
其中BUILD_DIR为`build_server`的绝对路径。
无论是Client端,Server端还是App部分,编译完成后,安装编译过程临时目录(`server-build-cpu``server-build-gpu``client-build``app-build`)下的`python/dist/` 中的whl包即可。 可以cd build_server路径下,执行`export SERVING_BIN=${PWD}/core/general-server/serving`
例如:cd server-build-cpu/python/dist && pip install -U xxxxx.whl
## 开启WITH_OPENCV选项编译C++ Server
## 注意事项 **注意:** 只有当您需要对Paddle Serving C++部分进行二次开发,且新增的代码依赖于OpenCV库时,您才需要这样做。
运行python端Server时,会检查`SERVING_BIN`环境变量,如果想使用自己编译的二进制文件,请将设置该环境变量为对应二进制文件的路径,通常是`export SERVING_BIN=${BUILD_DIR}/core/general-server/serving` 编译Serving C++ Server部分,开启WITH_OPENCV选项时,需要已安装的OpenCV库,若尚未安装,可参考本文档后面的说明编译安装OpenCV库。
其中BUILD_DIR为server-build-cpu或server-build-gpu的绝对路径。
可以cd server-build-cpu路径下,执行`export SERVING_BIN=${PWD}/core/general-server/serving`
以开启WITH_OPENCV选项,编译CPU版本Paddle Inference Library为例,在上述编译命令基础上,加入`DOPENCV_DIR=${OPENCV_DIR}``DWITH_OPENCV=ON`选项。
``` shell
OPENCV_DIR=your_opencv_dir #`your_opencv_dir`为opencv库的安装路径。
mkdir build_server && cd build_server
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR/ \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
-DOPENCV_DIR=${OPENCV_DIR} \
-DWITH_OPENCV=ON \
-DSERVER=ON ..
make -j10
```
**注意:** 编译成功后,需要设置`SERVING_BIN`路径,详见后面的[注意事项](https://github.com/PaddlePaddle/Serving/blob/develop/doc/COMPILE_CN.md#注意事项)
## 如何验证
请使用 `python/examples` 下的例子进行验证。
## CMake选项说明 ## 附:CMake选项说明
| 编译选项 | 说明 | 默认 | | 编译选项 | 说明 | 默认 |
| :--------------: | :----------------------------------------: | :--: | | :--------------: | :----------------------------------------: | :--: |
...@@ -252,11 +302,11 @@ Paddle Serving通过PaddlePaddle预测库支持在GPU上做预测。WITH_GPU选 ...@@ -252,11 +302,11 @@ Paddle Serving通过PaddlePaddle预测库支持在GPU上做预测。WITH_GPU选
| post102 | 10.2 | CuDNN 8.0.5 | 7.1.3 | | post102 | 10.2 | CuDNN 8.0.5 | 7.1.3 |
| post11 | 11.0 | CuDNN 8.0.4 | 7.1.3 | | post11 | 11.0 | CuDNN 8.0.4 | 7.1.3 |
### 如何让Paddle Serving编译系统探测到CuDNN库 ### 附:如何让Paddle Serving编译系统探测到CuDNN库
从NVIDIA developer官网下载对应版本CuDNN并在本地解压后,在cmake编译命令中增加`-DCUDNN_LIBRARY`参数,指定CuDNN库所在路径。 从NVIDIA developer官网下载对应版本CuDNN并在本地解压后,在cmake编译命令中增加`-DCUDNN_LIBRARY`参数,指定CuDNN库所在路径。
## 编译安装OpenCV库 ## 附:编译安装OpenCV库
**注意:** 只有当您需要在C++代码中引入OpenCV库时,您才需要这样做。 **注意:** 只有当您需要在C++代码中引入OpenCV库时,您才需要这样做。
* 首先需要从OpenCV官网上下载在Linux环境下源码编译的包,以OpenCV3.4.7为例,下载命令如下。 * 首先需要从OpenCV官网上下载在Linux环境下源码编译的包,以OpenCV3.4.7为例,下载命令如下。
......
# How to compile PaddleServing
([简体中文](./Compile_CN.md)|English)
## Overview
Compiling Paddle Serving is divided into the following steps
- Compilation Environment Preparation: According to the needs of the model and operating environment, select the most suitable image
- Download the Serving Code Repo: Download the Serving code library, and perform initialization operations as needed
- Environment Variable Preparation: According to the needs of the running environment, determine the various environment variables of Python. For example, the GPU environment also needs to determine the environment variables such as Cuda, Cudnn, TensorRT and so on.
- Compilation: Compile `paddle-serving-server`, `paddle-serving-client`, `paddle-serving-app` related whl packages
- Install Related Whl Packages: install the three compiled whl packages, and set the SERVING_BIN environment variable
In addition, for some C++ secondary development scenarios, we also provide OPENCV binding solutions.
## Compilation Environment Requirements
| module | version |
| :--------------------------: | :-------------------------------: |
| OS | Ubuntu16 and 18/CentOS 7 |
| gcc | 5.4.0(Cuda 10.1) and 8.2.0 |
| gcc-c++ | 5.4.0(Cuda 10.1) and 8.2.0 |
| cmake | 3.2.0 and later |
| Python | 3.6.0 and later |
| Go | 1.17.2 and later |
| git | 2.17.1 and later |
| glibc-static | 2.17 |
| openssl-devel | 1.0.2k |
| bzip2-devel | 1.0.6 and later |
| python3-devel | 3.6.0 and later |
| sqlite-devel | 3.7.17 and later |
| patchelf | 0.9 |
| libXext | 1.3.3 |
| libSM | 1.2.2 |
| libXrender | 0.9.10 |
Docker compilation is recommended. We have prepared the Paddle Serving compilation environment for you and configured the above compilation dependencies. For details, please refer to [this document](DOCKER_IMAGES_CN.md).
We provide five environment development images, namely CPU, Cuda10.1+Cudnn7, Cuda10.2+Cudnn7, Cuda10.2+Cudnn8, Cuda11.2+Cudnn8. We provide a Serving development image to cover the above environment. At the same time, we also support Paddle development mirroring.
The Serving image name is **paddlepaddle/serving:${Serving development image Tag}** (If the network is not good, you can visit **registry.baidubce.com/paddlepaddle/serving:${Serving development image Tag}**), The name of the Paddle development image is **paddlepaddle/paddle:${Paddle Development Image Tag}**. In order to prevent users from confusing the two sets of mirroring, we explain the origin of the two sets of mirroring separately.
Serving development mirror is the mirror used to compile and debug prediction services provided by Serving suite in order to support various prediction environments. Paddle development mirror is the mirror used for compilation, development, and training models released by Paddle on the official website. In order to allow Paddle developers to use Serving directly in the same container. For developers who have already used Serving users in the previous version, Serving development image should not be unfamiliar. But for developers who are familiar with the Paddle training framework ecology, they should be more familiar with the existing Paddle development mirrors. In order to adapt to the different habits of all users, we have fully supported both sets of mirrors.
| Environment | Serving Dev Image Tag | OS | Paddle Dev Image Tag | OS |
| :--------------------------: | :-------------------------------: | :-------------: | :-------------------: | :----------------: |
| CPU | 0.7.0-devel | Ubuntu 16.04 | 2.2.0 | Ubuntu 18.04. |
| Cuda10.1+Cudnn7 | 0.7.0-cuda10.1-cudnn7-devel | Ubuntu 16.04 | Nan | Nan |
| Cuda10.2+Cudnn7 | 0.7.0-cuda10.2-cudnn7-devel | Ubuntu 16.04 | 2.2.0-cuda10.2-cudnn7 | Ubuntu 16.04 |
| Cuda10.2+Cudnn8 | 0.7.0-cuda10.2-cudnn8-devel | Ubuntu 16.04 | Nan | Nan |
| Cuda11.2+Cudnn8 | 0.7.0-cuda11.2-cudnn8-devel | Ubuntu 16.04 | 2.2.0-cuda11.2-cudnn8 | Ubuntu 18.04 |
We first need to pull related images for the environment we need. Under the **Environment** column in the above table, except for the CPU, the rest (Cuda**+Cudnn**) belong to the GPU environment.
You can use Serving Dev Images.
```
docker pull paddlepaddle/serving:${Serving Dev Image Tag}
# For GPU Image
nvidia-docker run --rm -it paddlepaddle/serving:${Serving Dev Image Tag} bash
# For CPU Image
docker run --rm -it paddlepaddle/serving:${Serving Dev Image Tag} bash
```
You can also use Paddle Dev Images.
## Download the Serving Code Repo
**Note: If you are using Paddle to develop the image, you need to manually run `bash env_install.sh` after downloading the code base (as shown in the third line of the code box)**
```
git clone https://github.com/PaddlePaddle/Serving
cd Serving && git submodule update --init --recursive
# Paddle development image needs to run the following commands, Serving development image does not need to run
bash tools/paddle_env_install.sh
```
## Environment Variables Preparation
**Set PYTHON environment variable**
If you are using a Serving development image, please follow the steps below to determine the Python version that needs to be compiled and set the corresponding environment variables. A total of three environment variables need to be set, namely `PYTHON_INCLUDE_DIR`, `PYTHON_LIBRARIES`, `PYTHON_EXECUTABLE`. Below we take python 3.7 as an example to introduce how to set these three environment variables.
1) Set `PYTHON_INCLUDE_DIR`
Search the directory where Python.h is located
```
find / -name Python.h
```
Usually there will be something like `**/include/python3.7/Python.h`, we only need to take its folder directory, for example, find `/usr/include/python3.7/Python.h`, Then we only need `export PYTHON_INCLUDE_DIR=/usr/include/python3.7/`.
If not found. Explanation 1) The development version of Python is not installed and needs to be re-installed. 2) Insufficient permissions cannot view the relevant system directories.
2) Set `PYTHON_LIBRARIES`
Search for libpython3.7.so
```
find / -name libpython3.7.so
```
Usually there will be something similar to `**/lib/libpython3.7.so` or `**/lib/x86_64-linux-gnu/libpython3.7.so`, we only need to take its folder directory, For example, find `/usr/local/lib/libpython3.7.so`, then we only need `export PYTHON_LIBRARIES=/usr/local/lib`.
If it is not found, it means 1) Statically compiling Python, you need to reinstall the dynamically compiled Python 2) The county is not enough to view the relevant system catalogs.
3) Set `PYTHON_EXECUTABLE`
View the python3.7 path directly
```
which python3.7
```
If the result is `/usr/local/bin/python3.7`, then directly set `export PYTHON_EXECUTABLE=/usr/local/bin/python3.7`.
It is very important to set these three environment variables. After the settings are completed, we can perform the following operations (the following is the PYTHON environment of the development image of Paddle Cuda 11.2, if it is another image, please change the corresponding `PYTHON_INCLUDE_DIR`, `PYTHON_LIBRARIES` , `PYTHON_EXECUTABLE`).
```
# The following three environment variables are the environment of Paddle development mirror Cuda11.2, such as other mirrors may need to be modified
export PYTHON_INCLUDE_DIR=/usr/include/python3.7m/
export PYTHON_LIBRARIES=/usr/lib/x86_64-linux-gnu/libpython3.7m.so
export PYTHON_EXECUTABLE=/usr/bin/python3.7
export GOPATH=$HOME/go
export PATH=$PATH:$GOPATH/bin
python -m install -r python/requirements.txt
go env -w GO111MODULE=on
go env -w GOPROXY=https://goproxy.cn,direct
go install github.com/grpc-ecosystem/grpc-gateway/protoc-gen-grpc-gateway@v1.15.2
go install github.com/grpc-ecosystem/grpc-gateway/protoc-gen-swagger@v1.15.2
go install github.com/golang/protobuf/protoc-gen-go@v1.4.3
go install google.golang.org/grpc@v1.33.0
go env -w GO111MODULE=auto
```
If you are a GPU user, you need to set additional `CUDA_PATH`, `CUDNN_LIBRARY`, `CUDA_CUDART_LIBRARY` and `TENSORRT_LIBRARY_PATH`.
```
export CUDA_PATH='/usr/local/cuda'
export CUDNN_LIBRARY='/usr/local/cuda/lib64/'
export CUDA_CUDART_LIBRARY="/usr/local/cuda/lib64/"
export TENSORRT_LIBRARY_PATH="/usr/"
```
The meaning of environment variables is shown in the table below.
| cmake environment variable | meaning | GPU environment considerations | whether Docker environment is needed |
|-----------------------|-------------------------------------|-------------------------------|--------------------|
| CUDA_TOOLKIT_ROOT_DIR | cuda installation path, usually /usr/local/cuda | Required for all environments | No (/usr/local/cuda) |
| CUDNN_LIBRARY | The directory where libcudnn.so.* is located, usually /usr/local/cuda/lib64/ | Required for all environments | No (/usr/local/cuda/lib64/) |
| CUDA_CUDART_LIBRARY | The directory where libcudart.so.* is located, usually /usr/local/cuda/lib64/ | Required for all environments | No (/usr/local/cuda/lib64/) |
| TENSORRT_ROOT | The upper level directory of the directory where libnvinfer.so.* is located, depends on the TensorRT installation directory | Required for all environments | No (/usr) |
## Compilation
We need to compile three targets in total, namely `paddle-serving-server`, `paddle-serving-client`, and `paddle-serving-app`, among which `paddle-serving-server` needs to distinguish between CPU or GPU version. If it is a CPU version, please run,
### Compile paddle-serving-server
```
mkdir build_server
cd build_server
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
-DSERVER=ON \
-DWITH_GPU=OFF ..
make -j20
cd ..
```
If it is the GPU version, please run,
```
mkdir build_server
cd build_server
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
-DCUDA_TOOLKIT_ROOT_DIR=${CUDA_PATH} \
-DCUDNN_LIBRARY=${CUDNN_LIBRARY} \
-DCUDA_CUDART_LIBRARY=${CUDA_CUDART_LIBRARY} \
-DTENSORRT_ROOT=${TENSORRT_LIBRARY_PATH} \
-DSERVER=ON \
-DWITH_GPU=ON ..
make -j20
cd ..
```
### Compile paddle-serving-client and paddle-serving-app
Next, we can continue to compile the client and app. The compilation commands for these two packages are common on all platforms, and do not distinguish between CPU and GPU versions.
```
# Compile paddle-serving-client
mkdir build_client
cd build_client
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
-DCLIENT=ON ..
make -j10
cd ..
# Compile paddle-serving-app
mkdir build_app
cd build_app
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
-DAPP=ON ..
make -j10
cd ..
```
## Install Related Whl Packages
```
pip3.7 install -r build_server/python/dist/*.whl
pip3.7 install -r build_client/python/dist/*.whl
pip3.7 install -r build_app/python/dist/*.whl
export SERVING_BIN=${PWD}/build_server/core/general-server/serving
```
## Precautions
Note the last line `export SERVING_BIN` in the previous section. When running the python server, the `SERVING_BIN` environment variable will be checked. If you want to use the binary file compiled by yourself, please set the environment variable to the path of the corresponding binary file. It is `export SERVING_BIN=${BUILD_DIR}/core/general-server/serving`.
Where BUILD_DIR is the absolute path of `build_server`.
You can cd build_server path and execute `export SERVING_BIN=${PWD}/core/general-server/serving`
## Enable WITH_OPENCV option to compile C++ Server
**Note:** You only need to do this when you need to do secondary development on the Paddle Serving C++ part and the newly added code depends on the OpenCV library.
To compile the Serving C++ Server part, when the WITH_OPENCV option is turned on, the installed OpenCV library is required. If it has not been installed, you can refer to the instructions at the back of this document to compile and install the OpenCV library.
Take the WITH_OPENCV option and compile the CPU version Paddle Inference Library as an example. On the basis of the above compilation command, add the `DOPENCV_DIR=${OPENCV_DIR}` and `DWITH_OPENCV=ON` options.
``` shell
OPENCV_DIR=your_opencv_dir #`your_opencv_dir` is the installation path of the opencv library.
mkdir build_server && cd build_server
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR/ \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
-DOPENCV_DIR=${OPENCV_DIR} \
-DWITH_OPENCV=ON \
-DSERVER=ON ..
make -j10
```
**Note:** After the compilation is successful, you need to set the `SERVING_BIN` path, see the following [Notes](https://github.com/PaddlePaddle/Serving/blob/develop/doc/COMPILE_CN.md#Notes) ).
## Attached: CMake option description
| Compilation Options | Description | Default |
| :--------------: | :------------------------------- ---------: | :--: |
| WITH_AVX | Compile Paddle Serving with AVX intrinsics | OFF |
| WITH_MKL | Compile Paddle Serving with MKL support | OFF |
| WITH_GPU | Compile Paddle Serving with NVIDIA GPU | OFF |
| WITH_TRT | Compile Paddle Serving with TensorRT | OFF |
| WITH_OPENCV | Compile Paddle Serving with OPENCV | OFF |
| CUDNN_LIBRARY | Define CuDNN library and header path | |
| CUDA_TOOLKIT_ROOT_DIR | Define CUDA PATH | |
| TENSORRT_ROOT | Define TensorRT PATH | |
| CLIENT | Compile Paddle Serving Client | OFF |
| SERVER | Compile Paddle Serving Server | OFF |
| APP | Compile Paddle Serving App package | OFF |
| PACK | Compile for whl | OFF |
### WITH_GPU option
Paddle Serving supports prediction on the GPU through the PaddlePaddle prediction library. The WITH_GPU option is used to detect basic libraries such as CUDA/CUDNN on the system. If a suitable version is detected, the GPU version of the OP Kernel will be compiled when the PaddlePaddle is compiled.
To compile the Paddle Serving GPU version on bare metal, you need to install these basic libraries:
-CUDA
-CuDNN
To compile the TensorRT version, you need to install the TensorRT library.
The things to note here are:
1. Compile the basic library versions such as CUDA/CUDNN installed on the system where Serving is located, and need to be compatible with the actual GPU device. For example, Tesla V100 card requires at least CUDA 9.0. If the version of basic libraries such as CUDA used during compilation is too low, the Serving process cannot be started due to the incompatibility between the generated GPU code and the actual hardware device, or serious problems such as coredump may occur.
2. Install the CUDA driver compatible with the actual GPU device on the system running Paddle Serving, and install the basic library compatible with the CUDA/CuDNN version used during compilation. If the version of CUDA/CuDNN installed on the system running Paddle Serving is lower than the version used during compilation, it may cause strange cuda function call failures and other problems.
The following is the matching relationship between PaddleServing mirrored Cuda, Cudnn, and TensorRT for reference:
| | CUDA | CuDNN | TensorRT |
| :----: | :-----: | :----------: | :----: |
| post101 | 10.1 | CuDNN 7.6.5 | 6.0.1 |
| post102 | 10.2 | CuDNN 8.0.5 | 7.1.3 |
| post11 | 11.0 | CuDNN 8.0.4 | 7.1.3 |
### Attachment: How to make the Paddle Serving compilation system detect the CuDNN library
After downloading the corresponding version of CuDNN from the official website of NVIDIA developer and decompressing it locally, add the `-DCUDNN_LIBRARY` parameter to the cmake compilation command and specify the path of the CuDNN library.
## Attachment: Compile and install OpenCV library
**Note:** You only need to do this when you need to include the OpenCV library in your C++ code.
* First, you need to download the package compiled from the source code in the Linux environment from the OpenCV official website. Take OpenCV 3.4.7 as an example. The download command is as follows.
```
wget https://github.com/opencv/opencv/archive/3.4.7.tar.gz
tar -xf 3.4.7.tar.gz
```
Finally, you can see the folder `opencv-3.4.7/` in the current directory.
* Compile OpenCV, set the OpenCV source path (`root_path`) and installation path (`install_path`). Enter the OpenCV source code path and compile in the following way.
```shell
root_path=your_opencv_root_path
install_path=${root_path}/opencv3
rm -rf build
mkdir build
cd build
cmake .. \
-DCMAKE_INSTALL_PREFIX=${install_path} \
-DCMAKE_BUILD_TYPE=Release \
-DBUILD_SHARED_LIBS=OFF \
-DWITH_IPP=OFF \
-DBUILD_IPP_IW=OFF \
-DWITH_LAPACK=OFF \
-DWITH_EIGEN=OFF \
-DCMAKE_INSTALL_LIBDIR=lib64 \
-DWITH_ZLIB=ON \
-DBUILD_ZLIB=ON \
-DWITH_JPEG=ON \
-DBUILD_JPEG=ON \
-DWITH_PNG=ON \
-DBUILD_PNG=ON \
-DWITH_TIFF=ON \
-DBUILD_TIFF=ON
make -j
make install
```
Among them, `root_path` is the downloaded OpenCV source path, `install_path` is the installation path of OpenCV, after the completion of `make install`, OpenCV header files and library files will be generated in this folder, which are used to compile the code that references the OpenCV library .
The final file structure under the installation path is as follows.
```
opencv3/
|-- bin
|-- include
|-- lib
|-- lib64
|-- share
```
...@@ -68,7 +68,7 @@ Paddle Serving uses this [Git branching model](http://nvie.com/posts/a-successfu ...@@ -68,7 +68,7 @@ Paddle Serving uses this [Git branching model](http://nvie.com/posts/a-successfu
1. Build and test 1. Build and test
Users can build Paddle Serving natively on Linux, see the [BUILD steps](https://github.com/PaddlePaddle/Serving/blob/develop/doc/COMPILE.md). Users can build Paddle Serving natively on Linux, see the [BUILD steps](Compile_EN.md).
1. Keep pulling 1. Keep pulling
......
# 稀疏参数索引服务Cube单机版使用指南 # 稀疏参数索引服务Cube单机版使用指南
(简体中文|[English](./CUBE_LOCAL.md)) (简体中文|[English](./Cube_Local_EN.md))
## 引言 ## 引言
在python/examples下有两个关于CTR的示例,他们分别是criteo_ctr, criteo_ctr_with_cube。前者是在训练时保存整个模型,包括稀疏参数。后者是将稀疏参数裁剪出来,保存成两个部分,一个是稀疏参数,另一个是稠密参数。由于在工业级的场景中,稀疏参数的规模非常大,达到10^9数量级。因此在一台机器上启动大规模稀疏参数预测是不实际的,因此我们引入百度多年来在稀疏参数索引领域的工业级产品Cube,提供分布式的稀疏参数服务。 在python/examples下有两个关于CTR的示例,他们分别是criteo_ctr, criteo_ctr_with_cube。前者是在训练时保存整个模型,包括稀疏参数。后者是将稀疏参数裁剪出来,保存成两个部分,一个是稀疏参数,另一个是稠密参数。由于在工业级的场景中,稀疏参数的规模非常大,达到10^9数量级。因此在一台机器上启动大规模稀疏参数预测是不实际的,因此我们引入百度多年来在稀疏参数索引领域的工业级产品Cube,提供分布式的稀疏参数服务。
<!--单机版Cube是分布式Cube的弱化版本,旨在方便开发者做实验和Demo时使用。如果有分布式稀疏参数服务的需求,请在读完此文档之后,继续阅读 [稀疏参数索引服务Cube使用指南](CUBE_LOCAL_CN.md)(正在建设中)。--> <!--单机版Cube是分布式Cube的弱化版本,旨在方便开发者做实验和Demo时使用。如果有分布式稀疏参数服务的需求,请在读完此文档之后,继续阅读 [稀疏参数索引服务Cube使用指南](Cube_Local_CN.md)(正在建设中)。-->
本文档使用的都是未经过任何压缩算法处理的原始模型,如果有量化模型上线需求,请阅读[Cube稀疏参数索引量化存储使用指南](./CUBE_QUANT_CN.md) 本文档使用的都是未经过任何压缩算法处理的原始模型,如果有量化模型上线需求,请阅读[Cube稀疏参数索引量化存储使用指南](./Cube_Quant_CN.md)
## 示例 ## 示例
python/example/criteo_ctr_with_cube下执行 Serving/examples/C++/PaddleRec/criteo_ctr_with_cube/下执行
``` ```
python local_train.py # 训练模型 python local_train.py # 训练模型
cp ../../../build_server/core/predictor/seq_generator seq_generator #复制Sequence File模型生成工具 cp ../../../build_server/core/predictor/seq_generator seq_generator #复制Sequence File模型生成工具
...@@ -96,7 +96,7 @@ cd cube ...@@ -96,7 +96,7 @@ cd cube
## 注: 配置文件 ## 注: 配置文件
python/examples/criteo_ctr_with_cube/cube/conf下的cube.conf示例,此文件被上述的cube-cli所使用,单机版用户可以直接使用不用关注此部分,它在分布式部署中更为重要。 Serving/examples/C++/PaddleRec/criteo_ctr_with_cube/cube/conf下的cube.conf示例,此文件被上述的cube-cli所使用,单机版用户可以直接使用不用关注此部分,它在分布式部署中更为重要。
``` ```
[{ [{
......
# Cube: Sparse Parameter Indexing Service (Local Mode) # Cube: Sparse Parameter Indexing Service (Local Mode)
([简体中文](./CUBE_LOCAL_CN.md)|English) ([简体中文](./Cube_Local_CN.md)|English)
## Overview ## Overview
There are two examples on CTR under python / examples, they are criteo_ctr, criteo_ctr_with_cube. The former is to save the entire model during training, including sparse parameters. The latter is to cut out the sparse parameters and save them into two parts, one is the sparse parameter and the other is the dense parameter. Because the scale of sparse parameters is very large in industrial cases, reaching the order of 10 ^ 9. Therefore, it is not practical to start large-scale sparse parameter prediction on one machine. Therefore, we introduced Baidu's industrial-grade product Cube to provide the sparse parameter service for many years to provide distributed sparse parameter services. There are two examples on CTR under python / examples, they are criteo_ctr, criteo_ctr_with_cube. The former is to save the entire model during training, including sparse parameters. The latter is to cut out the sparse parameters and save them into two parts, one is the sparse parameter and the other is the dense parameter. Because the scale of sparse parameters is very large in industrial cases, reaching the order of 10 ^ 9. Therefore, it is not practical to start large-scale sparse parameter prediction on one machine. Therefore, we introduced Baidu's industrial-grade product Cube to provide the sparse parameter service for many years to provide distributed sparse parameter services.
The local mode of Cube is different from distributed Cube, which is designed to be convenient for developers to use in experiments and demos. The local mode of Cube is different from distributed Cube, which is designed to be convenient for developers to use in experiments and demos.
<!--If there is a demand for distributed sparse parameter service, please continue reading [Quantization Storage on Cube Sparse Parameter Indexing](./CUBE_QUANT.md) after reading this document (still developing).--> <!--If there is a demand for distributed sparse parameter service, please continue reading [Quantization Storage on Cube Sparse Parameter Indexing](./Cube_Quant_EN.md) after reading this document (still developing).-->
This document uses the original model without any compression algorithm. If there is a need for a quantitative model to go online, please read the [Quantization Storage on Cube Sparse Parameter Indexing](./CUBE_QUANT.md) This document uses the original model without any compression algorithm. If there is a need for a quantitative model to go online, please read the [Quantization Storage on Cube Sparse Parameter Indexing](./Cube_Quant_EN.md)
## Example ## Example
in directory python/example/criteo_ctr_with_cube, run in directory Serving/examples/C++/PaddleRec/criteo_ctr_with_cube, run
``` ```
python local_train.py # train model python local_train.py # train model
...@@ -95,7 +95,7 @@ If you see that each key has a corresponding value output, it means that the del ...@@ -95,7 +95,7 @@ If you see that each key has a corresponding value output, it means that the del
## Appendix: Configuration ## Appendix: Configuration
the config file is cube.config located in python/examples/criteo_ctr_with_cube/cube/conf, this file is used by cube-cli.the Cube Local Mode users do not need to understand that just use it, it would be quite important in Cube Distributed Mode. the config file is cube.config located in Serving/examples/C++/PaddleRec/criteo_ctr_with_cube/cube/conf, this file is used by cube-cli.the Cube Local Mode users do not need to understand that just use it, it would be quite important in Cube Distributed Mode.
``` ```
[{ [{
......
# Cube稀疏参数索引量化存储使用指南 # Cube稀疏参数索引量化存储使用指南
(简体中文|[English](./CUBE_QUANT.md)) (简体中文|[English](./Cube_Quant_EN.md))
## 总体概览 ## 总体概览
...@@ -9,7 +9,7 @@ ...@@ -9,7 +9,7 @@
## 前序要求 ## 前序要求
请先读取 [稀疏参数索引服务Cube单机版使用指南](./CUBE_LOCAL_CN.md) 请先读取 [稀疏参数索引服务Cube单机版使用指南](./Cube_Local_CN.md)
## 组件介绍 ## 组件介绍
...@@ -22,7 +22,7 @@ ...@@ -22,7 +22,7 @@
在Serving主目录下,到criteo_ctr_with_cube目录下训练出模型 在Serving主目录下,到criteo_ctr_with_cube目录下训练出模型
``` ```
cd python/examples/criteo_ctr_with_cube cd Serving/examples/C++/PaddleRec/criteo_ctr_with_cube
python local_train.py # 生成模型 python local_train.py # 生成模型
``` ```
接下来可以使用量化和非量化两种方式去生成Sequence File用于Cube稀疏参数索引。 接下来可以使用量化和非量化两种方式去生成Sequence File用于Cube稀疏参数索引。
...@@ -34,11 +34,11 @@ seq_generator ctr_serving_model/SparseFeatFactors ./cube_model/feature 8 #量化 ...@@ -34,11 +34,11 @@ seq_generator ctr_serving_model/SparseFeatFactors ./cube_model/feature 8 #量化
## 用量化模型启动Serving ## 用量化模型启动Serving
在Serving当中,使用general_dist_kv_quant_infer op来进行预测时使用量化模型。具体详见 python/examples/criteo_ctr_with_cube/test_server_quant.py。客户端部分不需要做任何改动。 在Serving当中,使用general_dist_kv_quant_infer op来进行预测时使用量化模型。具体详见 Serving/examples/C++/PaddleRec/criteo_ctr_with_cube/test_server_quant.py。客户端部分不需要做任何改动。
为方便用户做demo,我们给出了从0开始启动量化模型Serving。 为方便用户做demo,我们给出了从0开始启动量化模型Serving。
``` ```
cd python/examples/criteo_ctr_with_cube cd Serving/examples/C++/PaddleRec/criteo_ctr_with_cube
python local_train.py python local_train.py
cp ../../../build_server/core/predictor/seq_generator seq_generator cp ../../../build_server/core/predictor/seq_generator seq_generator
cp ../../../build_server/output/bin/cube* ./cube/ cp ../../../build_server/output/bin/cube* ./cube/
......
# Quantization Storage on Cube Sparse Parameter Indexing # Quantization Storage on Cube Sparse Parameter Indexing
([简体中文](./CUBE_QUANT_CN.md)|English) ([简体中文](./Cube_Quant_CN.md)|English)
## Overview ## Overview
...@@ -8,7 +8,7 @@ In our previous article, we know that the sparse parameter is a series of floati ...@@ -8,7 +8,7 @@ In our previous article, we know that the sparse parameter is a series of floati
## Precondition ## Precondition
Please Read [Cube: Sparse Parameter Indexing Service (Local Mode)](./CUBE_LOCAL_CN.md) Please Read [Cube: Sparse Parameter Indexing Service (Local Mode)](./Cube_Local_EN.md)
## Components ## Components
...@@ -21,7 +21,7 @@ This tool is used to convert the Paddle model into a Sequence File. Here, two mo ...@@ -21,7 +21,7 @@ This tool is used to convert the Paddle model into a Sequence File. Here, two mo
In Serving Directory,train the model in the criteo_ctr_with_cube directory In Serving Directory,train the model in the criteo_ctr_with_cube directory
``` ```
cd python/examples/criteo_ctr_with_cube cd Serving/examples/C++/PaddleRec/criteo_ctr_with_cube
python local_train.py # save model python local_train.py # save model
``` ```
Next, you can use quantization and non-quantization to generate Sequence File for Cube sparse parameter indexing. Next, you can use quantization and non-quantization to generate Sequence File for Cube sparse parameter indexing.
...@@ -34,11 +34,11 @@ This command will convert the sparse parameter file SparseFeatFactors in the ctr ...@@ -34,11 +34,11 @@ This command will convert the sparse parameter file SparseFeatFactors in the ctr
## Launch Serving by Quantized Model ## Launch Serving by Quantized Model
In Serving, a quantized model is used when using general_dist_kv_quant_infer op to make predictions. See python/examples/criteo_ctr_with_cube/test_server_quant.py for details. No changes are required on the client side. In Serving, a quantized model is used when using general_dist_kv_quant_infer op to make predictions. See Serving/examples/C++/PaddleRec/criteo_ctr_with_cube/test_server_quant.py for details. No changes are required on the client side.
In order to make the demo easier for users, the following script is to train the quantized criteo ctr model and launch serving by it. In order to make the demo easier for users, the following script is to train the quantized criteo ctr model and launch serving by it.
``` ```
cd python/examples/criteo_ctr_with_cube cd Serving/examples/C++/PaddleRec/criteo_ctr_with_cube
python local_train.py python local_train.py
cp ../../../build_server/core/predictor/seq_generator seq_generator cp ../../../build_server/core/predictor/seq_generator seq_generator
cp ../../../build_server/output/bin/cube* ./cube/ cp ../../../build_server/output/bin/cube* ./cube/
......
...@@ -2,7 +2,7 @@ ...@@ -2,7 +2,7 @@
### 背景知识 ### 背景知识
推荐系统需要大规模稀疏参数索引来帮助分布式部署,可在`python/example/criteo_ctr_with_cube`或是[PaddleRec](https://github.com/paddlepaddle/paddlerec)了解推荐模型。 推荐系统需要大规模稀疏参数索引来帮助分布式部署,可在`Serving/examples/C++/PaddleRec/criteo_ctr_with_cube`或是[PaddleRec](https://github.com/paddlepaddle/paddlerec)了解推荐模型。
稀疏参数索引的模型格式是SequenceFile,源自Hadoop生态的键值对格式文件。 稀疏参数索引的模型格式是SequenceFile,源自Hadoop生态的键值对格式文件。
...@@ -13,7 +13,7 @@ ...@@ -13,7 +13,7 @@
### 预备知识 ### 预备知识
- 需要会编译Paddle Serving,参见[编译文档](./COMPILE.md) - 需要会编译Paddle Serving,参见[编译文档](./Compile_EN.md)
### 用法 ### 用法
......
# Docker 镜像 # Docker 镜像
(简体中文|[English](DOCKER_IMAGES.md)) (简体中文|[English](Docker_Images_EN.md))
该文档维护了 Paddle Serving 提供的镜像列表。 该文档维护了 Paddle Serving 提供的镜像列表。
...@@ -39,7 +39,7 @@ ...@@ -39,7 +39,7 @@
| GPU (cuda10.1-cudnn7-tensorRT6-gcc54) development | Ubuntu16 | latest-cuda10.1-cudnn7-gcc54-devel (not ready) | [Dockerfile.cuda10.1-cudnn7-gcc54.devel](../tools/Dockerfile.cuda10.1-cudnn7-gcc54.devel) | | GPU (cuda10.1-cudnn7-tensorRT6-gcc54) development | Ubuntu16 | latest-cuda10.1-cudnn7-gcc54-devel (not ready) | [Dockerfile.cuda10.1-cudnn7-gcc54.devel](../tools/Dockerfile.cuda10.1-cudnn7-gcc54.devel) |
| GPU (cuda10.1-cudnn7-tensorRT6) development | Ubuntu16 | latest-cuda10.1-cudnn7-devel | [Dockerfile.cuda10.1-cudnn7.devel](../tools/Dockerfile.cuda10.1-cudnn7.devel) | | GPU (cuda10.1-cudnn7-tensorRT6) development | Ubuntu16 | latest-cuda10.1-cudnn7-devel | [Dockerfile.cuda10.1-cudnn7.devel](../tools/Dockerfile.cuda10.1-cudnn7.devel) |
| GPU (cuda10.2-cudnn8-tensorRT7) development | Ubuntu16 | latest-cuda10.2-cudnn8-devel | [Dockerfile.cuda10.2-cudnn8.devel](../tools/Dockerfile.cuda10.2-cudnn8.devel) | | GPU (cuda10.2-cudnn8-tensorRT7) development | Ubuntu16 | latest-cuda10.2-cudnn8-devel | [Dockerfile.cuda10.2-cudnn8.devel](../tools/Dockerfile.cuda10.2-cudnn8.devel) |
| GPU (cuda11-cudnn8-tensorRT7) development | Ubuntu18 | latest-cuda11-cudnn8-devel | [Dockerfile.cuda11-cudnn8.devel](../tools/Dockerfile.cuda11-cudnn8.devel) | | GPU (cuda11.2-cudnn8-tensorRT7) development | Ubuntu18 | latest-cuda11.2-cudnn8-devel | [Dockerfile.cuda11.2-cudnn8.devel](../tools/Dockerfile.cuda11.2-cudnn8.devel) |
**Java镜像:** **Java镜像:**
``` ```
...@@ -80,7 +80,7 @@ registry.baidubce.com/paddlepaddle/serving:xpu-x86 # for x86 xpu user ...@@ -80,7 +80,7 @@ registry.baidubce.com/paddlepaddle/serving:xpu-x86 # for x86 xpu user
运行镜像: 运行镜像:
运行镜像比开发镜像更加轻量化, 运行镜像提供了serving的whl和bin,但为了运行期更小的镜像体积,没有提供诸如cmake这样但开发工具。 如果您想了解有关信息,请检查文档[在Kubernetes上使用Paddle Serving](PADDLE_SERVING_ON_KUBERNETES.md) 运行镜像比开发镜像更加轻量化, 运行镜像提供了serving的whl和bin,但为了运行期更小的镜像体积,没有提供诸如cmake这样但开发工具。 如果您想了解有关信息,请检查文档[在Kubernetes上使用Paddle Serving](./Run_On_Kubernetes_CN.md)
| ENV | Python Version | Tag | | ENV | Python Version | Tag |
|------------------------------------------|----------------|-----------------------------| |------------------------------------------|----------------|-----------------------------|
......
# Docker Images # Docker Images
([简体中文](DOCKER_IMAGES_CN.md)|English) ([简体中文](Docker_Images_CN.md)|English)
This document maintains a list of docker images provided by Paddle Serving. This document maintains a list of docker images provided by Paddle Serving.
...@@ -37,7 +37,7 @@ If you want to customize your Serving based on source code, use the version with ...@@ -37,7 +37,7 @@ If you want to customize your Serving based on source code, use the version with
| GPU (cuda10.1-cudnn7-tensorRT6-gcc54) development | Ubuntu16 | latest-cuda10.1-cudnn7-gcc54-devel(not ready) | [Dockerfile.cuda10.1-cudnn7-gcc54.devel](../tools/Dockerfile.cuda10.1-cudnn7-gcc54.devel) | | GPU (cuda10.1-cudnn7-tensorRT6-gcc54) development | Ubuntu16 | latest-cuda10.1-cudnn7-gcc54-devel(not ready) | [Dockerfile.cuda10.1-cudnn7-gcc54.devel](../tools/Dockerfile.cuda10.1-cudnn7-gcc54.devel) |
| GPU (cuda10.1-cudnn7-tensorRT6) development | Ubuntu16 | latest-cuda10.1-cudnn7-devel | [Dockerfile.cuda10.1-cudnn7.devel](../tools/Dockerfile.cuda10.1-cudnn7.devel) | | GPU (cuda10.1-cudnn7-tensorRT6) development | Ubuntu16 | latest-cuda10.1-cudnn7-devel | [Dockerfile.cuda10.1-cudnn7.devel](../tools/Dockerfile.cuda10.1-cudnn7.devel) |
| GPU (cuda10.2-cudnn8-tensorRT7) development | Ubuntu16 | latest-cuda10.2-cudnn8-devel | [Dockerfile.cuda10.2-cudnn8.devel](../tools/Dockerfile.cuda10.2-cudnn8.devel) | | GPU (cuda10.2-cudnn8-tensorRT7) development | Ubuntu16 | latest-cuda10.2-cudnn8-devel | [Dockerfile.cuda10.2-cudnn8.devel](../tools/Dockerfile.cuda10.2-cudnn8.devel) |
| GPU (cuda11-cudnn8-tensorRT7) development | Ubuntu18 | latest-cuda11-cudnn8-devel | [Dockerfile.cuda11-cudnn8.devel](../tools/Dockerfile.cuda11-cudnn8.devel) | | GPU (cuda11.2-cudnn8-tensorRT7) development | Ubuntu18 | latest-cuda11.2-cudnn8-devel | [Dockerfile.cuda11.2-cudnn8.devel](../tools/Dockerfile.cuda11.2-cudnn8.devel) |
**Java Client:** **Java Client:**
``` ```
...@@ -76,7 +76,8 @@ Develop Images: ...@@ -76,7 +76,8 @@ Develop Images:
Running Images: Running Images:
Running Images is lighter than Develop Images, and Running Images are made up with serving whl and bin, but without develop tools like cmake because of lower image size. If you want to know about it, plese check the document [Paddle Serving on Kubernetes.](PADDLE_SERVING_ON_KUBERNETES.md). Running Images is lighter than Develop Images, and Running Images are made up with serving whl and bin, but without develop tools like cmake because of lower image size. If you want to know about it, plese check the document [Paddle Serving on Kubernetes.](./Run_On_Kubernetes_CN.md).
| ENV | Python Version | Tag | | ENV | Python Version | Tag |
|------------------------------------------|----------------|-----------------------------| |------------------------------------------|----------------|-----------------------------|
......
...@@ -142,7 +142,7 @@ make: *** [all] Error 2 ...@@ -142,7 +142,7 @@ make: *** [all] Error 2
#### Q:使用过程中出现CXXABI错误。 #### Q:使用过程中出现CXXABI错误。
这个问题出现的原因是Python使用的gcc版本和Serving所需的gcc版本对不上。对于Docker用户,推荐使用[Docker容器](./RUN_IN_DOCKER_CN.md),由于Docker容器内的Python版本与Serving在发布前都做过适配,这样就不会出现类似的错误。如果是其他开发环境,首先需要确保开发环境中具备GCC 8.2,如果没有gcc 8.2,参考安装方式 这个问题出现的原因是Python使用的gcc版本和Serving所需的gcc版本对不上。对于Docker用户,推荐使用[Docker容器](./Run_In_Docker_CN.md),由于Docker容器内的Python版本与Serving在发布前都做过适配,这样就不会出现类似的错误。如果是其他开发环境,首先需要确保开发环境中具备GCC 8.2,如果没有gcc 8.2,参考安装方式
```bash ```bash
wget -q https://paddle-ci.gz.bcebos.com/gcc-8.2.0.tar.xz wget -q https://paddle-ci.gz.bcebos.com/gcc-8.2.0.tar.xz
...@@ -198,7 +198,7 @@ wget https://paddle-serving.bj.bcebos.com/others/centos_ssl.tar && \ ...@@ -198,7 +198,7 @@ wget https://paddle-serving.bj.bcebos.com/others/centos_ssl.tar && \
(1)Cuda显卡驱动:文件名通常为 `libcuda.so.$DRIVER_VERSION` 例如驱动版本为440.10.15,文件名就是`libcuda.so.440.10.15` (1)Cuda显卡驱动:文件名通常为 `libcuda.so.$DRIVER_VERSION` 例如驱动版本为440.10.15,文件名就是`libcuda.so.440.10.15`
(2)Cuda和Cudnn动态库:文件名通常为 `libcudart.so.$CUDA_VERSION`,和 `libcudnn.so.$CUDNN_VERSION`。例如Cuda9就是 `libcudart.so.9.0`,Cudnn7就是 `libcudnn.so.7`。Cuda和Cudnn与Serving的版本匹配参见[Serving所有镜像列表](DOCKER_IMAGES_CN.md#%E9%99%84%E5%BD%95%E6%89%80%E6%9C%89%E9%95%9C%E5%83%8F%E5%88%97%E8%A1%A8). (2)Cuda和Cudnn动态库:文件名通常为 `libcudart.so.$CUDA_VERSION`,和 `libcudnn.so.$CUDNN_VERSION`。例如Cuda9就是 `libcudart.so.9.0`,Cudnn7就是 `libcudnn.so.7`。Cuda和Cudnn与Serving的版本匹配参见[Serving所有镜像列表](Docker_Images_CN.md#%E9%99%84%E5%BD%95%E6%89%80%E6%9C%89%E9%95%9C%E5%83%8F%E5%88%97%E8%A1%A8).
(3) Cuda10.1及更高版本需要TensorRT。安装TensorRT相关文件的脚本参考 [install_trt.sh](../tools/dockerfiles/build_scripts/install_trt.sh). (3) Cuda10.1及更高版本需要TensorRT。安装TensorRT相关文件的脚本参考 [install_trt.sh](../tools/dockerfiles/build_scripts/install_trt.sh).
...@@ -232,15 +232,15 @@ InvalidArgumentError: Device id must be less than GPU count, but received id is: ...@@ -232,15 +232,15 @@ InvalidArgumentError: Device id must be less than GPU count, but received id is:
#### Q: 目前Paddle Serving支持哪些镜像环境? #### Q: 目前Paddle Serving支持哪些镜像环境?
**A:** 目前(0.4.0)仅支持CentOS,具体列表查阅[这里](https://github.com/PaddlePaddle/Serving/blob/develop/doc/DOCKER_IMAGES.md) **A:** 目前(0.4.0)仅支持CentOS,具体列表查阅[这里](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Docker_Images_CN.md)
#### Q: python编译的GCC版本与serving的版本不匹配 #### Q: python编译的GCC版本与serving的版本不匹配
**A:**:1)使用[GPU docker](https://github.com/PaddlePaddle/Serving/blob/develop/doc/RUN_IN_DOCKER.md#gpunvidia-docker)解决环境问题;2)修改anaconda的虚拟环境下安装的python的gcc版本[改变python的GCC编译环境](https://www.jianshu.com/p/c498b3d86f77) **A:**:1)使用[GPU docker](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Run_In_Docker_CN.md#gpunvidia-docker)解决环境问题;2)修改anaconda的虚拟环境下安装的python的gcc版本[改变python的GCC编译环境](https://www.jianshu.com/p/c498b3d86f77)
#### Q: paddle-serving是否支持本地离线安装 #### Q: paddle-serving是否支持本地离线安装
**A:** 支持离线部署,需要把一些相关的[依赖包](https://github.com/PaddlePaddle/Serving/blob/develop/doc/COMPILE.md)提前准备安装好 **A:** 支持离线部署,需要把一些相关的[依赖包](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Compile_CN.md)提前准备安装好
#### Q: Docker中启动server IP地址 127.0.0.1 与 0.0.0.0 差异 #### Q: Docker中启动server IP地址 127.0.0.1 与 0.0.0.0 差异
**A:** 您必须将容器的主进程设置为绑定到特殊的 0.0.0.0 “所有接口”地址,否则它将无法从容器外部访问。在Docker中 127.0.0.1 代表“这个容器”,而不是“这台机器”。如果您从容器建立到 127.0.0.1 的出站连接,它将返回到同一个容器;如果您将服务器绑定到 127.0.0.1,接收不到来自外部的连接。 **A:** 您必须将容器的主进程设置为绑定到特殊的 0.0.0.0 “所有接口”地址,否则它将无法从容器外部访问。在Docker中 127.0.0.1 代表“这个容器”,而不是“这台机器”。如果您从容器建立到 127.0.0.1 的出站连接,它将返回到同一个容器;如果您将服务器绑定到 127.0.0.1,接收不到来自外部的连接。
...@@ -280,7 +280,7 @@ client.connect(["127.0.0.1:9393"]) ...@@ -280,7 +280,7 @@ client.connect(["127.0.0.1:9393"])
#### Q: 如何使用多语言客户端 #### Q: 如何使用多语言客户端
**A:** 多语言客户端要与多语言服务端配套使用。当前版本下(0.4.0),服务端需要将Server改为MultiLangServer(如果是以命令行启动的话只需要添加--use_multilang参数),Python客户端需要将Client改为MultiLangClient,同时去除load_client_config的过程。[Java客户端参考文档](https://github.com/PaddlePaddle/Serving/blob/develop/doc/JAVA_SDK_CN.md) **A:** 多语言客户端要与多语言服务端配套使用。当前版本下(0.4.0),服务端需要将Server改为MultiLangServer(如果是以命令行启动的话只需要添加--use_multilang参数),Python客户端需要将Client改为MultiLangClient,同时去除load_client_config的过程。[Java客户端参考文档](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Java_SDK_CN.md)
#### Q: 如何在Windows下使用Paddle Serving #### Q: 如何在Windows下使用Paddle Serving
......
# gRPC接口使用介绍
- [1.与bRPC接口对比](#1与brpc接口对比)
- [1.1 服务端对比](#11-服务端对比)
- [1.2 客服端对比](#12-客服端对比)
- [1.3 其他](#13-其他)
- [2.示例:线性回归预测服务](#2示例线性回归预测服务)
- [获取数据](#获取数据)
- [开启 gRPC 服务端](#开启-grpc-服务端)
- [客户端预测](#客户端预测)
- [同步预测](#同步预测)
- [异步预测](#异步预测)
- [Batch 预测](#batch-预测)
- [通用 pb 预测](#通用-pb-预测)
- [预测超时](#预测超时)
- [List 输入](#list-输入)
- [3.更多示例](#3更多示例)
使用gRPC接口,Client端可以在Win/Linux/MacOS平台上调用不同语言。gRPC 接口实现结构如下:
![](images/grpc_impl.png)
## 1.与bRPC接口对比
#### 1.1 服务端对比
* 由于gRPC Server 端实际包含了brpc-Client端的,因此brpc-Client的初始化过程是在gRPC Server 端实现的,所以gRPC Server 端 `load_model_config` 函数添加 `client_config_path` 参数,用于指定brpc-Client初始化过程中的传输数据格式配置文件路径(`client_config_path` 参数未指定时默认为None,此时`client_config_path``load_model_config` 函数中被默认为 `<server_config_path>/serving_server_conf.prototxt`,此时brpc-Client与brpc-Server的传输数据格式配置文件相同)
```
def load_model_config(self, server_config_paths, client_config_path=None)
```
在一些例子中 bRPC Server 端与 bRPC Client 端的配置文件可能不同(如 在cube local 中,Client 端的数据先交给 cube,经过 cube 处理后再交给预测库),此时 gRPC Server 端需要手动设置 gRPC Client 端的配置`client_config_path`
#### 1.2 客服端对比
* gRPC Client 端取消 `load_client_config` 步骤:
`connect` 步骤通过 RPC 获取相应的 prototxt(从任意一个 endpoint 获取即可)。
* gRPC Client 需要通过 RPC 方式设置 timeout 时间(调用形式与 bRPC Client保持一致)
因为 bRPC Client 在 `connect` 后无法更改 timeout 时间,所以当 gRPC Server 收到变更 timeout 的调用请求时会重新创建 bRPC Client 实例以变更 bRPC Client timeout时间,同时 gRPC Client 会设置 gRPC 的 deadline 时间。
**注意,设置 timeout 接口和 Inference 接口不能同时调用(非线程安全),出于性能考虑暂时不加锁。**
* gRPC Client 端 `predict` 函数添加 `asyn``is_python` 参数:
```
def predict(self, feed, fetch, batch=True, need_variant_tag=False, asyn=False, is_python=True,log_id=0)
```
1. `asyn` 为异步调用选项。当 `asyn=True` 时为异步调用,返回 `MultiLangPredictFuture` 对象,通过 `MultiLangPredictFuture.result()` 阻塞获取预测值;当 `asyn=Fasle` 为同步调用。
2. `is_python` 为 proto 格式选项。当 `is_python=True` 时,基于 numpy bytes 格式进行数据传输,目前只适用于 Python;当 `is_python=False` 时,以普通数据格式传输,更加通用。使用 numpy bytes 格式传输耗时比普通数据格式小很多(详见 [#654](https://github.com/PaddlePaddle/Serving/pull/654))。
3. `batch`为数据是否需要进行增维处理的选项。当`batch=True`时,feed数据不需要额外的处理,维持原有维度;当`batch=False`时,会对数据进行增维度处理。例如:feed.shape原始为[2,2],当`batch=False`时,会将feed.reshape为[1,2,2]。
#### 1.3 其他
* 异常处理:当 gRPC Server 端的 bRPC Client 预测失败(返回 `None`)时,gRPC Client 端同样返回None。其他 gRPC 异常会在 Client 内部捕获,并在返回的 fetch_map 中添加一个 "status_code" 字段来区分是否预测正常(参考 timeout 样例)。
* 由于 gRPC 只支持 pick_first 和 round_robin 负载均衡策略,ABTEST 特性还未打齐。
* 系统兼容性:
* [x] CentOS
* [x] macOS
* [x] Windows
* 已经支持的客户端语言:
- Python
- Java
- Go
## 2.示例:线性回归预测服务
以下是采用gRPC实现的关于线性回归预测的一个示例,具体代码详见此[链接](../python/examples/grpc_impl_example/fit_a_line)
#### 获取数据
```shell
sh get_data.sh
```
#### 开启 gRPC 服务端
``` shell
python test_server.py uci_housing_model/
```
也可以通过下面的一行代码开启默认 gRPC 服务:
```shell
python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9393 --use_multilang
```
注:--use_multilang参数用来启用多语言客户端
### 客户端预测
#### 同步预测
``` shell
python test_sync_client.py
```
#### 异步预测
``` shell
python test_asyn_client.py
```
#### Batch 预测
``` shell
python test_batch_client.py
```
#### 预测超时
``` shell
python test_timeout_client.py
```
## 3.更多示例
详见[`python/examples/grpc_impl_example`](../python/examples/grpc_impl_example)下的示例文件。
# 如何在Paddle Serving使用Go Client # 如何在Paddle Serving使用Go Client
(简体中文|[English](./IMDB_GO_CLIENT.md)) (简体中文|[English](./Imdb_GO_Client_EN.md))
本文档说明了如何将Go用作客户端语言。对于Paddle Serving中的Go客户端,提供了一个简单的客户端程序包https://github.com/PaddlePaddle/Serving/tree/develop/go/serving_client, 用户可以根据需要引用该程序包。这是一个基于IMDB数据集的情感分析任务的简单示例。 本文档说明了如何将Go用作客户端语言。对于Paddle Serving中的Go客户端,提供了一个简单的客户端程序包https://github.com/PaddlePaddle/Serving/tree/develop/go/serving_client, 用户可以根据需要引用该程序包。这是一个基于IMDB数据集的情感分析任务的简单示例。
......
# How to use Go Client of Paddle Serving # How to use Go Client of Paddle Serving
([简体中文](./IMDB_GO_CLIENT_CN.md)|English) ([简体中文](./Imdb_GO_Client_CN.md)|English)
This document shows how to use Go as your client language. For Go client in Paddle Serving, a simple client package is provided https://github.com/PaddlePaddle/Serving/tree/develop/go/serving_client, a user can import this package as needed. Here is a simple example of sentiment analysis task based on IMDB dataset. This document shows how to use Go as your client language. For Go client in Paddle Serving, a simple client package is provided https://github.com/PaddlePaddle/Serving/tree/develop/go/serving_client, a user can import this package as needed. Here is a simple example of sentiment analysis task based on IMDB dataset.
......
# 使用Docker安装Paddle Serving
(简体中文|[English](./Install_EN.md))
**强烈建议**您在**Docker内构建**Paddle Serving,请查看[如何在Docker中运行PaddleServing](Run_In_Docker_CN.md)。更多镜像请查看[Docker镜像列表](Docker_Images_CN.md)
**提示-1**:本项目仅支持<mark>**Python3.6/3.7/3.8**</mark>,接下来所有的与Python/Pip相关的操作都需要选择正确的Python版本。
**提示-2**:以下示例中GPU环境均为cuda10.2-cudnn7,如果您使用Python Pipeline来部署,并需要Nvidia TensorRT来优化预测性能,请参考[支持的镜像环境和说明](#4支持的镜像环境和说明)来选择其他版本。
## 1.启动开发镜像
<mark>**同时支持使用Serving镜像和Paddle镜像,1.1和1.2章节中的操作2选1即可。**</mark>
### 1.1 Serving开发镜像(CPU/GPU 2选1)
**CPU:**
```
# 启动 CPU Docker
docker pull paddlepaddle/serving:0.7.0-devel
docker run -p 9292:9292 --name test -dit paddlepaddle/serving:0.7.0-devel bash
docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
```
**GPU:**
```
# 启动 GPU Docker
docker pull paddlepaddle/serving:0.7.0-cuda10.2-cudnn7-devel
nvidia-docker run -p 9292:9292 --name test -dit paddlepaddle/serving:0.7.0-cuda10.2-cudnn7-devel bash
nvidia-docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
```
### 1.2 Paddle开发镜像(CPU/GPU 2选1)
**CPU:**
```
# 启动 CPU Docker
docker pull paddlepaddle/paddle:2.2.0
docker run -p 9292:9292 --name test -dit paddlepaddle/paddle:2.2.0 bash
docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
# Paddle开发镜像需要执行以下脚本增加Serving所需依赖项
bash Serving/tools/paddle_env_install.sh
```
**GPU:**
```
# 启动 GPU Docker
docker pull paddlepaddle/paddle:2.2.0-cuda10.2-cudnn7
nvidia-docker run -p 9292:9292 --name test -dit paddlepaddle/paddle:2.2.0-cuda10.2-cudnn7 bash
nvidia-docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
# Paddle开发镜像需要执行以下脚本增加Serving所需依赖项
bash Serving/tools/paddle_env_install.sh
```
## 2.安装Paddle Serving相关Python库
安装所需的pip依赖
```
cd Serving
pip3 install -r python/requirements.txt
```
```shell
pip3 install paddle-serving-client==0.7.0
pip3 install paddle-serving-server==0.7.0 # CPU
pip3 install paddle-serving-app==0.7.0
pip3 install paddle-serving-server-gpu==0.7.0.post102 #GPU with CUDA10.2 + TensorRT6
# 其他GPU环境需要确认环境再选择执行哪一条
pip3 install paddle-serving-server-gpu==0.7.0.post101 # GPU with CUDA10.1 + TensorRT6
pip3 install paddle-serving-server-gpu==0.7.0.post112 # GPU with CUDA11.2 + TensorRT8
```
您可能需要使用国内镜像源(例如清华源, 在pip命令中添加`-i https://pypi.tuna.tsinghua.edu.cn/simple`)来加速下载。
如果需要使用develop分支编译的安装包,请从[最新安装包列表](./Latest_Packages_CN.md)中获取下载地址进行下载,使用`pip install`命令进行安装。如果您想自行编译,请参照[Paddle Serving编译文档](./Compile_CN.md)
paddle-serving-server和paddle-serving-server-gpu安装包支持Centos 6/7, Ubuntu 16/18和Windows 10。
paddle-serving-client和paddle-serving-app安装包支持Linux和Windows,其中paddle-serving-client仅支持python3.6/3.7/3.8。
## 3.安装Paddle相关Python库
**当您使用`paddle_serving_client.convert`命令或者`Python Pipeline框架`时才需要安装。**
```
# CPU环境请执行
pip3 install paddlepaddle==2.2.0
# GPU Cuda10.2环境请执行
pip3 install paddlepaddle-gpu==2.2.0
```
**注意**: 如果您的Cuda版本不是10.2,请勿直接执行上述命令,需要参考[Paddle-Inference官方文档-下载安装Linux预测库](https://paddleinference.paddlepaddle.org.cn/master/user_guides/download_lib.html#python)选择相应的GPU环境的url链接并进行安装。
例如Cuda 10.1的Python3.6用户,请选择表格当中的`cp36-cp36m``linux-cuda10.1-cudnn7.6-trt6-gcc8.2`对应的url,复制下来并执行
```
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.2.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.1_cudnn7.6.5_trt6.0.1.5/paddlepaddle_gpu-2.2.0.post101-cp36-cp36m-linux_x86_64.whl
```
## 4.支持的镜像环境和说明
| 环境 | Serving开发镜像Tag | 操作系统 | Paddle开发镜像Tag | 操作系统 |
| :--------------------------: | :-------------------------------: | :-------------: | :-------------------: | :----------------: |
| CPU | 0.7.0-devel | Ubuntu 16.04 | 2.2.0 | Ubuntu 18.04. |
| Cuda10.1+Cudnn7 | 0.7.0-cuda10.1-cudnn7-devel | Ubuntu 16.04 | 无 | 无 |
| Cuda10.2+Cudnn7 | 0.7.0-cuda10.2-cudnn7-devel | Ubuntu 16.04 | 2.2.0-cuda10.2-cudnn7 | Ubuntu 16.04 |
| Cuda10.2+Cudnn8 | 0.7.0-cuda10.2-cudnn8-devel | Ubuntu 16.04 | 无 | 无 |
| Cuda11.2+Cudnn8 | 0.7.0-cuda11.2-cudnn8-devel | Ubuntu 16.04 | 2.2.0-cuda11.2-cudnn8 | Ubuntu 18.04 |
对于**Windows 10 用户**,请参考文档[Windows平台使用Paddle Serving指导](Windows_Tutorial_CN.md)
# Install Paddle Serving with Docker
([简体中文](./Install_CN.md)|English)
**Strongly recommend** you build **Paddle Serving** in Docker, please check [How to run PaddleServing in Docker](Run_In_Docker_CN.md). For more images, please refer to [Docker Image List](Docker_Images_CN.md).
**Tip-1**: This project only supports <mark>**Python3.6/3.7/3.8**</mark>, all subsequent operations related to Python/Pip need to select the correct Python version.
**Tip-2**: The GPU environments in the following examples are all cuda10.2-cudnn7. If you use Python Pipeline to deploy and need Nvidia TensorRT to optimize prediction performance, please refer to [Supported Mirroring Environment and Instructions](#4.-Supported-Docker-Images-and-Instruction) to choose other versions.
## 1. Start the Docker Container
<mark>**Both Serving Dev Image and Paddle Dev Image are supported at the same time. You can choose 1 from the operation 2 in chapters 1.1 and 1.2.**</mark>
### 1.1 Serving Dev Images (CPU/GPU 2 choose 1)
**CPU:**
```
# Start CPU Docker Container
docker pull paddlepaddle/serving:0.7.0-devel
docker run -p 9292:9292 --name test -dit paddlepaddle/serving:0.7.0-devel bash
docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
```
**GPU:**
```
# Start GPU Docker Container
docker pull paddlepaddle/serving:0.7.0-cuda10.2-cudnn7-devel
nvidia-docker run -p 9292:9292 --name test -dit paddlepaddle/serving:0.7.0-cuda10.2-cudnn7-devel bash
nvidia-docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
```
### 1.2 Paddle Dev Images (choose any codeblock of CPU/GPU)
**CPU:**
```
# Start CPU Docker Container
docker pull paddlepaddle/paddle:2.2.0
docker run -p 9292:9292 --name test -dit paddlepaddle/paddle:2.2.0 bash
docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
# Paddle dev image needs to run the following script to increase the dependencies required by Serving
bash Serving/tools/paddle_env_install.sh
```
**GPU:**
```
# Start GPU Docker
docker pull paddlepaddle/paddle:2.2.0-cuda10.2-cudnn7
nvidia-docker run -p 9292:9292 --name test -dit paddlepaddle/paddle:2.2.0-cuda10.2-cudnn7 bash
nvidia-docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
# Paddle development image needs to execute the following script to increase the dependencies required by Serving
bash Serving/tools/paddle_env_install.sh
```
## 2. Install Paddle Serving related whl Packages
Install the required pip dependencies
```
cd Serving
pip3 install -r python/requirements.txt
```
```shell
pip3 install paddle-serving-client==0.7.0
pip3 install paddle-serving-server==0.7.0 # CPU
pip3 install paddle-serving-app==0.7.0
pip3 install paddle-serving-server-gpu==0.7.0.post102 #GPU with CUDA10.2 + TensorRT6
# Other GPU environments need to confirm the environment before choosing which one to execute
pip3 install paddle-serving-server-gpu==0.7.0.post101 # GPU with CUDA10.1 + TensorRT6
pip3 install paddle-serving-server-gpu==0.7.0.post112 # GPU with CUDA11.2 + TensorRT8
```
If you are in China, You may need to use a chinese mirror source (such as Tsinghua source, add `-i https://pypi.tuna.tsinghua.edu.cn/simple` to the pip command) to speed up the download.
If you need to use the installation package compiled by the develop branch, please download the download address from [Latest installation package list](./Latest_Packages_CN.md), and use the `pip install` command to install. If you want to compile by yourself, please refer to [Paddle Serving Compilation Document](./Compile_CN.md).
The paddle-serving-server and paddle-serving-server-gpu installation packages support Centos 6/7, Ubuntu 16/18 and Windows 10.
The paddle-serving-client and paddle-serving-app installation packages support Linux and Windows, and paddle-serving-client only supports python3.6/3.7/3.8.
## 3. Install Paddle related Python libraries
**You only need to install it when you use the `paddle_serving_client.convert` command or the `Python Pipeline framework`. **
```
# CPU environment please execute
pip3 install paddlepaddle==2.2.0
# GPU Cuda10.2 environment please execute
pip3 install paddlepaddle-gpu==2.2.0
```
**Note**: If your Cuda version is not 10.2, please do not execute the above commands directly, you need to refer to [Paddle-Inference official document-download and install the Linux prediction library](https://paddleinference.paddlepaddle.org.cn/master/user_guides/download_lib.html#python) Select the URL link of the corresponding GPU environment and install it.
For example, for Python3.6 users of Cuda 10.1, please select the URL corresponding to `cp36-cp36m` and `linux-cuda10.1-cudnn7.6-trt6-gcc8.2` in the table, copy it and execute
```
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.2.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.1_cudnn7.6.5_trt6.0.1.5/paddlepaddle_gpu-2.2.0.post101 -cp36-cp36m-linux_x86_64.whl
```
## 4. Supported Docker Images and Instruction
| Environment | Serving Development Image Tag | Operating System | Paddle Development Image Tag | Operating System |
| :--------------------------: | :-------------------------------: | :-------------: | :-------------------: | :----------------: |
| CPU | 0.7.0-devel | Ubuntu 16.04 | 2.2.0 | Ubuntu 18.04. |
| Cuda10.1+Cudnn7 | 0.7.0-cuda10.1-cudnn7-devel | Ubuntu 16.04 | 无 | 无 |
| Cuda10.2+Cudnn7 | 0.7.0-cuda10.2-cudnn7-devel | Ubuntu 16.04 | 2.2.0-cuda10.2-cudnn7 | Ubuntu 16.04 |
| Cuda10.2+Cudnn8 | 0.7.0-cuda10.2-cudnn8-devel | Ubuntu 16.04 | 无 | 无 |
| Cuda11.2+Cudnn8 | 0.7.0-cuda11.2-cudnn8-devel | Ubuntu 16.04 | 2.2.0-cuda11.2-cudnn8 | Ubuntu 18.04 |
For **Windows 10 users**, please refer to the document [Paddle Serving Guide for Windows Platform](Windows_Tutorial_CN.md).
# Paddle Serving Client Java SDK # Paddle Serving Client Java SDK
(简体中文|[English](JAVA_SDK.md)) (简体中文|[English](Java_SDK_EN.md))
Paddle Serving 提供了 Java SDK,支持 Client 端用 Java 语言进行预测,本文档说明了如何使用 Java SDK。 Paddle Serving 提供了 Java SDK,支持 Client 端用 Java 语言进行预测,本文档说明了如何使用 Java SDK。
......
# Paddle Serving Client Java SDK # Paddle Serving Client Java SDK
([简体中文](JAVA_SDK_CN.md)|English) ([简体中文](Java_SDK_CN.md)|English)
Paddle Serving provides Java SDK,which supports predict on the Client side with Java language. This document shows how to use the Java SDK. Paddle Serving provides Java SDK,which supports predict on the Client side with Java language. This document shows how to use the Java SDK.
......
# Lod字段说明 # Lod字段说明
(简体中文|[English](LOD.md)) (简体中文|[English](LOD_EN.md))
## 概念 ## 概念
......
...@@ -41,7 +41,7 @@ https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_app-0.0.0-py3-n ...@@ -41,7 +41,7 @@ https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_app-0.0.0-py3-n
``` ```
## Baidu Kunlun user ## Baidu Kunlun user
for kunlun user who uses arm-xpu or x86-xpu can download the wheel packages as follows. Users should use the xpu-beta docker [DOCKER IMAGES](./DOCKER_IMAGES.md) for kunlun user who uses arm-xpu or x86-xpu can download the wheel packages as follows. Users should use the xpu-beta docker [DOCKER IMAGES](./Docker_Images_CN.md)
**We only support Python 3.6 for Kunlun Users.** **We only support Python 3.6 for Kunlun Users.**
### Wheel Package Links ### Wheel Package Links
......
# Paddle Serving低精度部署 ## Paddle Serving低精度部署
(简体中文|[English](./LOW_PRECISION_DEPLOYMENT.md))
(简体中文|[English](./Low_Precision_EN.md))
低精度部署, 在Intel CPU上支持int8、bfloat16模型,Nvidia TensorRT支持int8、float16模型。 低精度部署, 在Intel CPU上支持int8、bfloat16模型,Nvidia TensorRT支持int8、float16模型。
## 通过PaddleSlim量化生成低精度模型 ### 通过PaddleSlim量化生成低精度模型
详细见[PaddleSlim量化](https://paddleslim.readthedocs.io/zh_CN/latest/tutorials/quant/overview.html) 详细见[PaddleSlim量化](https://paddleslim.readthedocs.io/zh_CN/latest/tutorials/quant/overview.html)
## 使用TensorRT int8加载PaddleSlim Int8量化模型进行部署 ### 使用TensorRT int8加载PaddleSlim Int8量化模型进行部署
首先下载Resnet50 [PaddleSlim量化模型](https://paddle-inference-dist.bj.bcebos.com/inference_demo/python/resnet50/ResNet50_quant.tar.gz),并转换为Paddle Serving支持的部署模型格式。 首先下载Resnet50 [PaddleSlim量化模型](https://paddle-inference-dist.bj.bcebos.com/inference_demo/python/resnet50/ResNet50_quant.tar.gz),并转换为Paddle Serving支持的部署模型格式。
``` ```
wget https://paddle-inference-dist.bj.bcebos.com/inference_demo/python/resnet50/ResNet50_quant.tar.gz wget https://paddle-inference-dist.bj.bcebos.com/inference_demo/python/resnet50/ResNet50_quant.tar.gz
...@@ -40,7 +41,7 @@ fetch_map = client.predict(feed={"image": img}, fetch=["score"]) ...@@ -40,7 +41,7 @@ fetch_map = client.predict(feed={"image": img}, fetch=["score"])
print(fetch_map["score"].reshape(-1)) print(fetch_map["score"].reshape(-1))
``` ```
## 参考文档 ### 参考文档
* [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim) * [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim)
* PaddleInference Intel CPU部署量化模型[文档](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_x86_cpu_int8.html) * PaddleInference Intel CPU部署量化模型[文档](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_x86_cpu_int8.html)
* PaddleInference NV GPU部署量化模型[文档](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_trt.html) * PaddleInference NV GPU部署量化模型[文档](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_trt.html)
# Low-Precision Deployment for Paddle Serving ## Low-Precision Deployment for Paddle Serving
(English|[简体中文](./LOW_PRECISION_DEPLOYMENT_CN.md))
(English|[简体中文](./Low_Precision_CN.md))
Intel CPU supports int8 and bfloat16 models, NVIDIA TensorRT supports int8 and float16 models. Intel CPU supports int8 and bfloat16 models, NVIDIA TensorRT supports int8 and float16 models.
## Obtain the quantized model through PaddleSlim tool ### Obtain the quantized model through PaddleSlim tool
Train the low-precision models please refer to [PaddleSlim](https://paddleslim.readthedocs.io/zh_CN/latest/tutorials/quant/overview.html). Train the low-precision models please refer to [PaddleSlim](https://paddleslim.readthedocs.io/zh_CN/latest/tutorials/quant/overview.html).
## Deploy the quantized model from PaddleSlim using Paddle Serving with Nvidia TensorRT int8 mode ### Deploy the quantized model from PaddleSlim using Paddle Serving with Nvidia TensorRT int8 mode
Firstly, download the [Resnet50 int8 model](https://paddle-inference-dist.bj.bcebos.com/inference_demo/python/resnet50/ResNet50_quant.tar.gz) and convert to Paddle Serving's saved model。 Firstly, download the [Resnet50 int8 model](https://paddle-inference-dist.bj.bcebos.com/inference_demo/python/resnet50/ResNet50_quant.tar.gz) and convert to Paddle Serving's saved model。
``` ```
...@@ -41,7 +42,7 @@ fetch_map = client.predict(feed={"image": img}, fetch=["save_infer_model/scale_0 ...@@ -41,7 +42,7 @@ fetch_map = client.predict(feed={"image": img}, fetch=["save_infer_model/scale_0
print(fetch_map["save_infer_model/scale_0.tmp_0"].reshape(-1)) print(fetch_map["save_infer_model/scale_0.tmp_0"].reshape(-1))
``` ```
## Reference ### Reference
* [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim) * [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim)
* [Deploy the quantized model Using Paddle Inference on Intel CPU](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_x86_cpu_int8.html) * [Deploy the quantized model Using Paddle Inference on Intel CPU](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_x86_cpu_int8.html)
* [Deploy the quantized model Using Paddle Inference on Nvidia GPU](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_trt.html) * [Deploy the quantized model Using Paddle Inference on Nvidia GPU](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_trt.html)
此差异已折叠。
# Model Zoo
(English|[简体中文](./Model_Zoo_CN.md)
This page lists model archives that are pre-trained and pre-packaged, ready to be served for inference with PaddleServing.
To propose a model for inclusion, please submit [pull request](https://github.com/PaddlePaddle/Serving/pulls)
Special thanks to the [Padddle wholechain](https://www.paddlepaddle.org.cn/wholechain) and [PaddleHub](https://www.paddlepaddle.org.cn/hub) whose Model Zoo and Model Examples were used in generating these model archives
| Model | Type | Framework | Download |
| --- | --- | --- | ---- |
| resnet_v2_50_imagenet | PaddleClas | [C++ Serving](../examples/C++/PaddleClas/resnet_v2_50)</br>[Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet_V2_50) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageClassification/resnet_v2_50_imagenet.tar.gz) | Pipeline Serving, C++ Serving|
| mobilenet_v2_imagenet | PaddleClas | [C++ Serving](../examples/C++/PaddleClas/mobilenet) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageClassification/mobilenet_v2_imagenet.tar.gz) |
| resnet50_vd | PaddleClas | [C++ Serving](../examples/C++/PaddleClas/imagenet)</br>[Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet50_vd) | [.tar.gz](https://paddle-serving.bj.bcebos.com/model/ResNet50_vd.tar) |
| ResNet50_vd_KL | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet50_vd_KL) | [.tar](https://paddle-serving.bj.bcebos.com/model/ResNet50_vd_KL.tar) |
| ResNet50_vd_FPGM | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet50_vd_FPGM) | [.tar](https://paddle-serving.bj.bcebos.com/model/ResNet50_vd_FPGM.tar) |
| ResNet50_vd_PACT | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet50_vd_PACT) | [.tar](https://paddle-serving.bj.bcebos.com/model/ResNet50_vd_PACT.tar) |
| ResNeXt101_vd_64x4d | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ResNeXt101_vd_64x4d) | [.tar](https://paddle-serving.bj.bcebos.com/model/ResNeXt101_vd_64x4d.tar) |
| DarkNet53 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/DarkNet53) | [.tar](https://paddle-serving.bj.bcebos.com/model/DarkNet53.tar) |
| MobileNetV1 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/MobileNetV1) | [.tar](https://paddle-serving.bj.bcebos.com/model/MobileNetV1.tar) |
| MobileNetV2 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/MobileNetV2) | [.tar](https://paddle-serving.bj.bcebos.com/model/MobileNetV2.tar) |
| MobileNetV3_large_x1_0 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/MobileNetV3_large_x1_0) | [.tar](https://paddle-serving.bj.bcebos.com/model/MobileNetV3_large_x1_0.tar) |
| HRNet_W18_C | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/HRNet_W18_C) | [.tar](https://paddle-serving.bj.bcebos.com/model/HRNet_W18_C.tar) |
| ShuffleNetV2_x1_0 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ShuffleNetV2_x1_0) | [.tar](https://paddle-serving.bj.bcebos.com/model/ShuffleNetV2_x1_0.tar) |
| bert_chinese_L-12_H-768_A-12 | PaddleNLP | [C++ Serving](../examples/C++/PaddleNLP/bert)</br>[Pipeline Serving](../examples/Pipeline/PaddleNLP/bert) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SemanticModel/bert_chinese_L-12_H-768_A-12.tar.gz) |
| senta_bilstm | PaddleNLP | [C++ Serving](../examples/C++/PaddleNLP/senta) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SentimentAnalysis/senta_bilstm.tar.gz) |C++ Serving|
| lac | PaddleNLP | [C++ Serving](../examples/C++/PaddleNLP/lac) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/LexicalAnalysis/lac.tar.gz) |
| transformer | PaddleNLP | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/machine_translation/transformer/deploy/serving/README.md) | [model](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/machine_translation/transformer) |
| criteo_ctr | PaddleRec | [C++ Serving](../examples/C++/PaddleRec/criteo_ctr) | [.tar.gz](https://paddle-serving.bj.bcebos.com/criteo_ctr_example/criteo_ctr_demo_model.tar.gz) |
| criteo_ctr_with_cube | PaddleRec | [C++ Serving](../examples/C++/PaddleRec/criteo_ctr_with_cube) | [.tar.gz](https://paddle-serving.bj.bcebos.com/unittest/ctr_cube_unittest.tar.gz) |
| wide&deep | PaddleRec | [C++ Serving](https://github.com/PaddlePaddle/PaddleRec/blob/release/2.1.0/doc/serving.md) | [model](https://github.com/PaddlePaddle/PaddleRec/blob/release/2.1.0/models/rank/wide_deep/README.md) |
| blazeface | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/blazeface) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ObjectDetection/blazeface.tar.gz) |C++ Serving|
| cascade_mask_rcnn_r50_vd_fpn_ssld_2x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/cascade_rcnn) | [.tar.gz](https://paddle-serving.bj.bcebos.com/pddet_demo/cascade_mask_rcnn_r50_vd_fpn_ssld_2x_coco_serving.tar.gz) |
| yolov4 | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/yolov4) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ObjectDetection/yolov4.tar.gz) |C++ Serving|
| faster_rcnn_hrnetv2p_w18_1x | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/faster_rcnn_hrnetv2p_w18_1x) | [.tar.gz](https://paddle-serving.bj.bcebos.com/pddet_demo/faster_rcnn_hrnetv2p_w18_1x.tar.gz) |
| fcos_dcn_r50_fpn_1x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/fcos_dcn_r50_fpn_1x_coco) | [.tar.gz](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/fcos_dcn_r50_fpn_1x_coco.tar) |
| ssd_vgg16_300_240e_voc | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/ssd_vgg16_300_240e_voc) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/ssd_vgg16_300_240e_voc.tar) |
| yolov3_darknet53_270e_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/yolov3_darknet53_270e_coco)</br>[Pipeline Serving](../examples/Pipeline/PaddleDetection/yolov3) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/yolov3_darknet53_270e_coco.tar) |
| faster_rcnn_r50_fpn_1x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/faster_rcnn_r50_fpn_1x_coco)</br>[Pipeline Serving](../examples/Pipeline/PaddleDetection/faster_rcnn) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/faster_rcnn_r50_fpn_1x_coco.tar) |
| ppyolo_r50vd_dcn_1x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/ppyolo_r50vd_dcn_1x_coco) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/ppyolo_r50vd_dcn_1x_coco.tar) |
| ppyolo_mbv3_large_coco | PaddleDetection | [Pipeline Serving](../examples/Pipeline/PaddleDetection/ppyolo_mbv3) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/ppyolo_mbv3_large_coco.tar) |
| ttfnet_darknet53_1x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/ttfnet_darknet53_1x_coco) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/ttfnet_darknet53_1x_coco.tar) |
| YOLOv3-DarkNet | PaddleDetection | [C++ Serving](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.3/deploy/serving) | [.pdparams](https://paddledet.bj.bcebos.com/models/yolov3_darknet53_270e_coco.pdparams)</br>[.yml](https://github.com/PaddlePaddle/PaddleDetection/blob/develop/configs/yolov3/yolov3_darknet53_270e_coco.yml) |
| ocr_rec | PaddleOCR | [C++ Serving](../examples/C++/PaddleOCR/ocr)</br>[Pipeline Serving](../examples/Pipeline/PaddleOCR/ocr) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/OCR/ocr_rec.tar.gz) |
| ocr_det | PaddleOCR | [C++ Serving](../examples/C++/PaddleOCR/ocr)</br>[Pipeline Serving](../examples/Pipeline/PaddleOCR/ocr) | [.tar.gz](https://paddle-serving.bj.bcebos.com/ocr/ocr_det.tar.gz) |
| ch_ppocr_mobile_v2.0_det | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_det_infer.tar)</br>[.yml](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/configs/det/ch_ppocr_v2.0/ch_det_mv3_db_v2.0.yml) |
| ch_ppocr_server_v2.0_det | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_server_v2.0_det_infer.tar)</br>[.yml](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/configs/det/ch_ppocr_v2.0/ch_det_res18_db_v2.0.yml) |
| ch_ppocr_mobile_v2.0_rec | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_rec_infer.tar)</br>[.yml](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/configs/rec/ch_ppocr_v2.0/rec_chinese_lite_train_v2.0.yml) |
| ch_ppocr_server_v2.0_rec | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_server_v2.0_rec_infer.tar)</br>[.yml](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/configs/rec/ch_ppocr_v2.0/rec_chinese_common_train_v2.0.yml) |
| ch_ppocr_mobile_v2.0 | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://github.com/PaddlePaddle/PaddleOCR) |
| ch_ppocr_server_v2.0 | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://github.com/PaddlePaddle/PaddleOCR) |
| deeplabv3 | PaddleSeg | [C++ Serving](../examples/C++/PaddleSeg/deeplabv3) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageSegmentation/deeplabv3.tar.gz) |
| unet | PaddleSeg | [C++ Serving](../examples/C++/PaddleSeg/unet_for_image_seg) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageSegmentation/unet.tar.gz) |
- Refer [example](../examples) for more details on above models.
- Refer [wholechain](https://www.paddlepaddle.org.cn/wholechain) for more pre-trained models supported by PaddleServing
- Latest models refer
- [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PaddleClas](https://github.com/PaddlePaddle/PaddleClas)
- [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP)
- [PaddleRec](https://github.com/PaddlePaddle/PaddleRec)
- [PaddleSeg](https://github.com/PaddlePaddle/PaddleSeg)
- [PaddleGAN](https://github.com/PaddlePaddle/PaddleGAN)
# Performance Optimization
([简体中文](./PERFORMANCE_OPTIM_CN.md)|English)
Due to different model structures, different prediction services consume different computing resources when performing predictions. For online prediction services, models that require less computing resources will have a higher proportion of communication time cost, which is called communication-intensive service. Models that require more computing resources have a higher time cost for inference calculations, which is called computation-intensive services.
For a prediction service, the easiest way to determine the type of service is to look at the time ratio. Paddle Serving provides [Timeline tool](../python/examples/util/README_CN.md), which can intuitively display the time spent in each stage of the prediction service.
For communication-intensive prediction services, requests can be aggregated, and within a limit that can tolerate delay, multiple prediction requests can be combined into a batch for prediction.
For computation-intensive prediction services, you can use GPU prediction services instead of CPU prediction services, or increase the number of graphics cards for GPU prediction services.
Under the same conditions, the communication time of the HTTP prediction service provided by Paddle Serving is longer than that of the RPC prediction service, so for communication-intensive services, please give priority to using RPC communication.
Parameters for performance optimization:
The memory/graphic memory optimization option is enabled by default in Paddle Serving, which can reduce the memory/video memory usage and usually does not affect performance. If you need to turn it off, you can use --mem_optim_off in the command line.
r_optim can optimize the calculation graph and increase the inference speed. It is turned off by default and turned on by --ir_optim in the command line.
| Parameters | Type | Default | Description |
| ---------- | ---- | ------- | ------------------------------------------------------------ |
| mem_optim_off | - | - | Disable memory / graphic memory optimization |
| ir_optim | - | - | Enable analysis and optimization of calculation graph,including OP fusion, etc |
For the mode of using Python code to start the prediction service, the API of the above two parameters is as follows:
RPC Service
```
from paddle_serving_server import Server
server = Server()
...
server.set_memory_optimize(mem_optim)
server.set_ir_optimize(ir_optim)
...
```
HTTP Service
```
from paddle_serving_server import WebService
class NewService(WebService):
...
new_service = NewService(name="new")
...
new_service.prepare_server(mem_optim=True, ir_optim=False)
...
```
# 性能优化
(简体中文|[English](./PERFORMANCE_OPTIM.md))
由于模型结构的不同,在执行预测时不同的预测服务对计算资源的消耗也不相同。对于在线的预测服务来说,对计算资源要求较少的模型,通信的时间成本占比就会较高,称为通信密集型服务,对计算资源要求较多的模型,推理计算的时间成本较高,称为计算密集型服务。对于这两种服务类型,可以根据实际需求采取不同的方式进行优化
对于一个预测服务来说,想要判断属于哪种类型,最简单的方法就是看时间占比,Paddle Serving提供了[Timeline工具](../python/examples/util/README_CN.md),可以直观的展现预测服务中各阶段的耗时。
对于通信密集型的预测服务,可以将请求进行聚合,在对延时可以容忍的限度内,将多个预测请求合并成一个batch进行预测。
对于计算密集型的预测服务,可以使用GPU预测服务代替CPU预测服务,或者增加GPU预测服务的显卡数量。
在相同条件下,Paddle Serving提供的HTTP预测服务的通信时间是大于RPC预测服务的,因此对于通信密集型的服务请优先考虑使用RPC的通信方式。
性能优化相关参数:
Paddle Serving中默认开启内存/显存优化选项,可以减少对内存/显存的占用,通常不会对性能造成影响,如果需要关闭可以在命令行启动模式中使用--mem_optim_off。
ir_optim可以优化计算图,提升推理速度,默认关闭,在命令行启动的模式中通过--ir_optim开启。
| 参数 | 类型 | 默认值 | 含义 |
| --------- | ---- | ------ | -------------------------------- |
| mem_optim_off | - | - | 关闭内存/显存优化 |
| ir_optim | - | - | 开启计算图分析优化,包括OP融合等 |
对于使用Python代码启动预测服务的模式,以上两个参数的接口如下:
RPC服务
```
from paddle_serving_server import Server
server = Server()
...
server.set_memory_optimize(mem_optim)
server.set_ir_optimize(ir_optim)
...
```
HTTP服务
```
from paddle_serving_server import WebService
class NewService(WebService):
...
new_service = NewService(name="new")
...
new_service.prepare_server(mem_optim=True, ir_optim=False)
...
```
# Pipeline Serving 性能优化
([English](./Performance_Tuning_EN.md)|简体中文)
## 1. 性能分析与优化
### 1.1 如何通过 Timeline 工具进行优化
为了更好地对性能进行优化,PipelineServing 提供了 Timeline 工具,对整个服务的各个阶段时间进行打点。
### 1.2 在 Server 端输出 Profile 信息
Server 端用 yaml 中的 `use_profile` 字段进行控制:
```yaml
dag:
use_profile: true
```
开启该功能后,Server 端在预测的过程中会将对应的日志信息打印到标准输出,为了更直观地展现各阶段的耗时,提供 Analyst 模块对日志文件做进一步的分析处理。
使用时先将 Server 的输出保存到文件,以 `profile.txt` 为例,脚本将日志中的时间打点信息转换成 json 格式保存到 `trace` 文件,`trace` 文件可以通过 chrome 浏览器的 tracing 功能进行可视化。
```python
from paddle_serving_server.pipeline import Analyst
import json
import sys
if __name__ == "__main__":
log_filename = "profile.txt"
trace_filename = "trace"
analyst = Analyst(log_filename)
analyst.save_trace(trace_filename)
```
具体操作:打开 chrome 浏览器,在地址栏输入 `chrome://tracing/` ,跳转至 tracing 页面,点击 load 按钮,打开保存的 `trace` 文件,即可将预测服务的各阶段时间信息可视化。
### 1.3 在 Client 端输出 Profile 信息
Client 端在 `predict` 接口设置 `profile=True`,即可开启 Profile 功能。
开启该功能后,Client 端在预测的过程中会将该次预测对应的日志信息打印到标准输出,后续分析处理同 Server。
### 1.4 分析方法
根据pipeline.tracer日志中的各个阶段耗时,按以下公式逐步分析出主要耗时在哪个阶段。
```
单OP耗时:
op_cost = process(pre + mid + post)
OP期望并发数:
op_concurrency = 单OP耗时(s) * 期望QPS
服务吞吐量:
service_throughput = 1 / 最慢OP的耗时 * 并发数
服务平响:
service_avg_cost = ∑op_concurrency 【关键路径】
Channel堆积:
channel_acc_size = QPS(down - up) * time
批量预测平均耗时:
avg_batch_cost = (N * pre + mid + post) / N
```
### 1.5 优化思路
根据长耗时在不同阶段,采用不同的优化方法.
- OP推理阶段(mid-process):
- 增加OP并发度
- 开启auto-batching(前提是多个请求的shape一致)
- 若批量数据中某条数据的shape很大,padding很大导致推理很慢,可使用mini-batch
- 开启TensorRT/MKL-DNN优化
- 开启低精度推理
- OP前处理阶段(pre-process):
- 增加OP并发度
- 优化前处理逻辑
- in/out耗时长(channel堆积>5)
- 检查channel传递的数据大小和延迟
- 优化传入数据,不传递数据或压缩后再传入
- 增加OP并发度
- 减少上游OP并发度
# Pipeline Serving Performance Optimization
(English|[简体中文](./Performance_Tuning_CN.md))
## 1. Performance analysis and optimization
### 1.1 How to optimize with the timeline tool
In order to better optimize the performance, PipelineServing provides a timeline tool to monitor the time of each stage of the whole service.
### 1.2 Output profile information on server side
The server is controlled by the `use_profile` field in yaml:
```yaml
dag:
use_profile: true
```
After the function is enabled, the server will print the corresponding log information to the standard output in the process of prediction. In order to show the time consumption of each stage more intuitively, Analyst module is provided for further analysis and processing of log files.
The output of the server is first saved to a file. Taking `profile.txt` as an example, the script converts the time monitoring information in the log into JSON format and saves it to the `trace` file. The `trace` file can be visualized through the tracing function of Chrome browser.
```shell
from paddle_serving_server.pipeline import Analyst
import json
import sys
if __name__ == "__main__":
log_filename = "profile.txt"
trace_filename = "trace"
analyst = Analyst(log_filename)
analyst.save_trace(trace_filename)
```
Specific operation: open Chrome browser, input in the address bar `chrome://tracing/` , jump to the tracing page, click the load button, open the saved `trace` file, and then visualize the time information of each stage of the prediction service.
### 1.3 Output profile information on client side
The profile function can be enabled by setting `profile=True` in the `predict` interface on the client side.
After the function is enabled, the client will print the log information corresponding to the prediction to the standard output during the prediction process, and the subsequent analysis and processing are the same as that of the server.
### 1.4 Analytical methods
According to the time consumption of each stage in the pipeline.tracer log, the following formula is used to gradually analyze which stage is the main time consumption.
```
cost of one single OP:
op_cost = process(pre + mid + post)
OP Concurrency:
op_concurrency = op_cost(s) * qps_expected
Service throughput:
service_throughput = 1 / slowest_op_cost * op_concurrency
Service average cost:
service_avg_cost = ∑op_concurrency in critical Path
Channel accumulations:
channel_acc_size = QPS(down - up) * time
Average cost of batch predictor:
avg_batch_cost = (N * pre + mid + post) / N
```
### 1.5 Optimization ideas
According to the long time consuming in stages below, different optimization methods are adopted.
- OP Inference stage(mid-process):
- Increase `concurrency`
- Turn on `auto-batching`(Ensure that the shapes of multiple requests are consistent)
- Use `mini-batch`, If the shape of data is very large.
- Turn on TensorRT for GPU
- Turn on MKLDNN for CPU
- Turn on low precison inference
- OP preprocess or postprocess stage:
- Increase `concurrency`
- Optimize processing logic
- In/Out stage(channel accumulation > 5):
- Check the size and delay of the data passed by the channel
- Optimize the channel to transmit data, do not transmit data or compress it before passing it in
- Increase `concurrency`
- Decrease `concurrency` upstreams.
# Pipeline Serving # Pipeline Serving
(简体中文|[English](PIPELINE_SERVING.md)) (简体中文|[English](Pipeline_Design_EN.md))
- [架构设计](PIPELINE_SERVING_CN.md#1架构设计) - [架构设计](Pipeline_Design_CN.md#1架构设计)
- [详细设计](PIPELINE_SERVING_CN.md#2详细设计) - [详细设计](Pipeline_Design_CN.md#2详细设计)
- [典型示例](PIPELINE_SERVING_CN.md#3典型示例) - [典型示例](Pipeline_Design_CN.md#3典型示例)
- [高阶用法](PIPELINE_SERVING_CN.md#4高阶用法) - [高阶用法](Pipeline_Design_CN.md#4高阶用法)
- [日志追踪](PIPELINE_SERVING_CN.md#5日志追踪) - [日志追踪](Pipeline_Design_CN.md#5日志追踪)
- [性能分析与优化](PIPELINE_SERVING_CN.md#6性能分析与优化) - [性能分析与优化](Pipeline_Design_CN.md#6性能分析与优化)
在许多深度学习框架中,Serving通常用于单模型的一键部署。在AI工业大生产的背景下,端到端的深度学习模型当前还不能解决所有问题,多个深度学习模型配合起来使用还是解决现实问题的常规手段。但多模型应用设计复杂,为了降低开发和维护难度,同时保证服务的可用性,通常会采用串行或简单的并行方式,但一般这种情况下吞吐量仅达到可用状态,而且GPU利用率偏低。 在许多深度学习框架中,Serving通常用于单模型的一键部署。在AI工业大生产的背景下,端到端的深度学习模型当前还不能解决所有问题,多个深度学习模型配合起来使用还是解决现实问题的常规手段。但多模型应用设计复杂,为了降低开发和维护难度,同时保证服务的可用性,通常会采用串行或简单的并行方式,但一般这种情况下吞吐量仅达到可用状态,而且GPU利用率偏低。
...@@ -316,14 +316,14 @@ class ResponseOp(Op): ...@@ -316,14 +316,14 @@ class ResponseOp(Op):
*** ***
## 3.典型示例 ## 3.典型示例
所有Pipeline示例在[examples/pipeline/](../python/examples/pipeline) 目录下,目前有7种类型模型示例: 所有Pipeline示例在[examples/Pipeline/](../../examples/Pipeline) 目录下,目前有7种类型模型示例:
- [PaddleClas](../python/examples/pipeline/PaddleClas) - [PaddleClas](../../examples/Pipeline/PaddleClas)
- [Detection](../python/examples/pipeline/PaddleDetection) - [Detection](../../examples/Pipeline/PaddleDetection)
- [bert](../python/examples/pipeline/bert) - [bert](../../examples/Pipeline/bert)
- [imagenet](../python/examples/pipeline/imagenet) - [imagenet](../../examples/Pipeline/PaddleClas/imagenet)
- [imdb_model_ensemble](../python/examples/pipeline/imdb_model_ensemble) - [imdb_model_ensemble](../../examples/Pipeline/imdb_model_ensemble)
- [ocr](../python/examples/pipeline/ocr) - [ocr](../../examples/Pipeline/PaddleOCR/ocr)
- [simple_web_service](../python/examples/pipeline/simple_web_service) - [simple_web_service](../../examples/Pipeline/simple_web_service)
以 imdb_model_ensemble 为例来展示如何使用 Pipeline Serving,相关代码在 `python/examples/pipeline/imdb_model_ensemble` 文件夹下可以找到,例子中的 Server 端结构如下图所示: 以 imdb_model_ensemble 为例来展示如何使用 Pipeline Serving,相关代码在 `python/examples/pipeline/imdb_model_ensemble` 文件夹下可以找到,例子中的 Server 端结构如下图所示:
......
# Pipeline Serving # Pipeline Serving
([简体中文](PIPELINE_SERVING_CN.md)|English) ([简体中文](Pipeline_Design_CN.md)|English)
- [Architecture Design](PIPELINE_SERVING.md#1architecture-design) - [Architecture Design](Pipeline_Design_EN.md#1architecture-design)
- [Detailed Design](PIPELINE_SERVING.md#2detailed-design) - [Detailed Design](Pipeline_Design_EN.md#2detailed-design)
- [Classic Examples](PIPELINE_SERVING.md#3classic-examples) - [Classic Examples](Pipeline_Design_EN.md#3classic-examples)
- [Advanced Usages](PIPELINE_SERVING.md#4advanced-usages) - [Advanced Usages](Pipeline_Design_EN.md#4advanced-usages)
- [Log Tracing](PIPELINE_SERVING.md#5log-tracing) - [Log Tracing](Pipeline_Design_EN.md#5log-tracing)
- [Performance Analysis And Optimization](PIPELINE_SERVING.md#6performance-analysis-and-optimization) - [Performance Analysis And Optimization](Pipeline_Design_EN.md#6performance-analysis-and-optimization)
In many deep learning frameworks, Serving is usually used for the deployment of single model.but in the context of AI industrial, the end-to-end deep learning model can not solve all the problems at present. Usually, it is necessary to use multiple deep learning models to solve practical problems.However, the design of multi-model applications is complicated. In order to reduce the difficulty of development and maintenance, and to ensure the availability of services, serial or simple parallel methods are usually used. In general, the throughput only reaches the usable state and the GPU utilization rate is low. In many deep learning frameworks, Serving is usually used for the deployment of single model.but in the context of AI industrial, the end-to-end deep learning model can not solve all the problems at present. Usually, it is necessary to use multiple deep learning models to solve practical problems.However, the design of multi-model applications is complicated. In order to reduce the difficulty of development and maintenance, and to ensure the availability of services, serial or simple parallel methods are usually used. In general, the throughput only reaches the usable state and the GPU utilization rate is low.
...@@ -311,14 +311,14 @@ The default implementation of **pack_response_package** is to convert the dictio ...@@ -311,14 +311,14 @@ The default implementation of **pack_response_package** is to convert the dictio
## 3.Classic Examples ## 3.Classic Examples
All examples of pipelines are in [examples/pipeline/](../python/examples/pipeline) directory, There are 7 types of model examples currently: All examples of pipelines are in [examples/pipeline/](../../examples/Pipeline) directory, There are 7 types of model examples currently:
- [PaddleClas](../python/examples/pipeline/PaddleClas) - [PaddleClas](../../examples/Pipeline/PaddleClas)
- [Detection](../python/examples/pipeline/PaddleDetection) - [Detection](../../examples/Pipeline/PaddleDetection)
- [bert](../python/examples/pipeline/bert) - [bert](../../examples/Pipeline/bert)
- [imagenet](../python/examples/pipeline/imagenet) - [imagenet](../../examples/Pipeline/PaddleClas/imagenet)
- [imdb_model_ensemble](../python/examples/pipeline/imdb_model_ensemble) - [imdb_model_ensemble](../../examples/Pipeline/imdb_model_ensemble)
- [ocr](../python/examples/pipeline/ocr) - [ocr](../../examples/Pipeline/PaddleOCR/ocr)
- [simple_web_service](../python/examples/pipeline/simple_web_service) - [simple_web_service](../../examples/Pipeline/simple_web_service)
Here, we build a simple imdb model enable example to show how to use Pipeline Serving. The relevant code can be found in the `python/examples/pipeline/imdb_model_ensemble` folder. The Server-side structure in the example is shown in the following figure: Here, we build a simple imdb model enable example to show how to use Pipeline Serving. The relevant code can be found in the `python/examples/pipeline/imdb_model_ensemble` folder. The Server-side structure in the example is shown in the following figure:
......
## Paddle Serving 快速开始示例
([English](./Quick_Start_EN.md)|简体中文)
这个快速开始示例主要是为了给那些已经有一个要部署的模型的用户准备的,而且我们也提供了一个可以用来部署的模型。如果您想知道如何从离线训练到在线服务走完全流程,请参考前文的AiStudio教程。
<h3 align="center">波士顿房价预测</h3>
进入到Serving的git目录下,进入到`fit_a_line`例子
``` shell
cd Serving/examples/C++/fit_a_line
sh get_data.sh
```
Paddle Serving 为用户提供了基于 HTTP 和 RPC 的服务
<h3 align="center">RPC服务</h3>
用户还可以使用`paddle_serving_server.serve`启动RPC服务。 尽管用户需要基于Paddle Serving的python客户端API进行一些开发,但是RPC服务通常比HTTP服务更快。需要指出的是这里我们没有指定`--name`
``` shell
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292
```
<center>
| Argument | Type | Default | Description |
| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
| `thread` | int | `2` | Number of brpc service thread |
| `op_num` | int[]| `0` | Thread Number for each model in asynchronous mode |
| `op_max_batch` | int[]| `32` | Batch Number for each model in asynchronous mode |
| `gpu_ids` | str[]| `"-1"` | Gpu card id for each model |
| `port` | int | `9292` | Exposed port of current service to users |
| `model` | str[]| `""` | Path of paddle model directory to be served |
| `mem_optim_off` | - | - | Disable memory / graphic memory optimization |
| `ir_optim` | bool | False | Enable analysis and optimization of calculation graph |
| `use_mkl` (Only for cpu version) | - | - | Run inference with MKL |
| `use_trt` (Only for trt version) | - | - | Run inference with TensorRT |
| `use_lite` (Only for Intel x86 CPU or ARM CPU) | - | - | Run PaddleLite inference |
| `use_xpu` | - | - | Run PaddleLite inference with Baidu Kunlun XPU |
| `precision` | str | FP32 | Precision Mode, support FP32, FP16, INT8 |
| `use_calib` | bool | False | Use TRT int8 calibration |
| `gpu_multi_stream` | bool | False | EnableGpuMultiStream to get larger QPS |
#### 异步模型的说明
异步模式适用于1、请求数量非常大的情况,2、多模型串联,想要分别指定每个模型的并发数的情况。
异步模式有助于提高Service服务的吞吐(QPS),但对于单次请求而言,时延会有少量增加。
异步模式中,每个模型会启动您指定个数的N个线程,每个线程中包含一个模型实例,换句话说每个模型相当于包含N个线程的线程池,从线程池的任务队列中取任务来执行。
异步模式中,各个RPC Server的线程只负责将Request请求放入模型线程池的任务队列中,等任务被执行完毕后,再从任务队列中取出已完成的任务。
上表中通过 --thread 10 指定的是RPC Server的线程数量,默认值为2,--op_num 指定的是各个模型的线程池中线程数N,默认值为0,表示不使用异步模式。
--op_max_batch 指定的各个模型的batch数量,默认值为32,该参数只有当--op_num不为0时才生效。
#### 当您的某个模型想使用多张GPU卡部署时.
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --gpu_ids 0,1,2
#### 当您的一个服务包含两个模型部署时.
python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292
#### 当您的一个服务包含两个模型,且每个模型都需要指定多张GPU卡部署时.
python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 --gpu_ids 0,1 1,2
#### 当您的一个服务包含两个模型,且每个模型都需要指定多张GPU卡,且需要异步模式每个模型指定不同的并发数时.
python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 --gpu_ids 0,1 1,2 --op_num 4 8
</center>
``` python
# A user can visit rpc service through paddle_serving_client API
from paddle_serving_client import Client
client = Client()
client.load_client_config("uci_housing_client/serving_client_conf.prototxt")
client.connect(["127.0.0.1:9292"])
data = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727,
-0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]
fetch_map = client.predict(feed={"x": np.array(data).reshape(1,13,1)}, fetch=["price"])
print(fetch_map)
```
在这里,`client.predict`函数具有两个参数。 `feed`是带有模型输入变量别名和值的`python dict``fetch`被要从服务器返回的预测变量赋值。 在该示例中,在训练过程中保存可服务模型时,被赋值的tensor名为`"x"``"price"`
<h3 align="center">HTTP服务</h3>
用户也可以将数据格式处理逻辑放在服务器端进行,这样就可以直接用curl去访问服务,参考如下案例,在目录`Serving/examples/C++/fit_a_line`.
```
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --name uci
```
客户端输入
```
curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"x": [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]}], "fetch":["price"]}' http://127.0.0.1:9292/uci/prediction
```
返回结果
```
{"result":{"price":[[18.901151657104492]]}}
```
<h3 align="center">Pipeline服务</h3>
Paddle Serving提供业界领先的多模型串联服务,强力支持各大公司实际运行的业务场景,参考 [OCR文字识别案例](../examples/Pipeline/PaddleOCR/ocr/),在目录`python/examples/pipeline/ocr`
我们先获取两个模型
```
python3 -m paddle_serving_app.package --get_model ocr_rec
tar -xzvf ocr_rec.tar.gz
python3 -m paddle_serving_app.package --get_model ocr_det
tar -xzvf ocr_det.tar.gz
```
然后启动服务端程序,将两个串联的模型作为一个整体的服务。
```
python3 web_service.py
```
最终使用http的方式请求
```
python3 pipeline_http_client.py
```
也支持rpc的方式
```
python3 pipeline_rpc_client.py
```
输出
```
{'err_no': 0, 'err_msg': '', 'key': ['res'], 'value': ["['土地整治与土壤修复研究中心', '华南农业大学1素图']"]}
```
## Paddle Serving Quick Start Examples
(English|[简体中文](./Quick_Start_CN.md))
This quick start example is mainly for those users who already have a model to deploy, and we also provide a model that can be used for deployment. in case if you want to know how to complete the process from offline training to online service, please refer to the AiStudio tutorial above.
### Boston House Price Prediction model
get into the Serving git directory, and change dir to `fit_a_line`
``` shell
cd Serving/examples/C++/fit_a_line
sh get_data.sh
```
Paddle Serving provides HTTP and RPC based service for users to access
### RPC service
A user can also start a RPC service with `paddle_serving_server.serve`. RPC service is usually faster than HTTP service, although a user needs to do some coding based on Paddle Serving's python client API. Note that we do not specify `--name` here.
``` shell
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292
```
<center>
| Argument | Type | Default | Description |
| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
| `thread` | int | `4` | Concurrency of current service |
| `port` | int | `9292` | Exposed port of current service to users |
| `model` | str | `""` | Path of paddle model directory to be served |
| `mem_optim_off` | - | - | Disable memory / graphic memory optimization |
| `ir_optim` | bool | False | Enable analysis and optimization of calculation graph |
| `use_mkl` (Only for cpu version) | - | - | Run inference with MKL |
| `use_trt` (Only for trt version) | - | - | Run inference with TensorRT |
| `use_lite` (Only for Intel x86 CPU or ARM CPU) | - | - | Run PaddleLite inference |
| `use_xpu` | - | - | Run PaddleLite inference with Baidu Kunlun XPU |
| `precision` | str | FP32 | Precision Mode, support FP32, FP16, INT8 |
| `use_calib` | bool | False | Only for deployment with TensorRT |
</center>
```python
# A user can visit rpc service through paddle_serving_client API
from paddle_serving_client import Client
import numpy as np
client = Client()
client.load_client_config("uci_housing_client/serving_client_conf.prototxt")
client.connect(["127.0.0.1:9292"])
data = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727,
-0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]
fetch_map = client.predict(feed={"x": np.array(data).reshape(1,13,1)}, fetch=["price"])
print(fetch_map)
```
Here, `client.predict` function has two arguments. `feed` is a `python dict` with model input variable alias name and values. `fetch` assigns the prediction variables to be returned from servers. In the example, the name of `"x"` and `"price"` are assigned when the servable model is saved during training.
### WEB service
Users can also put the data format processing logic on the server side, so that they can directly use curl to access the service, refer to the following case whose path is `Serving/examples/C++/fit_a_line`
```
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --name uci
```
for client side,
```
curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"x": [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]}], "fetch":["price"]}' http://127.0.0.1:9292/uci/prediction
```
the response is
```
{"result":{"price":[[18.901151657104492]]}}
```
<h3 align="center">Pipeline Service</h3>
Paddle Serving provides industry-leading multi-model tandem services, which strongly supports the actual operating business scenarios of major companies, please refer to [OCR word recognition](../examples/Pipeline/PaddleOCR/ocr/).
we get two models
```
python3 -m paddle_serving_app.package --get_model ocr_rec
tar -xzvf ocr_rec.tar.gz
python3 -m paddle_serving_app.package --get_model ocr_det
tar -xzvf ocr_det.tar.gz
```
then we start server side, launch two models as one standalone web service
```
python3 web_service.py
```
http request
```
python3 pipeline_http_client.py
```
grpc request
```
python3 pipeline_rpc_client.py
```
output
```
{'err_no': 0, 'err_msg': '', 'key': ['res'], 'value': ["['土地整治与土壤修复研究中心', '华南农业大学1素图']"]}
```
# 如何在Docker中运行PaddleServing # 如何在Docker中运行PaddleServing
(简体中文|[English](RUN_IN_DOCKER.md)) (简体中文|[English](Run_In_Docker_EN.md))
Docker最大的好处之一就是可移植性,可在多种操作系统和主流的云计算平台部署。使用Paddle Serving Docker镜像可在Linux、Mac和Windows平台部署。 Docker最大的好处之一就是可移植性,可在多种操作系统和主流的云计算平台部署。使用Paddle Serving Docker镜像可在Linux、Mac和Windows平台部署。
...@@ -14,7 +14,7 @@ Docker(GPU版本需要在GPU机器上安装nvidia-docker) ...@@ -14,7 +14,7 @@ Docker(GPU版本需要在GPU机器上安装nvidia-docker)
### 获取镜像 ### 获取镜像
参考[该文档](DOCKER_IMAGES_CN.md)获取镜像: 参考[该文档](Docker_Images_CN.md)获取镜像:
以CPU编译镜像为例 以CPU编译镜像为例
...@@ -59,9 +59,9 @@ docker exec -it test bash ...@@ -59,9 +59,9 @@ docker exec -it test bash
### 安装PaddleServing ### 安装PaddleServing
请参照首页的指导,下载对应版本的pip包。[最新安装包合集](LATEST_PACKAGES.md) 请参照首页的指导,下载对应版本的pip包。[最新安装包合集](Latest_Packages_CN.md)
## 注意事项 ## 注意事项
- 运行时镜像不能用于开发编译。如果想要从源码编译,请查看[如何编译PaddleServing](COMPILE.md) - 运行时镜像不能用于开发编译。如果想要从源码编译,请查看[如何编译PaddleServing](Compile_CN.md)
- 由于Cuda10和Cuda9的环境受限于GCC版本,无法同时运行CPU版本的`paddle_serving_server`,因此如果想要在GPU环境中同时使用CPU版本的`paddle_serving_server`,请选择Cuda10.1,Cuda10.2和Cuda11版本的镜像。 - 由于Cuda10和Cuda9的环境受限于GCC版本,无法同时运行CPU版本的`paddle_serving_server`,因此如果想要在GPU环境中同时使用CPU版本的`paddle_serving_server`,请选择Cuda10.1,Cuda10.2和Cuda11版本的镜像。
# How to run PaddleServing in Docker # How to run PaddleServing in Docker
([简体中文](RUN_IN_DOCKER_CN.md)|English) ([简体中文](Run_In_Docker_CN.md)|English)
One of the biggest benefits of Docker is portability, which can be deployed on multiple operating systems and mainstream cloud computing platforms. The Paddle Serving Docker image can be deployed on Linux, Mac and Windows platforms. One of the biggest benefits of Docker is portability, which can be deployed on multiple operating systems and mainstream cloud computing platforms. The Paddle Serving Docker image can be deployed on Linux, Mac and Windows platforms.
...@@ -14,7 +14,7 @@ This document takes Python2 as an example to show how to run Paddle Serving in d ...@@ -14,7 +14,7 @@ This document takes Python2 as an example to show how to run Paddle Serving in d
### Get docker image ### Get docker image
Refer to [this document](DOCKER_IMAGES.md) for a docker image: Refer to [this document](Docker_Images_EN.md) for a docker image:
```shell ```shell
docker pull registry.baidubce.com/paddlepaddle/serving:latest-devel docker pull registry.baidubce.com/paddlepaddle/serving:latest-devel
...@@ -41,7 +41,7 @@ The GPU version is basically the same as the CPU version, with only some differe ...@@ -41,7 +41,7 @@ The GPU version is basically the same as the CPU version, with only some differe
### Get docker image ### Get docker image
Refer to [this document](DOCKER_IMAGES.md) for a docker image, the following is an example of an `cuda9.0-cudnn7` image: Refer to [this document](Docker_Images_EN.md) for a docker image, the following is an example of an `cuda9.0-cudnn7` image:
```shell ```shell
docker pull registry.baidubce.com/paddlepaddle/serving:latest-cuda10.2-cudnn8-devel docker pull registry.baidubce.com/paddlepaddle/serving:latest-cuda10.2-cudnn8-devel
...@@ -67,9 +67,9 @@ The `-p` option is to map the `9292` port of the container to the `9292` port of ...@@ -67,9 +67,9 @@ The `-p` option is to map the `9292` port of the container to the `9292` port of
The mirror comes with `paddle_serving_server_gpu`, `paddle_serving_client`, and `paddle_serving_app` corresponding to the mirror tag version. If users don’t need to change the version, they can use it directly, which is suitable for environments without extranet services. The mirror comes with `paddle_serving_server_gpu`, `paddle_serving_client`, and `paddle_serving_app` corresponding to the mirror tag version. If users don’t need to change the version, they can use it directly, which is suitable for environments without extranet services.
If you need to change the version, please refer to the instructions on the homepage to download the pip package of the corresponding version. [LATEST_PACKAGES](./LATEST_PACKAGES.md) If you need to change the version, please refer to the instructions on the homepage to download the pip package of the corresponding version. [LATEST_PACKAGES](./Latest_Packages_CN.md)
## Precautious ## Precautious
- Runtime images cannot be used for compilation. If you want to compile from source, refer to [COMPILE](COMPILE.md). - Runtime images cannot be used for compilation. If you want to compile from source, refer to [COMPILE](Compile_EN.md).
- If you use Cuda9 and Cuda10 docker images, you cannot use `paddle_serving_server` CPU version at the same time, due to the limitation of gcc version. If you want to use both in one docker image, please choose images of Cuda10.1, Cuda10.2 and Cuda11. - If you use Cuda9 and Cuda10 docker images, you cannot use `paddle_serving_server` CPU version at the same time, due to the limitation of gcc version. If you want to use both in one docker image, please choose images of Cuda10.1, Cuda10.2 and Cuda11.
...@@ -47,12 +47,12 @@ bash tools/generate_runtime_docker.sh --help ...@@ -47,12 +47,12 @@ bash tools/generate_runtime_docker.sh --help
#### Pipeline模式: #### Pipeline模式:
对于pipeline模式,我们需要确保模型和程序文件、配置文件等各种依赖都能够在镜像中运行。因此可以在`/home/project`下存放我们的执行文件时,我们以`Serving/python/example/pipeline/ocr`为例,这是OCR文字识别任务。 对于pipeline模式,我们需要确保模型和程序文件、配置文件等各种依赖都能够在镜像中运行。因此可以在`/home/project`下存放我们的执行文件时,我们以`Serving/examples/Pipeline/PaddleOCR/ocr`为例,这是OCR文字识别任务。
```bash ```bash
#假设您已经拥有Serving运行镜像,假设镜像名为paddle_serving:cuda10.2-py36 #假设您已经拥有Serving运行镜像,假设镜像名为paddle_serving:cuda10.2-py36
docker run --rm -dit --name pipeline_serving_demo paddle_serving:cuda10.2-py36 bash docker run --rm -dit --name pipeline_serving_demo paddle_serving:cuda10.2-py36 bash
cd Serving/python/example/pipeline/ocr cd Serving/examples/Pipeline/PaddleOCR/ocr
# get models # get models
python -m paddle_serving_app.package --get_model ocr_rec python -m paddle_serving_app.package --get_model ocr_rec
tar -xzvf ocr_rec.tar.gz tar -xzvf ocr_rec.tar.gz
...@@ -80,12 +80,12 @@ python3.6 web_service.py ...@@ -80,12 +80,12 @@ python3.6 web_service.py
#### WebService模式: #### WebService模式:
web service模式本质上和pipeline模式类似,因此我们以`Serving/python/examples/bert`为例 web service模式本质上和pipeline模式类似,因此我们以`Serving/examples/C++/PaddleNLP/bert`为例
```bash ```bash
#假设您已经拥有Serving运行镜像,假设镜像名为registry.baidubce.com/paddlepaddle/serving:0.6.0-cuda10.2-py36 #假设您已经拥有Serving运行镜像,假设镜像名为registry.baidubce.com/paddlepaddle/serving:0.6.0-cuda10.2-py36
docker run --rm -dit --name webservice_serving_demo registry.baidubce.com/paddlepaddle/serving:0.6.0-cpu-py36 bash docker run --rm -dit --name webservice_serving_demo registry.baidubce.com/paddlepaddle/serving:0.6.0-cpu-py36 bash
cd Serving/python/examples/bert cd Serving/examples/C++/PaddleNLP/bert
### download model ### download model
wget https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SemanticModel/bert_chinese_L-12_H-768_A-12.tar.gz wget https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SemanticModel/bert_chinese_L-12_H-768_A-12.tar.gz
tar -xzf bert_chinese_L-12_H-768_A-12.tar.gz tar -xzf bert_chinese_L-12_H-768_A-12.tar.gz
......
此差异已折叠。
# 怎样保存用于Paddle Serving的模型? # 怎样保存用于Paddle Serving的模型?
(简体中文|[English](./SAVE.md)) (简体中文|[English](./Save_EN.md))
## 从已保存的模型文件中导出 ## 从已保存的模型文件中导出
如果已使用Paddle 的`save_inference_model`接口保存出预测要使用的模型,你可以使用Paddle Serving提供的名为`paddle_serving_client.convert`的内置模块进行转换。 如果已使用Paddle 的`save_inference_model`接口保存出预测要使用的模型,你可以使用Paddle Serving提供的名为`paddle_serving_client.convert`的内置模块进行转换。
......
# How to save a servable model of Paddle Serving? # How to save a servable model of Paddle Serving?
([简体中文](./SAVE_CN.md)|English) ([简体中文](./Save_CN.md)|English)
## Export from saved model files ## Export from saved model files
......
...@@ -63,7 +63,7 @@ ee59a3dd4806 registry.baidubce.com/serving_dev/serving-runtime:cpu-py36 ...@@ -63,7 +63,7 @@ ee59a3dd4806 registry.baidubce.com/serving_dev/serving-runtime:cpu-py36
### Step 1:启动Serving服务 ### Step 1:启动Serving服务
我们仍然以 [Uci房价预测](../python/examples/fit_a_line)服务作为例子,这里省略了镜像制作的过程,详情可以参考 [在Kubernetes集群上部署Paddle Serving](./PADDLE_SERVING_ON_KUBERNETES.md) 我们仍然以 [Uci房价预测](../examples/C++/fit_a_line/)服务作为例子,这里省略了镜像制作的过程,详情可以参考 [在Kubernetes集群上部署Paddle Serving](./Run_On_Kubernetes_CN.md)
在这里我们直接执行 在这里我们直接执行
``` ```
......
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
## Windows平台使用Paddle Serving指导 ## Windows平台使用Paddle Serving指导
([English](./WINDOWS_TUTORIAL.md)|简体中文) ([English](./Windows_Tutorial_EN.md)|简体中文)
### 综述 ### 综述
...@@ -38,7 +38,7 @@ pip install -r python/requirements_win.txt ...@@ -38,7 +38,7 @@ pip install -r python/requirements_win.txt
**运行OCR示例** **运行OCR示例**
``` ```
cd Serving/python/example/ocr cd Serving/examples/C++/PaddleOCR/ocr/
python -m paddle_serving_app.package --get_model ocr_rec python -m paddle_serving_app.package --get_model ocr_rec
tar -xzvf ocr_rec.tar.gz tar -xzvf ocr_rec.tar.gz
python -m paddle_serving_app.package --get_model ocr_det python -m paddle_serving_app.package --get_model ocr_det
...@@ -70,7 +70,7 @@ class YourWebService(WebService): ...@@ -70,7 +70,7 @@ class YourWebService(WebService):
your_service = YourService(name="XXX") your_service = YourService(name="XXX")
your_service.load_model_config("your_model_path") your_service.load_model_config("your_model_path")
your_service.prepare_server(workdir="workdir", port=9292) your_service.prepare_server(workdir="workdir", port=9292)
# 如果是GPU用户,可以参照python/examples/ocr下的python示例 # 如果是GPU用户,可以参照Serving/examples/Pipeline/PaddleOCR/ocr下的python示例
your_service.run_debugger_service() your_service.run_debugger_service()
# Windows平台不可以使用 run_rpc_service()接口 # Windows平台不可以使用 run_rpc_service()接口
your_service.run_web_service() your_service.run_web_service()
...@@ -97,7 +97,7 @@ r = requests.post(url=url, headers=headers, data=json.dumps(data)) ...@@ -97,7 +97,7 @@ r = requests.post(url=url, headers=headers, data=json.dumps(data))
print(r.json()) print(r.json())
``` ```
用户只需要按照如上指示,在对应函数中实现相关内容即可。更多信息请参见[如何开发一个新的Web Service?](./NEW_WEB_SERVICE_CN.md) 用户只需要按照如上指示,在对应函数中实现相关内容即可。更多信息请参见[如何开发一个新的Web Service?](./C++_Serving/Http_Service_CN.md)
开发完成后执行 开发完成后执行
......
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
此差异已折叠。
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册