Unverified · commit 1410fea6 · authored by Jiawei Wang · committed by GitHub

Merge pull request #1511 from bjjwwang/v0.7.0

cherry-pick 23 PRs
@@ -6,327 +6,151 @@
<br>
<p>
<p align="center">
    <br>
    <a href="https://travis-ci.com/PaddlePaddle/Serving">
        <img alt="Build Status" src="https://img.shields.io/travis/com/PaddlePaddle/Serving/develop?style=flat-square">
<img alt="Docs" src="https://img.shields.io/badge/docs-中文文档-brightgreen?style=flat-square">
<img alt="Release" src="https://img.shields.io/badge/release-0.7.0-blue?style=flat-square">
<img alt="Python" src="https://img.shields.io/badge/python-3.6+-blue?style=flat-square">
<img alt="License" src="https://img.shields.io/github/license/PaddlePaddle/Serving?color=blue&style=flat-square">
<img alt="Forks" src="https://img.shields.io/github/forks/PaddlePaddle/Serving?color=yellow&style=flat-square">
<img alt="Issues" src="https://img.shields.io/github/issues/PaddlePaddle/Serving?color=yellow&style=flat-square">
<img alt="Contributors" src="https://img.shields.io/github/contributors/PaddlePaddle/Serving?color=orange&style=flat-square">
<img alt="Community" src="https://img.shields.io/badge/join-Wechat,QQ,Slack-orange?style=flat-square">
    </a>
    <br>
<p>
***

The goal of Paddle Serving is to provide high-performance, flexible and easy-to-use industrial-grade online inference services for machine learning developers and enterprises. Paddle Serving supports multiple protocols such as RESTful, gRPC and bRPC, provides inference solutions for a variety of hardware and operating-system environments, and ships many examples based on well-known pre-trained models. The core features are as follows:
- Integrates the high-performance server-side inference engine Paddle Inference and the mobile-side engine Paddle Lite. Models from other machine learning platforms (Caffe/TensorFlow/ONNX/PyTorch) can be migrated to Paddle through [x2paddle](https://github.com/PaddlePaddle/X2Paddle).
- Offers two frameworks: high-performance C++ Serving and highly easy-to-use Python Pipeline. C++ Serving is based on the bRPC network framework and creates high-throughput, low-latency inference services, with performance indicators ahead of competing products. Python Pipeline is based on the gRPC/gRPC-Gateway network framework and the Python language, and builds a highly easy-to-use, high-throughput inference service. For how to choose between them, see [Technical Selection](doc/Serving_Design_EN.md).
- Supports multiple [protocols](doc/C++_Serving/Inference_Protocols_CN.md) such as HTTP, gRPC and bRPC, and provides SDKs for C++, Python and Java.
- Designs and implements a high-performance inference service framework of asynchronous pipelines based on a directed acyclic graph (DAG), with features such as multi-model composition, asynchronous scheduling, concurrent inference, dynamic batching, and multi-card, multi-stream inference.
- Adapts to a variety of commonly used computing hardware, such as x86 (Intel) CPU, ARM CPU, Nvidia GPU and Kunlun XPU; integrates the Intel MKL-DNN and Nvidia TensorRT acceleration libraries, as well as low-precision and quantized inference.
- Provides a model-security deployment solution, including encrypted-model deployment, an authentication mechanism and an HTTPS security gateway, which are used in practice.
- Supports cloud deployment, and provides a deployment case for Baidu Intelligent Cloud Kubernetes clusters.
- Provides more than 40 classic pre-trained model deployment examples, from suites such as PaddleOCR, PaddleClas, PaddleDetection, PaddleSeg, PaddleNLP and PaddleRec, with more models continually added.
- Supports distributed deployment of large-scale sparse-parameter index models, with features such as multiple tables, multiple shards, multiple replicas and a local high-frequency cache; deployable on a single machine or in the cloud.
<h2 align="center">Tutorial</h2>

- AIStudio tutorial (Chinese): [Paddle Serving服务化部署框架](https://www.paddlepaddle.org.cn/tutorials/projectdetail/1975340)
- Video tutorial (Chinese): [深度学习服务化部署-以互联网应用为例](https://aistudio.baidu.com/aistudio/course/introduce/19084)
<p align="center">
    <img src="doc/images/demo.gif" width="700">
</p>
<h2 align="center">Documentation</h2>
> Set up
This chapter guides you through installation and deployment. It is strongly recommended to deploy Paddle Serving with Docker; if you do not use Docker, simply skip the Docker-related steps. Paddle Serving can be deployed on cloud servers using Kubernetes and runs on many common hardware platforms such as ARM CPU, Intel CPU, Nvidia GPU and Kunlun XPU. The latest development kit of the develop branch is compiled and published every day for developers to use.
- [Install Paddle Serving using docker](doc/Install_EN.md)
- [Build Paddle Serving from Source with Docker](doc/Compile_EN.md)
- [Deploy Paddle Serving on Kubernetes](doc/Run_On_Kubernetes_CN.md)
- [Deploy Paddle Serving with Security gateway](doc/Serving_Auth_Docker_CN.md)
- [Deploy Paddle Serving on more hardwares](doc/Run_On_XPU_EN.md)
- [Latest Wheel packages](doc/Latest_Packages_CN.md) (updated daily on the develop branch)
> Use
The first step is to call the model-save interface to generate the model parameter configuration files (.prototxt), which are used on both the client and the server. The second step is to read the configuration and startup parameters and start the service. The third step is to write a client request based on the SDK, following the API documents and your use case, and test the inference service; a minimal end-to-end sketch follows the list below.
- [Quick Start](doc/Quick_Start_EN.md)
- [Save a servable model](doc/Save_EN.md)
- [Description of configuration and startup parameters](doc/Serving_Configure_EN.md)
- [Guide for RESTful/gRPC/bRPC APIs](doc/C++_Serving/Introduction_CN.md)
- [Inference on quantized models](doc/Low_Precision_CN.md)
- [Data format of classic models](doc/Process_data_CN.md)
- [C++ Serving](doc/C++_Serving/Introduction_CN.md)
- [protocols](doc/C++_Serving/Inference_Protocols_CN.md)
- [Hot loading models](doc/C++_Serving/Hot_Loading_EN.md)
- [A/B Test](doc/C++_Serving/ABTest_EN.md)
- [Encryption](doc/C++_Serving/Encryption_EN.md)
- [Analyze and optimize performance(Chinese)](doc/C++_Serving/Performance_Tuning_CN.md)
- [Benchmark(Chinese)](doc/C++_Serving/Benchmark_CN.md)
- [Python Pipeline](doc/Python_Pipeline/Pipeline_Design_EN.md)
- [Analyze and optimize performance](doc/Python_Pipeline/Pipeline_Design_EN.md)
- [Benchmark(Chinese)](doc/Python_Pipeline/Benchmark_CN.md)
- Client SDK
- [Python SDK(Chinese)](doc/C++_Serving/Http_Service_CN.md)
- [JAVA SDK](doc/Java_SDK_EN.md)
- [C++ SDK(Chinese)](doc/C++_Serving/Creat_C++Serving_CN.md)
- [Large-scale sparse parameter server](doc/Cube_Local_EN.md)
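To make the three steps concrete, here is a minimal end-to-end sketch under stated assumptions: `./inference_model` is a placeholder for your exported Paddle inference model directory, port 9393 is arbitrary, and the feed/fetch names must be replaced with the ones recorded in the generated `serving_client_conf.prototxt`.

```python
# Step 1 (save): convert an exported Paddle inference model into servable
# configs; this writes serving_server/ and serving_client/ folders.
# (Equivalent shell form: python3 -m paddle_serving_client.convert --dirname ./inference_model)
import numpy as np
from paddle_serving_client import Client
from paddle_serving_client.io import inference_model_to_serving

inference_model_to_serving("./inference_model")  # placeholder model directory

# Step 2 (start): launch the server in another shell, e.g.
#   python3 -m paddle_serving_server.serve --model serving_server --port 9393

# Step 3 (request): load the generated client config and send one request.
client = Client()
client.load_client_config("serving_client/serving_client_conf.prototxt")
client.connect(["127.0.0.1:9393"])
feed = {"x": np.ones((1, 13), dtype="float32")}    # placeholder feed name/shape
print(client.predict(feed=feed, fetch=["price"]))  # placeholder fetch name
```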
<br>

> Developers
For Paddle Serving developers, we provide extended documents such as custom OPs and level-of-detail (LOD) processing.
- [Custom Operators](doc/C++_Serving/OP_EN.md)
- [Processing LOD Data](doc/LOD_EN.md)
- [FAQ(Chinese)](doc/FAQ_CN.md)
<h2 align="center">Model Zoo</h2>

Paddle Serving works closely with the Paddle model suites and implements a large number of service deployment examples, including image classification, object detection, language and text recognition, Chinese part-of-speech tagging, sentiment analysis, content recommendation and other types of examples, for a total of 42 models.

<p align="center">

| PaddleOCR | PaddleDetection | PaddleClas | PaddleSeg | PaddleRec | Paddle NLP |
| :----: | :----: | :----: | :----: | :----: | :----: |
| 8 | 12 | 13 | 2 | 3 | 4 |

</p>

For more model examples, read the [Model zoo](doc/Model_Zoo_EN.md).

<center class="half">
    <img src="https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/doc/imgs_results/PP-OCRv2/PP-OCRv2-pic003.jpg?raw=true" width="345"/>
    <img src="doc/images/detection.png" width="350">
</center>

<h2 align="center">Installation</h2>

We **highly recommend** you to **run Paddle Serving in Docker**; please visit [Run in Docker](doc/RUN_IN_DOCKER.md). See the [document](doc/DOCKER_IMAGES.md) for more Docker images.

**Attention:** currently, the default GPU environment of paddlepaddle 2.1 is CUDA 10.2, so the GPU Docker sample code below is based on CUDA 10.2. We also provide Docker images and whl packages for other GPU environments; if you use a different environment, carefully check and select the appropriate version.

**Attention:** in the following, 'python' or 'pip' stands for one of Python 3.6/3.7/3.8.

```
# Run CPU Docker
docker pull registry.baidubce.com/paddlepaddle/serving:0.6.2-devel
docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:0.6.2-devel bash
docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
```

```
# Run GPU Docker
nvidia-docker pull registry.baidubce.com/paddlepaddle/serving:0.6.2-cuda10.2-cudnn8-devel
nvidia-docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:0.6.2-cuda10.2-cudnn8-devel bash
nvidia-docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
```
Install the Python dependencies:
```
cd Serving
pip3 install -r python/requirements.txt
```
```shell
pip3 install paddle-serving-client==0.6.2
pip3 install paddle-serving-server==0.6.2 # CPU
pip3 install paddle-serving-app==0.6.2
pip3 install paddle-serving-server-gpu==0.6.2.post102 #GPU with CUDA10.2 + TensorRT7
# DO NOT RUN ALL COMMANDS! check your GPU env and select the right one
pip3 install paddle-serving-server-gpu==0.6.2.post101 # GPU with CUDA10.1 + TensorRT6
pip3 install paddle-serving-server-gpu==0.6.2.post11 # GPU with CUDA11.0 + TensorRT7
```
You may need to use a domestic mirror source (in China, add `-i https://pypi.tuna.tsinghua.edu.cn/simple` to the pip command to use the Tsinghua mirror) to speed up the download.

If you need modules compiled with the develop branch, please download the packages from the [latest packages list](./doc/LATEST_PACKAGES.md) and install them with `pip install`. If you want to compile by yourself, please refer to [How to compile Paddle Serving?](./doc/COMPILE.md)

The paddle-serving-server and paddle-serving-server-gpu packages support CentOS 6/7, Ubuntu 16/18 and Windows 10.

The paddle-serving-client and paddle-serving-app packages support Linux and Windows, but paddle-serving-client only supports Python 3.6/3.7/3.8.

**For the latest version, CUDA 9.0 and CUDA 10.0 are no longer supported, and Python 2.7/3.5 is no longer supported.**

We recommend installing paddle >= 2.1.0:
```
# CPU users, please run
pip3 install paddlepaddle==2.1.0
# GPU Cuda10.2 please run
pip3 install paddlepaddle-gpu==2.1.0
```
**Note**: If your CUDA version is not 10.2, do not execute the above commands directly; refer to the [Paddle official documentation - multi-version whl package list](https://www.paddlepaddle.org.cn/documentation/docs/en/install/Tables_en.html#multi-version-whl-package-list-release),
select the URL for your GPU environment and install it. For example, a Python 3.6 user with CUDA 10.1 should select `cp36-cp36m` and the URL corresponding to `cuda10.1-cudnn7-mkl-gcc8.2-avx-trt6.0.1.5`, copy it and run:
```
pip3 install https://paddle-wheel.bj.bcebos.com/with-trt/2.1.0-gpu-cuda10.1-cudnn7-mkl-gcc8.2/paddlepaddle_gpu-2.1.0.post101-cp36-cp36m-linux_x86_64.whl
```
The default `paddlepaddle-gpu==2.1.0` is CUDA 10.2 without TensorRT. If you want to install PaddlePaddle with TensorRT, check the multi-version whl package list for the keyword `cuda10.2-cudnn8.0-trt7.1.3`. For more information, see [Paddle Serving uses TensorRT](./doc/TENSOR_RT.md).

For other environments and Python versions, find the corresponding link in the table and install it with pip.
For **Windows Users**, please read the document [Paddle Serving for Windows Users](./doc/WINDOWS_TUTORIAL.md)
<h2 align="center">Quick Start Example</h2>
This quick-start example is mainly for users who already have a model to deploy, and we also provide a model ready for deployment. If you want to know how to complete the whole process from offline training to online serving, please refer to the AIStudio tutorial above.
### Boston House Price Prediction model
Enter the Serving git directory and change to the `fit_a_line` example:
``` shell
cd Serving/python/examples/fit_a_line
sh get_data.sh
```
Paddle Serving provides both HTTP- and RPC-based services for users to access.
### RPC service
A user can start an RPC service with `paddle_serving_server.serve`. An RPC service is usually faster than an HTTP service, although it requires some coding against Paddle Serving's Python client API. Note that we do not specify `--name` here.
``` shell
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292
```
<center>
| Argument | Type | Default | Description |
| ---------------------------------------------- | ----- | ------- | ---------------------------------------------------------- |
| `thread` | int | `2` | Number of brpc service threads |
| `runtime_thread_num` | int[] | `0` | Thread number for each model in asynchronous mode |
| `batch_infer_size` | int[] | `32` | Batch size for each model in asynchronous mode |
| `gpu_ids` | str[] | `"-1"` | GPU card IDs for each model |
| `port` | int | `9292` | Exposed port of the current service |
| `model` | str[] | `""` | Path of the Paddle model directory to be served |
| `mem_optim_off` | - | - | Disable memory / graphics-memory optimization |
| `ir_optim` | bool | False | Enable analysis and optimization of the computation graph |
| `use_mkl` (Only for CPU version) | - | - | Run inference with MKL |
| `use_trt` (Only for TRT version) | - | - | Run inference with TensorRT |
| `use_lite` (Only for Intel x86 CPU or ARM CPU) | - | - | Run Paddle Lite inference |
| `use_xpu` | - | - | Run Paddle Lite inference with Baidu Kunlun XPU |
| `precision` | str | FP32 | Precision mode: FP32, FP16 or INT8 |
| `use_calib` | bool | False | Use TRT INT8 calibration |
| `gpu_multi_stream` | bool | False | Enable GPU multi-stream to get higher QPS |
#### Description of asynchronous mode

Asynchronous mode is suitable when (1) the number of requests is very large, or (2) multiple models are chained and you want to specify the concurrency of each model individually.

Asynchronous mode helps improve the throughput (QPS) of the service, but the latency of a single request increases slightly.

In asynchronous mode, each model starts the number of threads you specify, and each thread holds one model instance; in other words, each model is equivalent to a thread pool with N threads, and tasks are taken from the pool's task queue for execution.

In asynchronous mode, each RPC server thread is only responsible for putting requests into the task queue of a model's thread pool; after a task is executed, the completed task is removed from the queue.

In the table above, the number of RPC server threads is specified by `--thread` and defaults to 2.
`--runtime_thread_num` specifies the number of threads N in each model's thread pool; the default is 0, meaning asynchronous mode is not used.
`--batch_infer_size` specifies the maximum batch size for each model; the default is 32. It takes effect only when `--runtime_thread_num` is not 0.
#### When you want a model to use multiple GPU cards
```
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --gpu_ids 0,1,2
```
#### When you want to serve two models
```
python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292
```
#### When you want to serve two models, each using multiple GPU cards
```
python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 --gpu_ids 0,1 1,2
```
#### When a service contains two models, each model uses multiple GPU cards, asynchronous mode is needed, and each model has a different concurrency
```
python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 --gpu_ids 0,1 1,2 --runtime_thread_num 4 8
```
</center>
```python
# A user can visit rpc service through paddle_serving_client API
from paddle_serving_client import Client
import numpy as np
client = Client()
client.load_client_config("uci_housing_client/serving_client_conf.prototxt")
client.connect(["127.0.0.1:9292"])
data = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727,
-0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]
fetch_map = client.predict(feed={"x": np.array(data).reshape(1,13,1)}, fetch=["price"])
print(fetch_map)
```
Here, the `client.predict` function has two arguments. `feed` is a Python dict mapping model input variable alias names to values. `fetch` specifies the prediction variables to be returned from the server. In the example, the names `"x"` and `"price"` were assigned when the servable model was saved during training.
### WEB service
Users can also put the data-format processing logic on the server side, so that the service can be accessed directly with curl. See the following case in `python/examples/fit_a_line`:
```
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --name uci
```
On the client side, run:
```
curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"x": [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]}], "fetch":["price"]}' http://127.0.0.1:9292/uci/prediction
```
The response is:
```
{"result":{"price":[[18.901151657104492]]}}
```
<h3 align="center">Pipeline Service</h3>
Paddle Serving provides industry-leading multi-model pipeline services, which strongly support the real-world business scenarios of major companies; please refer to [OCR word recognition](./python/examples/pipeline/ocr).
First, download the two models:
```
python3 -m paddle_serving_app.package --get_model ocr_rec
tar -xzvf ocr_rec.tar.gz
python3 -m paddle_serving_app.package --get_model ocr_det
tar -xzvf ocr_det.tar.gz
```
Then start the server side, launching the two chained models as one standalone web service:
```
python3 web_service.py
```
Send an HTTP request:
```
python3 pipeline_http_client.py
```
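For reference, the gist of such a pipeline HTTP client is small enough to sketch here; the port `9999`, URL path `/ocr/prediction` and image path are assumptions based on the typical OCR pipeline example layout, so check `web_service.py` in your copy for the actual values.

```python
# Minimal OCR pipeline HTTP client sketch (endpoint and image path assumed).
import base64
import requests

with open("imgs/1.jpg", "rb") as f:  # hypothetical test image
    image = base64.b64encode(f.read()).decode("utf8")

data = {"key": ["image"], "value": [image]}
resp = requests.post("http://127.0.0.1:9999/ocr/prediction", json=data)
print(resp.json())
```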
Or send a gRPC request:
```
python3 pipeline_rpc_client.py
```
The output is:
```
{'err_no': 0, 'err_msg': '', 'key': ['res'], 'value': ["['土地整治与土壤修复研究中心', '华南农业大学1素图']"]}
```
<h3 align="center">Stop Serving/Pipeline service</h3>
**Method one**: press Ctrl+C to quit.

**Method two**: in the directory where the Serving/Pipeline service was started, or in the directory set by the environment variable SERVING_HOME (the file ProcessInfo.json exists in that directory), run:
```
python3 -m paddle_serving_server.serve stop
```
<h2 align="center">Community</h2>
If you want to communicate with developers and other users, you are welcome to join the community through the following channels.

### WeChat
- Scan the QR code below with WeChat

<center class="half">
    <img src="doc/images/wechat_group_1.jpeg" width="250">
</center>

### QQ
- PaddlePaddle inference deployment group (Group No.: 697765514)

<center class="half">
    <img src="doc/images/qq_group_1.png" width="200">
</center>

### Slack
- [Slack channel](https://paddleserving.slack.com/archives/CUBPKHKMJ)
### Contribution - [Slack channel](https://paddleserving.slack.com/archives/CUBPKHKMJ)
If you want to contribute code to Paddle Serving, please reference [Contribution Guidelines](doc/CONTRIBUTE.md) > Contribution
- Special Thanks to [@BeyondYourself](https://github.com/BeyondYourself) in complementing the gRPC tutorial, updating the FAQ doc and modifying the mdkir command If you want to contribute code to Paddle Serving, please reference [Contribution Guidelines](doc/Contribute_EN.md)
- Special Thanks to [@mcl-stone](https://github.com/mcl-stone) in updating faster_rcnn benchmark - Thanks to [@loveululu](https://github.com/loveululu) for providing python API of Cube.
- Special Thanks to [@cg82616424](https://github.com/cg82616424) in updating the unet benchmark and modifying resize comment error - Thanks to [@EtachGu](https://github.com/EtachGu) in updating run docker codes.
- Special Thanks to [@cuicheng01](https://github.com/cuicheng01) for providing 11 PaddleClas models - Thanks to [@BeyondYourself](https://github.com/BeyondYourself) in complementing the gRPC tutorial, updating the FAQ doc and modifying the mdkir command
- Thanks to [@mcl-stone](https://github.com/mcl-stone) in updating faster_rcnn benchmark
- Thanks to [@cg82616424](https://github.com/cg82616424) in updating the unet benchmark modifying resize comment error
- Thanks to [@cuicheng01](https://github.com/cuicheng01) for providing 11 PaddleClas models
> Feedback

For any feedback or to report a bug, please file a [GitHub Issue](https://github.com/PaddlePaddle/Serving/issues).

> License

[Apache 2.0 License](https://github.com/PaddlePaddle/Serving/blob/develop/LICENSE)
@@ -6,330 +6,148 @@
<br>
<p>
<p align="center">
    <br>
    <a href="https://travis-ci.com/PaddlePaddle/Serving">
        <img alt="Build Status" src="https://img.shields.io/travis/com/PaddlePaddle/Serving/develop?style=flat-square">
        <img alt="Docs" src="https://img.shields.io/badge/docs-中文文档-brightgreen?style=flat-square">
        <img alt="Release" src="https://img.shields.io/badge/release-0.7.0-blue?style=flat-square">
        <img alt="Python" src="https://img.shields.io/badge/python-3.6+-blue?style=flat-square">
        <img alt="License" src="https://img.shields.io/github/license/PaddlePaddle/Serving?color=blue&style=flat-square">
        <img alt="Forks" src="https://img.shields.io/github/forks/PaddlePaddle/Serving?color=yellow&style=flat-square">
        <img alt="Issues" src="https://img.shields.io/github/issues/PaddlePaddle/Serving?color=yellow&style=flat-square">
        <img alt="Contributors" src="https://img.shields.io/github/contributors/PaddlePaddle/Serving?color=orange&style=flat-square">
        <img alt="Community" src="https://img.shields.io/badge/join-Wechat,QQ,Slack-orange?style=flat-square">
    </a>
    <br>
<p>
***

Paddle Serving, backed by the deep learning framework PaddlePaddle, aims to help deep learning developers and enterprises provide high-performance, flexible and easy-to-use industrial-grade online inference services. Paddle Serving supports multiple protocols such as RESTful, gRPC and bRPC, provides inference solutions for a variety of heterogeneous hardware and operating-system environments, and ships many classic pre-trained model examples. The core features are as follows:

- Integrates the high-performance server-side inference engine Paddle Inference and the mobile-side engine Paddle Lite; models from other machine learning platforms (Caffe/TensorFlow/ONNX/PyTorch) can be migrated via the [x2paddle](https://github.com/PaddlePaddle/X2Paddle) tool.
- Offers two frameworks: high-performance C++ Serving and highly easy-to-use Python Pipeline. The C++ framework is built on the high-performance bRPC network framework for high-throughput, low-latency inference services, with performance ahead of competing products. The Python framework is built on the gRPC/gRPC-Gateway network framework and the Python language for a highly easy-to-use, high-throughput inference service framework. For technology selection, see [Technical Selection](doc/Serving_Design_CN.md).
- Supports multiple [protocols](doc/C++_Serving/Inference_Protocols_CN.md) such as HTTP, gRPC and bRPC; provides C++, Python and Java SDKs.
- Designs and implements a DAG-based asynchronous-pipeline high-performance inference framework, with multi-model composition, asynchronous scheduling, concurrent inference, dynamic batching, and multi-card, multi-stream inference.
- Adapts to multiple kinds of hardware such as x86 (Intel) CPU, ARM CPU, Nvidia GPU and Kunlun XPU; integrates the Intel MKL-DNN and Nvidia TensorRT acceleration libraries, plus low-precision and quantized inference.
- Provides a model-security deployment solution, including encrypted-model deployment, authentication and an HTTPS security gateway, applied in real projects.
- Supports cloud deployment, and provides a case of deploying Paddle Serving on a Baidu Intelligent Cloud Kubernetes cluster.
- Provides rich classic pre-trained model deployment examples from suites such as PaddleOCR, PaddleClas, PaddleDetection, PaddleSeg, PaddleNLP and PaddleRec, more than 40 high-quality pre-trained models in total, with more continually added.
- Supports distributed deployment of large-scale sparse-parameter index models, with features such as multiple tables, multiple shards, multiple replicas and a local high-frequency cache, deployable on a single machine or in the cloud.

<h2 align="center">Tutorial</h2>

- AIStudio tutorial (Chinese): [Paddle Serving服务化部署框架](https://www.paddlepaddle.org.cn/tutorials/projectdetail/1975340)
- Video tutorial (Chinese): [深度学习服务化部署-以互联网应用为例](https://aistudio.baidu.com/aistudio/course/introduce/19084)
<p align="center">
    <img src="doc/images/demo.gif" width="700">
</p>
<h2 align="center">Documentation</h2>

***

> Deployment

This chapter guides you through installation and deployment. We strongly recommend deploying Paddle Serving with Docker; if you do not use Docker, skip the Docker-related steps. Paddle Serving can be deployed with Kubernetes on cloud servers, and the documents below cover building and running on heterogeneous hardware such as ARM CPU and Kunlun XPU. The latest develop-branch development kit is compiled and published every day for developers.

- [Install Paddle Serving with Docker](doc/Install_CN.md)
- [Build Paddle Serving from source](doc/Compile_CN.md)
- [Deploy Paddle Serving on a Kubernetes cluster](doc/Run_On_Kubernetes_CN.md)
- [Deploy the Paddle Serving security gateway](doc/Serving_Auth_Docker_CN.md)
- [Deploy Paddle Serving on heterogeneous hardware](doc/Run_On_XPU_CN.md)
- [Latest wheel packages](doc/Latest_Packages_CN.md) (updated daily from the develop branch)

> Usage

After installing Paddle Serving, the quick start guides you through running Serving. Step 1: call the model-save interface to generate the model parameter configuration files (.prototxt) used by the client and the server. Step 2: read the configuration and startup parameters and start the service. Step 3: write a client request based on the SDK according to the API and your scenario, and test the inference service. To learn more features and usage scenarios, read the documents below.

- [Quick start](doc/Quick_Start_CN.md)
- [Save models and configurations for Paddle Serving](doc/Save_CN.md)
- [Description of configuration and startup parameters](doc/Serving_Configure_CN.md)
- [RESTful/gRPC/bRPC API guide](doc/C++_Serving/Introduction_CN.md#4.Client端特性)
- [Low-precision inference](doc/Low_Precision_CN.md)
- [Data processing for common models](doc/Process_data_CN.md)
- [Introduction to C++ Serving](doc/C++_Serving/Introduction_CN.md)
- [Protocols](doc/C++_Serving/Inference_Protocols_CN.md)
- [Model hot loading](doc/C++_Serving/Hot_Loading_CN.md)
- [A/B Test](doc/C++_Serving/ABTest_CN.md)
- [Encrypted-model inference service](doc/C++_Serving/Encryption_CN.md)
- [Performance tuning guide](doc/C++_Serving/Performance_Tuning_CN.md)
- [Benchmarks](doc/C++_Serving/Benchmark_CN.md)
- [Introduction to Python Pipeline](doc/Python_Pipeline/Pipeline_Design_CN.md)
- [Performance tuning guide](doc/Python_Pipeline/Pipeline_Design_CN.md)
- [Benchmarks](doc/Python_Pipeline/Benchmark_CN.md)
- Client SDKs
- [Python SDK](doc/C++_Serving/Http_Service_CN.md)
- [Java SDK](doc/Java_SDK_CN.md)
- [C++ SDK](doc/C++_Serving/Creat_C++Serving_CN.md)
- [Large-scale sparse-parameter index service](doc/Cube_Local_CN.md)

> Developers

For Paddle Serving developers, we provide documents on custom OPs and variable-length (LOD) data processing.

- [Custom OPs](doc/C++_Serving/OP_CN.md)
- [Variable-length (LOD) data processing](doc/LOD_CN.md)
- [FAQ](doc/FAQ_CN.md)

<h2 align="center">Model Zoo</h2>

Paddle Serving works closely with the Paddle model suites and implements a large number of service deployment examples, including image classification, object detection, language and text recognition, Chinese part-of-speech tagging, sentiment analysis and content recommendation, as well as full Paddle pipeline projects, 42 models in total.
<p align="center">

| PaddleOCR | PaddleDetection | PaddleClas | PaddleSeg | PaddleRec | Paddle NLP |
| :----: | :----: | :----: | :----: | :----: | :----: |
| 8 | 12 | 13 | 2 | 3 | 4 |

</p>

For more model examples, see the [Model Zoo](doc/Model_Zoo_CN.md).

<p align="center">
    <img src="https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/doc/imgs_results/PP-OCRv2/PP-OCRv2-pic003.jpg?raw=true" width="345"/>
    <img src="doc/images/detection.png" width="350">
</p>

<h2 align="center">Community</h2>
If you want to communicate with developers and other users, you are welcome to join the community through the following channels.

### WeChat
- Scan the QR code below with WeChat

<p align="center">
    <img src="doc/images/wechat_group_1.jpeg" width="250">
</p>

### QQ
- PaddlePaddle inference deployment group (Group No.: 697765514)

<p align="center">
    <img src="doc/images/qq_group_1.png" width="200">
</p>

### Slack
- [Slack channel](https://paddleserving.slack.com/archives/CUBPKHKMJ)
> Contribution

If you want to contribute code to Paddle Serving, please refer to the [Contribution Guidelines](doc/Contribute_EN.md).

- Thanks to [@loveululu](https://github.com/loveululu) for providing the Python API of Cube.
- Thanks to [@EtachGu](https://github.com/EtachGu) for updating the docker run commands.
- Thanks to [@BeyondYourself](https://github.com/BeyondYourself) for the gRPC tutorial, FAQ updates and directory cleanup.
- Thanks to [@mcl-stone](https://github.com/mcl-stone) for providing the faster_rcnn benchmark script.
- Thanks to [@cg82616424](https://github.com/cg82616424) for providing the unet benchmark script and fixing comment errors.
- Thanks to [@cuicheng01](https://github.com/cuicheng01) for providing 11 PaddleClas models.

> Feedback

For any feedback or to report a bug, please file a [GitHub Issue](https://github.com/PaddlePaddle/Serving/issues).

> License

[Apache 2.0 License](https://github.com/PaddlePaddle/Serving/blob/develop/LICENSE)
@@ -54,7 +54,7 @@ Hwvideoframe provides a variety of data preprocessing methods for photo preprocessing

## Quick start

[After compiling from source](../../../doc/Compile_EN.md), this project will be stored in reader.

## How to Test
## Build Bert-As-Service in 10 minutes
([简体中文](./BERT_10_MINS_CN.md)|English)
The goal of Bert-As-Service is that, given a sentence, the service can represent the sentence as a semantic vector and return it to the user. The [BERT model](https://arxiv.org/abs/1810.04805) is a popular model in the current NLP field; it has achieved good results on a variety of public NLP tasks, and the semantic vectors it computes can be used as input to other NLP models, which also greatly improves their performance. Bert-As-Service lets users easily obtain the semantic vector representation of text and apply it to their own tasks. To achieve this goal, we show in five steps how to build such a service with Paddle Serving in ten minutes. All the code and files in the example can be found in the [example](https://github.com/PaddlePaddle/Serving/tree/develop/python/examples/bert) of Paddle Serving.

If your Python version is 3.X, replace 'pip' with 'pip3' and 'python' with 'python3' in the following commands.
### Step1: Getting Model
#### method 1:
This example use model [BERT Chinese Model](https://www.paddlepaddle.org.cn/hubdetail?name=bert_chinese_L-12_H-768_A-12&en_category=SemanticModel) from [Paddlehub](https://github.com/PaddlePaddle/PaddleHub).
Install paddlehub first
```
pip install paddlehub
```
Then run:
```
python prepare_model.py 128
```
**PaddleHub only supports Python 3.5+.**

The 128 in the command above is the max_seq_len of the BERT model, i.e. the length of a sample after preprocessing.

The config file and model files for the server side are saved in the folder bert_seq128_model.

The config file generated for the client side is saved in the folder bert_seq128_client.
#### method 2:
You can also download the above model from BOS (max_seq_len=128). After decompression, the config file and model files for the server side are stored in the bert_chinese_L-12_H-768_A-12_model folder, and the config file generated for the client side is stored in the bert_chinese_L-12_H-768_A-12_client folder:
```shell
wget https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SemanticModel/bert_chinese_L-12_H-768_A-12.tar.gz
tar -xzf bert_chinese_L-12_H-768_A-12.tar.gz
mv bert_chinese_L-12_H-768_A-12_model bert_seq128_model
mv bert_chinese_L-12_H-768_A-12_client bert_seq128_client
```
### Step2: Getting Dict and Sample Dataset
```
sh get_data.sh
```
This script downloads the Chinese dictionary file vocab.txt and the Chinese sample data data-c.txt.
### Step3: Launch Service
To start the CPU inference service, run:
```
python -m paddle_serving_server.serve --model bert_seq128_model/ --port 9292 #cpu inference service
```
Or, to start the GPU inference service, run:
```
python -m paddle_serving_server.serve --model bert_seq128_model/ --port 9292 --gpu_ids 0 #launch gpu inference service at GPU 0
```
| Parameters | Meaning |
| ---------- | ---------------------------------------- |
| model | server configuration and model file path |
| thread | server-side threads |
| port | server port number |
| gpu_ids | GPU index number |
### Step4: data preprocessing logic on Client Side
Paddle Serving has many built-in data preprocessing utilities. To compute Chinese BERT semantic representations, we use the ChineseBertReader class under paddle_serving_app for data preprocessing, so that developers can easily obtain the model input fields corresponding to a raw Chinese sentence. A short sketch follows the install command below.
Install paddle_serving_app
```shell
pip install paddle_serving_app
```
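As a rough sketch of how the reader is used (the exact field names and shapes come from the saved client config, so treat them as assumptions):

```python
# Turn one raw Chinese sentence into BERT model input fields.
from paddle_serving_app.reader import ChineseBertReader

reader = ChineseBertReader({"max_seq_len": 128})
feed_dict = reader.process("送您一个问候")
print(feed_dict.keys())  # e.g. input_ids / position_ids / segment_ids / input_mask
```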
### Step5: Client Visit Serving
#### method 1: RPC Inference
Run
```
head data-c.txt | python bert_client.py --model bert_seq128_client/serving_client_conf.prototxt
```
The client reads data from data-c.txt and sends prediction requests; the predictions are returned as word vectors (which, given their size, we do not print).
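A condensed sketch of that client logic is below; the (128, 1) reshape mirrors the typical saved config for this model but is an assumption, and the authoritative version is bert_client.py in the example folder.

```python
# Condensed BERT RPC client sketch (field shapes are assumptions).
import sys
import numpy as np
from paddle_serving_client import Client
from paddle_serving_app.reader import ChineseBertReader

reader = ChineseBertReader({"max_seq_len": 128})
client = Client()
client.load_client_config("bert_seq128_client/serving_client_conf.prototxt")
client.connect(["127.0.0.1:9292"])

for line in sys.stdin:
    feed_dict = reader.process(line.strip())
    for key in feed_dict:  # flat lists -> (128, 1) tensors expected by the server
        feed_dict[key] = np.array(feed_dict[key]).reshape((128, 1))
    result = client.predict(feed=feed_dict, fetch=["pooled_output"])
```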
#### method 2: HTTP Inference
This method takes two steps:

1. Start an HTTP prediction server.

To start the CPU HTTP inference service, run:
```
python bert_web_service.py bert_seq128_model/ 9292 #launch cpu inference service
```
Or, to start the GPU HTTP inference service, first run:
```
export CUDA_VISIBLE_DEVICES=0,1
```
to set the environment variable that specifies which GPUs are used (the command above uses GPU 0 and GPU 1), and then run:
```
python bert_web_service_gpu.py bert_seq128_model/ 9292 #launch gpu inference service
```
2. Send a prediction request via HTTP:
```
curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"words": "hello"}], "fetch":["pooled_output"]}' http://127.0.0.1:9292/bert/prediction
```
### Benchmark
We tested the performance of Bert-As-Service on Paddle Serving using V100 GPUs and compared it with a TensorFlow-based Bert-As-Service. From the user-configuration perspective, we used the same batch size and concurrency for the stress test. The overall throughput obtained with 4 V100s is as follows.
![4v100_bert_as_service_benchmark](images/4v100_bert_as_service_benchmark.png)
<!--
yum install -y libXext libSM libXrender
pip install paddlehub paddle_serving_server paddle_serving_client
sh pip_app.sh
python bert_10.py
sh server.sh &
wget https://paddle-serving.bj.bcebos.com/bert_example/data-c.txt --no-check-certificate
head -n 500 data-c.txt > data.txt
cat data.txt | python bert_client.py
if [[ $? -eq 0 ]]; then
echo "test success"
else
echo "test fail"
fi
ps -ef | grep "paddle_serving_server" | grep -v grep | awk '{print $2}' | xargs kill
-->
# ABTEST in Paddle Serving

([简体中文](./ABTest_CN.md)|English)

This document uses an example of a text classification task based on the IMDB dataset to show how to build an A/B Test framework using Paddle Serving. The structural relationship between the client and servers in the example is shown in the figure below.

<img src="../images/abtest.png" style="zoom:25%;" />

Note that A/B Test is only applicable to RPC mode, not web mode.

### Download Data and Models

```shell
cd Serving/examples/C++/imdb
sh get_data.sh
```

@@ -25,13 +25,13 @@ pip install Shapely

You can directly run the following command to process the data.

[python abtest_get_data.py](../../examples/C++/imdb/abtest_get_data.py)

The Python code in the file will process the data `test_data/part-0` and write it to the `processed.data` file.

### Start Server

Here, we [use docker](../Run_In_Docker_EN.md) to start the server-side service.

First, start the BOW server, which enables the `8000` port:

@@ -63,7 +63,7 @@ Before running, use `pip install paddle-serving-client` to install the paddle-serving-client package.

You can directly use the following command to make an A/B test prediction.

[python abtest_client.py](../../examples/C++/imdb/abtest_client.py)

[//file]:#abtest_client.py
``` python
```

@@ -103,8 +103,8 @@ Due to different network conditions, the results of each prediction may be slightly different.

<!--
cp ../../examples/C++/imdb/get_data.sh .
cp ../../examples/C++/imdb/imdb_reader.py .
pip install -U paddle_serving_server
pip install -U paddle_serving_client
pip install -U paddlepaddle
-->
# C++ Serving vs TensorFlow Serving 性能对比
# 1. 测试环境和说明
1) GPU型号:Tesla P4(7611 Mib)
2) Cuda版本:11.0
3) 模型:ResNet_v2_50
4) 为了测试异步合并batch的效果,测试数据中batch=1
5) [使用的测试代码和使用的数据集](../../examples/C++/PaddleClas/resnet_v2_50)
6) 下图中蓝色是C++ Serving,灰色为TF-Serving。
7) 折线图为QPS,数值越大表示每秒钟处理的请求数量越大,性能就越好。
8) 柱状图为平均处理时延,数值越大表示单个请求处理时间越长,性能就越差。
# 2. 同步模式
均使用同步模式,默认参数配置。
可以看出同步模型默认参数配置情况下,C++Serving QPS和平均时延指标均优于TF-Serving。
<p align="center">
<br>
<img src='../images/syn_benchmark.png'>
<br>
<p>
|client_num | model_name | qps(samples/s) | mean(ms) | model_name | qps(samples/s) | mean(ms) |
| --- | --- | --- | --- | --- | --- | --- |
| 10 | pd-serving | 111.336 | 89.787| tf-serving| 84.632| 118.13|
|30 |pd-serving |165.928 |180.761 |tf-serving |106.572 |281.473|
|50| pd-serving| 207.244| 241.211| tf-serving| 80.002 |624.959|
|70 |pd-serving |214.769 |325.894 |tf-serving |105.17 |665.561|
|100| pd-serving| 235.405| 424.759| tf-serving| 93.664 |1067.619|
|150 |pd-serving |239.114 |627.279 |tf-serving |86.312 |1737.848|
# 3. Asynchronous Mode
Both use asynchronous mode, with max batch = 32 and 2 asynchronous threads.
It can be seen that in asynchronous mode the two perform similarly, but when client-side concurrency reaches 70, the TF-Serving service simply times out, while C++ Serving still returns results normally.
Meanwhile, comparing the synchronous and asynchronous modes shows that when the request batch size is small, asynchronous mode effectively improves QPS and average latency by merging batches.
<p align="center">
<br>
<img src='../images/asyn_benchmark.png'>
<br>
<p>
|client_num | model_name | qps(samples/s) | mean(ms) | model_name | qps(samples/s) | mean(ms) |
| --- | --- | --- | --- | --- | --- | --- |
|10| pd-serving| 130.631| 76.502| tf-serving |172.64 |57.916|
|30| pd-serving| 201.062| 149.168| tf-serving| 241.669| 124.128|
|50| pd-serving| 286.01| 174.764| tf-serving |278.744 |179.367|
|70| pd-serving| 313.58| 223.187| tf-serving| 298.241| 234.7|
|100| pd-serving| 323.369| 309.208| tf-serving| 0| ∞|
|150| pd-serving| 328.248| 456.933| tf-serving| 0| ∞|
...@@ -75,15 +75,16 @@ service ImageClassifyService {
#### 2.2.2 Example Configuration
For details on the server-side configuration, see [Serving-side Configuration](../Serving_Configure_CN.md).
The following configuration file chains ReaderOP, ClassifyOP and WriteJsonOP into one workflow (for concepts such as OP and workflow, see [OP Introduction](./OP_CN.md) and [DAG Introduction](./DAG_CN.md)):
- Example configuration file:
**Add the file serving/conf/service.prototxt**
```shell
services {
  name: "ImageClassifyService"
  workflows: "workflow1"
...@@ -310,7 +311,7 @@ api.thrd_finalize();
api.destroy();
```
For a concrete implementation, see the example provided by C++ Serving: sdk-cpp/demo/ximage.cpp.
### 3.3 Linking
...@@ -392,4 +393,4 @@ predictors {
}
}
```
For detailed client-side configuration options, see [CLIENT CONFIGURATION](./Client_Configure_CN.md).
# Computation Graph on the Server
(简体中文|[English](./DAG_EN.md))
This document presents the concept of the computation graph on the server side, how to define a computation graph with PaddleServing built-in operators, and some examples of sequential execution logic.
...@@ -9,7 +9,7 @@
Deep neural networks usually have some preprocessing steps on the input data and some postprocessing steps on the model's inference scores. Since deep learning frameworks are now very flexible, preprocessing and postprocessing can be done outside the training computation graph. If we want to preprocess input data and postprocess inference results on the server side, the corresponding computation logic must be added on the server. Moreover, if a user wants to run inference with the same input on multiple models, the best approach is to run the inference concurrently on the server side for a single client request, which saves some network overhead. For these two reasons, a Directed Acyclic Graph (DAG) is naturally the main computation method for server-side inference. An example of a DAG is as follows:
<center>
<img src='../images/server_dag.png' width = "450" height = "500" align="middle"/>
</center>
## How to Define a Node
...@@ -18,7 +18,7 @@
PaddleServing provides some predefined computation nodes in the framework. A very commonly used computation graph is the simple reader-infer-response pattern, which covers most single-model inference scenarios. The example graph and the corresponding DAG definition code are as follows.
<center>
<img src='../images/simple_dag.png' width = "260" height = "370" align="middle"/>
</center>
``` python
...@@ -47,10 +47,10 @@ python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --po
### Nodes with Multiple Inputs
[Model Ensemble in Paddle Serving](./Model_Ensemble_CN.md) gives an example with multiple input nodes; the diagram and code are as follows.
<center>
<img src='../images/complex_dag.png' width = "480" height = "400" align="middle"/>
</center>
```python
......
# Computation Graph On Server
([简体中文](./DAG_CN.md)|English)
This document shows the concept of the computation graph on the server, how to define a computation graph with PaddleServing built-in operators, and examples of some sequential execution logic.
...@@ -9,7 +9,7 @@ This document shows the concept of computation graph on server. How to define co
Deep neural nets often have some preprocessing steps on input data, and postprocessing steps on model inference scores. Since deep learning frameworks are now very flexible, it is possible to do preprocessing and postprocessing outside the training computation graph. If we want to do input data preprocessing and inference result postprocessing on the server side, we have to add the corresponding computation logic on the server. Moreover, if a user wants to do inference with the same inputs on more than one model, the best way is to do the inference concurrently on the server side given only one client request, so that we can save some network computation overhead. For the above two reasons, it is natural to think of a Directed Acyclic Graph (DAG) as the main computation method for server inference. One example of a DAG is as follows:
<center>
<img src='../images/server_dag.png' width = "450" height = "500" align="middle"/>
</center>
## How to define Node
...@@ -19,7 +19,7 @@ Deep neural nets often have some preprocessing steps on input data, and postproc
PaddleServing has some predefined Computation Nodes in the framework. A very commonly used computation graph is the simple reader-inference-response mode that can cover most of the single-model inference scenarios. An example graph and the corresponding DAG definition code are as follows.
<center>
<img src='../images/simple_dag.png' width = "260" height = "370" align="middle"/>
</center>
``` python
...@@ -48,10 +48,10 @@ python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --po
### Nodes with multiple inputs
An example containing multiple input nodes is given in [Model_Ensemble](./Model_Ensemble_EN.md). An example graph and the corresponding DAG definition code are as follows.
<center>
<img src='../images/complex_dag.png' width = "480" height = "400" align="middle"/>
</center>
```python
......
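A minimal sketch of such a multi-input graph, assuming the `OpMaker`/`OpGraphMaker` interface of `paddle_serving_server`; the op and node names below are illustrative:

```python
import paddle_serving_server as serving

op_maker = serving.OpMaker()
read_op = op_maker.create('general_reader')
cnn_infer_op = op_maker.create('general_infer', node_name='cnn')
bow_infer_op = op_maker.create('general_infer', node_name='bow')
response_op = op_maker.create('general_response')

# Both infer nodes consume the reader's output; the response node depends
# on both, giving the diamond-shaped DAG shown in the figure above.
op_graph_maker = serving.OpGraphMaker()
op_graph_maker.add_op(read_op)
op_graph_maker.add_op(cnn_infer_op, dependent_nodes=[read_op])
op_graph_maker.add_op(bow_infer_op, dependent_nodes=[read_op])
op_graph_maker.add_op(response_op, dependent_nodes=[cnn_infer_op, bow_infer_op])
```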
# Encrypted Model Inference
(简体中文|[English](./Encryption_EN.md))
Paddle Serving provides encrypted model inference; this document describes the details.
...@@ -12,7 +12,7 @@ Padle Serving提供了模型加密预测功能,本文档显示了详细信息
A normal model and its parameters can be viewed as a string. By applying an encryption algorithm to them (with your key as the parameter), they become an encrypted model and parameters.
We provide a simple demo for encrypting a model; see [examples/C++/encryption/encrypt.py](../../examples/C++/encryption/encrypt.py).
### Start the Encryption Service
...@@ -40,5 +40,4 @@ python -m paddle_serving_server.serve --model encrypt_server/ --port 9300 --use_
### Example of Encrypted Model Inference
For an example of encrypted model inference, see [examples/C++/encryption/](../../examples/C++/encryption/).
# MODEL ENCRYPTION INFERENCE
([简体中文](./Encryption_CN.md)|English)
Paddle Serving provides model encryption inference; this document shows the details.
...@@ -12,7 +12,7 @@ We use symmetric encryption algorithm to encrypt the model. Symmetric encryption
A normal model and its parameters can be understood as a string; by applying an encryption algorithm to them (with your key as the parameter), the normal model and parameters become encrypted ones.
We provide a simple demo to encrypt the model. See [examples/C++/encryption/encrypt.py](../../examples/C++/encryption/encrypt.py).
### Start Encryption Service
...@@ -40,5 +40,4 @@ Once the server gets the key, it uses the key to parse the model and starts the
### Example of Model Encryption Inference
For an example of model encryption inference, see [examples/C++/encryption/](../../examples/C++/encryption/).
# C++ Serving Framework Performance Test
Taking a text classification task as an example, this article builds a Serving inference service and presents performance data for the Serving framework:
1) Net overhead of the Serving framework
2) Single-thread latency, QPS, accuracy and other metrics of the inference service under different models, compared with standalone mode
3) Scalability of Serving under different models
# 1. Time Breakdown of a Single Serving Request
The figure below is an incomplete analysis of the time-consuming stages of a serving request. For the brpc overhead, only the bthread creation and startup costs are listed.
![](../images/serving-timings.png)
(Right-click to view the full-size image in a new window)
Compared with standalone mode:
1) Fill PaddleTensor from the raw sample (a few μs to tens of μs)
2) Fill LoDTensor from PaddleTensor (a few μs to tens of μs)
3) Inference (tens of μs to hundreds of ms)
4) Fill PaddleTensor from LoDTensor (a few μs to tens of μs)
5) Read the prediction result from PaddleTensor (a few μs to tens of μs)
Compared with standalone mode, serving mode adds:
1) Protobuf data construction, serialization and deserialization (a few μs to tens of μs)
2) Network communication (around ten μs locally, 500 μs to tens of ms remotely)
3) bthread creation and scheduling, etc. (around ten μs)
From the client side (total time T2 in the figure), the ratio of the time added by serving mode to the inference time is closely related to the overall system throughput observed by the client:
1) When inference takes 10+ ms to hundreds of ms (e.g., the CNN model for text classification) and serving mode adds only a few ms, the throughput observed by the client is almost the same as in standalone mode.
2) When inference takes only a few μs to tens of μs (e.g., the BOW model for text classification) and serving mode adds a few ms, the throughput observed by the client drops to 20% of standalone mode or even lower.
**To verify the above hypothesis, the serving-mode tests of the text classification task are run on several different models, recording the change in client-side throughput under serving mode for each.**
# 2. Test Tasks and Environment
## 2.1 Test Tasks
Two common text classification models: BOW and CNN.
**Batch size: all requests in this experiment use a batch size of 50.**
## 2.2 Test Environment
| | CPU model, cores | Memory |
| --- | --- | --- |
| Serving machine | Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz, 40 cores | 128G |
| Client machine | Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz, 40 cores | 128G |
Communication latency between the Serving side and the Client side: 0.102 ms
# 3. Net Overhead Test
This test characterizes the per-query time consumed when the serving side runs idle after the Serving framework is introduced, i.e. the overhead introduced by the framework itself.
"Running idle" means the serving side removes the actual computation time of the prediction but keeps the request unpacking and response packing logic.
| Model | Net overhead (ms) |
| --- | --- |
| BOW | 1 |
| CNN | 1 |
In C++ Serving mode, the time overhead introduced by the framework is small, about 1 ms.
# 4. Single-Thread Latency, QPS and Accuracy of the Inference Service Compared with Standalone Mode
This test verifies that Serving's accuracy, QPS and latency show no obvious anomaly compared with standalone mode.
<table>
<thead>
<tr>
<th> Model</th>
<th colspan=3> Serving (client and serving on the same machine)</th>
<th colspan=3> Standalone</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>QPS</td>
<td>Latency (ms)</td>
<td>Accuracy</td>
<td>QPS</td>
<td>Latency (ms)</td>
<td>Accuracy</td>
</tr>
<tr>
<td>BOW</td>
<td> 265.393 </td>
<td>3</td>
<td>0.84348</td>
<td>715.973366</td>
<td>1.396700</td>
<td>0.843480</td>
</tr>
<tr>
<td>CNN</td>
<td>23.3002</td>
<td>42</td>
<td>0.8962</td>
<td>25.372693</td>
<td>39.412450</td>
<td>0.896200</td>
</tr>
</tbody>
</table>
<br>
Accuracy: prediction accuracy under Serving mode is identical to standalone mode.
QPS: this depends on the model. For the BOW model, whose inference time is very short, the framework overhead and fixed network communication time account for the vast majority of a single request, so QPS under Serving mode drops significantly compared with standalone mode. For the CNN model, whose inference time is longer, the framework overhead and network communication time account for a small share of a single request, so QPS under Serving mode is close to standalone mode. This also confirms the expectation in Section 1.
# 5. Serving Scalability
The Serving scalability test means, for each model:
1) Fix the number of system threads used by brpc on the serving side
2) Keep increasing the number of concurrent client requests
3) After running for a while, the client records the QPS, average latency, and latency at each percentile under the current setting
4) Serving and the client run on different machines; communication latency between the Serving side and the Client side is 0.102 ms
## 5.1 Test Conclusions
1) When the model is complex and its inference time is long (inference time > 10 ms, e.g. the CNN model in the experiments above), Paddle Serving provides good linear scalability.
2) When the model is simple and its inference time is short (inference time < 10 ms, e.g. the BOW model in the experiments above), QPS growth becomes erratic as the number of serving-side threads increases, with no clear linear trend. Presumably, because the inference time is short, thread switching and framework overhead dominate; so although QPS grows with the thread count, it drops again once concurrency becomes large.
3) The number of server-side threads N should be chosen by combining three factors: the maximum number of concurrent requests, the number of machine cores, and the length of the inference time.
4) When testing models on GPU, if the model inference time is short, the number of server-side threads should not be too large (1-4x the number of cores); otherwise the overhead of thread switching becomes non-negligible.
5) When testing models on GPU, if the model inference time is long, the number of server-side threads should be somewhat larger (4-20x the number of cores). Since model inference is a blocking operation for the CPU, the current thread blocks and waits there (similar to a Sleep); if all threads are blocked in the inference stage, no idle threads remain to run brpc's "coroutine workers".
6) If the machine environment allows, the number of server-side threads should equal or be slightly smaller than the maximum concurrency.
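The conclusions above can be condensed into a toy helper; the thresholds and multipliers below simply restate the rules of thumb of this section under stated assumptions and are not a tuned formula:

```python
def suggest_server_threads(num_cores, infer_ms, max_concurrency, use_gpu):
    """Rough server thread-count heuristic from the conclusions above."""
    if use_gpu:
        # Inference blocks the thread while the GPU works, so oversubscribe:
        # short inference -> about 1-4x cores, long inference -> about 4-20x cores.
        factor = 4 if infer_ms < 10 else 20
        n = num_cores * factor
    else:
        # On CPU, threads should at least cover the cores.
        n = num_cores
    # Threads should equal or be slightly below the peak concurrency.
    return min(n, max_concurrency)

# Example: the CNN model (~42 ms inference) on the 40-core test machine.
print(suggest_server_threads(num_cores=40, infer_ms=42,
                             max_concurrency=40, use_gpu=True))  # -> 40
```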
## 5.2 Test Data: BOW Model
### Serving with 4 Threads
| Concurrency | QPS | Total time (ms) | Avg latency (ms) | P80 latency (ms) | P90 latency (ms) | P99 latency (ms) | P99.9 latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 561.325 | 3563 | 7.1265 | 9 | 11 | 23 | 62 |
| 8 | 807.428 | 4954 | 9.9085 | 7 | 10 | 24 | 31 |
| 12 | 894.721 | 6706 | 13.4123 | 18 | 22 | 41 | 61 |
| 16 | 993.542 | 8052 | 16.1057 | 22 | 28 | 47 | 75 |
| 20 | 834.725 | 11980 | 23.9615 | 32 | 40 | 64 | 81 |
| 24 | 649.316 | 18481 | 36.962 | 50 | 67 | 149 | 455 |
| 28 | 709.975 | 19719 | 39.438 | 53 | 76 | 159 | 293 |
| 32 | 661.868 | 24174 | 48.3495 | 62 | 90 | 294 | 560 |
| 36 | 551.234 | 32654 | 65.3081 | 83 | 129 | 406 | 508 |
| 40 | 525.155 | 38084 | 76.1687 | 99 | 143 | 464 | 567 |
### Serving with 8 Threads
| Concurrency | QPS | Total time (ms) | Avg latency (ms) | P80 latency (ms) | P90 latency (ms) | P99 latency (ms) | P99.9 latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 397.693 | 5029 | 10.0585 | 11 | 15 | 75 | 323 |
| 8 | 501.567 | 7975 | 15.9515 | 18 | 25 | 113 | 327 |
| 12 | 598.027 | 10033 | 20.0663 | 24 | 33 | 125 | 390 |
| 16 | 691.384 | 11571 | 23.1427 | 31 | 42 | 105 | 348 |
| 20 | 468.099 | 21363 | 42.7272 | 53 | 74 | 232 | 444 |
| 24 | 424.553 | 28265 | 56.5315 | 67 | 102 | 353 | 448 |
| 28 | 587.692 | 23822 | 47.6457 | 61 | 83 | 287 | 494 |
| 32 | 692.911 | 23091 | 46.1833 | 66 | 94 | 184 | 389 |
| 36 | 809.753 | 22229 | 44.4581 | 59 | 76 | 256 | 556 |
| 40 | 762.108 | 26243 | 52.4869 | 74 | 98 | 290 | 475 |
### Serving with 12 Threads
| Concurrency | QPS | Total time (ms) | Avg latency (ms) | P80 latency (ms) | P90 latency (ms) | P99 latency (ms) | P99.9 latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 442.478 | 4520 | 9.0405 | 12 | 15 | 31 | 46 |
| 8 | 497.884 | 8034 | 16.0688 | 19 | 25 | 130 | 330 |
| 12 | 797.13 | 7527 | 15.0552 | 16 | 22 | 162 | 326 |
| 16 | 674.707 | 11857 | 23.7154 | 30 | 42 | 229 | 455 |
| 20 | 489.956 | 20410 | 40.8209 | 49 | 68 | 304 | 437 |
| 24 | 452.335 | 26529 | 53.0582 | 66 | 85 | 341 | 414 |
| 28 | 753.093 | 18590 | 37.1812 | 50 | 65 | 184 | 421 |
| 32 | 932.498 | 18278 | 36.5578 | 48 | 62 | 109 | 337 |
| 36 | 932.498 | 19303 | 38.6066 | 54 | 70 | 110 | 164 |
| 40 | 921.532 | 21703 | 43.4066 | 59 | 75 | 125 | 451 |
### Serving with 16 Threads
| Concurrency | QPS | Total time (ms) | Avg latency (ms) | P80 latency (ms) | P90 latency (ms) | P99 latency (ms) | P99.9 latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 559.597 | 3574 | 7.1485 | 9 | 11 | 24 | 56 |
| 8 | 896.66 | 4461 | 8.9225 | 12 | 15 | 23 | 42 |
| 12 | 1014.37 | 5915 | 11.8305 | 16 | 20 | 34 | 63 |
| 16 | 1046.98 | 7641 | 15.2837 | 21 | 28 | 48 | 64 |
| 20 | 1188.64 | 8413 | 16.8276 | 23 | 31 | 55 | 71 |
| 24 | 1013.43 | 11841 | 23.6833 | 34 | 41 | 63 | 86 |
| 28 | 933.769 | 14993 | 29.9871 | 41 | 52 | 91 | 149 |
| 32 | 930.665 | 17192 | 34.3844 | 48 | 60 | 97 | 137 |
| 36 | 880.153 | 20451 | 40.9023 | 57 | 72 | 118 | 142 |
| 40 | 939.144 | 21296 | 42.5938 | 59 | 75 | 126 | 163 |
### Serving with 20 Threads
| Concurrency | QPS | Total time (ms) | Avg latency (ms) | P80 latency (ms) | P90 latency (ms) | P99 latency (ms) | P99.9 latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 686.813 | 2912 | 5.825 | 7 | 9 | 18 | 54 |
| 8 | 1016.26 | 3936 | 7.87375 | 10 | 13 | 24 | 33 |
| 12 | 1282.87 | 4677 | 9.35483 | 12 | 15 | 35 | 73 |
| 16 | 1253.13 | 6384 | 12.7686 | 17 | 23 | 40 | 54 |
| 20 | 1276.49 | 7834 | 15.6696 | 22 | 28 | 53 | 90 |
| 24 | 1273.34 | 9424 | 18.8497 | 26 | 35 | 66 | 93 |
| 28 | 1258.31 | 11126 | 22.2535 | 31 | 41 | 71 | 133 |
| 32 | 1027.95 | 15565 | 31.1308 | 43 | 54 | 81 | 103 |
| 36 | 912.316 | 19730 | 39.4612 | 52 | 66 | 106 | 131 |
| 40 | 808.865 | 24726 | 49.4539 | 64 | 79 | 144 | 196 |
### Serving with 24 Threads
| Concurrency | QPS | Total time (ms) | Avg latency (ms) | P80 latency (ms) | P90 latency (ms) | P99 latency (ms) | P99.9 latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 635.728 | 3146 | 6.292 | 7 | 10 | 22 | 48 |
| 8 | 1089.03 | 3673 | 7.346 | 9 | 11 | 21 | 40 |
| 12 | 1087.55 | 5056 | 10.1135 | 13 | 17 | 41 | 51 |
| 16 | 1251.17 | 6394 | 12.7898 | 17 | 24 | 39 | 54 |
| 20 | 1241.31 | 8056 | 16.1136 | 21 | 29 | 51 | 72 |
| 24 | 1327.29 | 9041 | 18.0837 | 24 | 33 | 59 | 77 |
| 28 | 1066.02 | 13133 | 26.2664 | 37 | 47 | 84 | 109 |
| 32 | 1034.33 | 15469 | 30.9384 | 41 | 51 | 94 | 115 |
| 36 | 896.191 | 20085 | 40.1708 | 55 | 68 | 110 | 168 |
| 40 | 701.508 | 28510 | 57.0208 | 74 | 88 | 142 | 199 |
### Serving with 28 Threads
| Concurrency | QPS | Total time (ms) | Avg latency (ms) | P80 latency (ms) | P90 latency (ms) | P99 latency (ms) | P99.9 latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 592.944 | 3373 | 6.746 | 8 | 10 | 21 | 56 |
| 8 | 1050.14 | 3809 | 7.619 | 9 | 12 | 22 | 41 |
| 12 | 1220.75 | 4915 | 9.83133 | 13 | 16 | 26 | 51 |
| 16 | 1178.38 | 6789 | 13.579 | 19 | 24 | 41 | 65 |
| 20 | 1184.97 | 8439 | 16.8789 | 23 | 30 | 51 | 72 |
| 24 | 1234.95 | 9717 | 19.4341 | 26 | 34 | 53 | 94 |
| 28 | 1162.31 | 12045 | 24.0908 | 33 | 40 | 70 | 208 |
| 32 | 1160.35 | 13789 | 27.5784 | 39 | 47 | 75 | 97 |
| 36 | 991.79 | 18149 | 36.2987 | 50 | 61 | 91 | 110 |
| 40 | 952.336 | 21001 | 42.0024 | 58 | 69 | 105 | 136 |
### Serving with 32 Threads
| Concurrency | QPS | Total time (ms) | Avg latency (ms) | P80 latency (ms) | P90 latency (ms) | P99 latency (ms) | P99.9 latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 654.879 | 3054 | 6.109 | 7 | 9 | 18 | 39 |
| 8 | 959.463 | 4169 | 8.33925 | 11 | 13 | 24 | 39 |
| 12 | 1222.99 | 4906 | 9.81367 | 13 | 16 | 30 | 39 |
| 16 | 1314.71 | 6085 | 12.1704 | 16 | 20 | 35 | 42 |
| 20 | 1390.63 | 7191 | 14.3837 | 19 | 24 | 40 | 69 |
| 24 | 1370.8 | 8754 | 17.5096 | 24 | 30 | 45 | 62 |
| 28 | 1213.8 | 11534 | 23.0696 | 31 | 37 | 60 | 79 |
| 32 | 1178.2 | 13580 | 27.1601 | 38 | 45 | 68 | 82 |
| 36 | 1167.69 | 15415 | 30.8312 | 42 | 51 | 77 | 92 |
| 40 | 950.841 | 21034 | 42.0692 | 55 | 65 | 96 | 137 |
### Serving with 36 Threads
| Concurrency | QPS | Total time (ms) | Avg latency (ms) | P80 latency (ms) | P90 latency (ms) | P99 latency (ms) | P99.9 latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 611.06 | 3273 | 6.546 | 7 | 10 | 23 | 63 |
| 8 | 948.992 | 4215 | 8.43 | 10 | 13 | 38 | 87 |
| 12 | 1081.47 | 5548 | 11.0972 | 15 | 18 | 31 | 37 |
| 16 | 1319.7 | 6062 | 12.1241 | 16 | 21 | 35 | 64 |
| 20 | 1246.73 | 8021 | 16.0434 | 22 | 28 | 41 | 47 |
| 24 | 1210.04 | 9917 | 19.8354 | 28 | 34 | 54 | 70 |
| 28 | 1013.46 | 13814 | 27.6296 | 37 | 47 | 83 | 125 |
| 32 | 1104.44 | 14487 | 28.9756 | 41 | 49 | 72 | 88 |
| 36 | 1089.32 | 16524 | 33.0495 | 45 | 55 | 83 | 107 |
| 40 | 940.115 | 21274 | 42.5481 | 58 | 68 | 101 | 138 |
### Serving with 40 Threads
| Concurrency | QPS | Total time (ms) | Avg latency (ms) | P80 latency (ms) | P90 latency (ms) | P99 latency (ms) | P99.9 latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 610.314 | 3277 | 6.555 | 8 | 11 | 20 | 57 |
| 8 | 1065.34 | 4001 | 8.0035 | 10 | 12 | 23 | 29 |
| 12 | 1177.86 | 5632 | 11.2645 | 14 | 18 | 33 | 310 |
| 16 | 1252.74 | 6386 | 12.7723 | 17 | 22 | 40 | 63 |
| 20 | 1290.16 | 7751 | 15.5036 | 21 | 27 | 47 | 66 |
| 24 | 1153.07 | 10407 | 20.8159 | 28 | 36 | 64 | 81 |
| 28 | 1300.39 | 10766 | 21.5326 | 30 | 37 | 60 | 78 |
| 32 | 1222.4 | 13089 | 26.1786 | 36 | 45 | 75 | 99 |
| 36 | 1141.55 | 15768 | 31.5374 | 43 | 52 | 83 | 121 |
| 40 | 1125.24 | 17774 | 35.5489 | 48 | 57 | 93 | 190 |
The figure below shows how Paddle Serving's QPS on the BOW model changes as the number of serving-side threads increases. When the thread count is small (4/8/12 threads), the QPS pattern is quite erratic; when the thread count is larger, the QPS curves largely converge, with essentially no linear growth.
![](../images/qps-threads-bow.png)
(Right-click to view the full-size image in a new window)
## 5.3 Test Data: CNN Model
### Serving with 4 Threads
| Concurrency | QPS | Total time (ms) | Avg latency (ms) | P80 latency (ms) | P90 latency (ms) | P99 latency (ms) | P99.9 latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 81.9437 | 24407 | 47 | 55 | 64 | 80 | 91 |
| 8 | 142.486 | 28073 | 53 | 65 | 71 | 86 | 106 |
| 12 | 173.732 | 34536 | 66 | 79 | 86 | 105 | 126 |
| 16 | 174.894 | 45742 | 89 | 101 | 109 | 131 | 151 |
| 20 | 172.58 | 57944 | 113 | 129 | 138 | 159 | 187 |
| 24 | 178.216 | 67334 | 132 | 147 | 158 | 189 | 283 |
| 28 | 171.315 | 81721 | 160 | 180 | 192 | 223 | 291 |
| 32 | 178.17 | 89802 | 176 | 195 | 208 | 251 | 288 |
| 36 | 173.762 | 103590 | 204 | 227 | 241 | 278 | 309 |
| 40 | 177.335 | 112781 | 223 | 246 | 262 | 296 | 315 |
### Serving with 8 Threads
| Concurrency | QPS | Total time (ms) | Avg latency (ms) | P80 latency (ms) | P90 latency (ms) | P99 latency (ms) | P99.9 latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 86.2999 | 23175 | 44 | 50 | 54 | 72 | 92 |
| 8 | 143.73 | 27830 | 53 | 65 | 71 | 83 | 91 |
| 12 | 178.471 | 33619 | 65 | 77 | 85 | 106 | 144 |
| 16 | 180.485 | 44325 | 86 | 99 | 108 | 131 | 149 |
| 20 | 180.466 | 55412 | 108 | 122 | 131 | 153 | 170 |
| 24 | 174.452 | 68787 | 134 | 151 | 162 | 189 | 214 |
| 28 | 174.158 | 80387 | 157 | 175 | 186 | 214 | 236 |
| 32 | 172.857 | 92562 | 182 | 202 | 214 | 244 | 277 |
| 36 | 172.171 | 104547 | 206 | 228 | 241 | 275 | 304 |
| 40 | 174.435 | 114656 | 226 | 248 | 262 | 306 | 338 |
### Serving with 12 Threads
| Concurrency | QPS | Total time (ms) | Avg latency (ms) | P80 latency (ms) | P90 latency (ms) | P99 latency (ms) | P99.9 latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 85.6274 | 23357 | 45 | 50 | 55 | 75 | 105 |
| 8 | 137.632 | 29063 | 55 | 67 | 73 | 88 | 134 |
| 12 | 187.793 | 31950 | 61 | 73 | 79 | 94 | 123 |
| 16 | 211.512 | 37823 | 73 | 87 | 94 | 113 | 134 |
| 20 | 206.624 | 48397 | 93 | 109 | 118 | 145 | 217 |
| 24 | 209.933 | 57161 | 111 | 128 | 137 | 157 | 190 |
| 28 | 198.689 | 70462 | 137 | 154 | 162 | 186 | 205 |
| 32 | 214.024 | 74758 | 146 | 165 | 176 | 204 | 228 |
| 36 | 223.947 | 80376 | 158 | 177 | 189 | 222 | 282 |
| 40 | 226.045 | 88478 | 174 | 193 | 204 | 236 | 277 |
### Serving with 16 Threads
| Concurrency | QPS | Total time (ms) | Avg latency (ms) | P80 latency (ms) | P90 latency (ms) | P99 latency (ms) | P99.9 latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 82.9119 | 24122 | 45 | 52 | 60 | 79 | 99 |
| 8 | 145.82 | 27431 | 51 | 63 | 69 | 85 | 114 |
| 12 | 193.287 | 31042 | 59 | 71 | 77 | 92 | 139 |
| 16 | 240.428 | 33274 | 63 | 76 | 82 | 99 | 127 |
| 20 | 249.457 | 40087 | 77 | 91 | 99 | 127 | 168 |
| 24 | 263.673 | 45511 | 87 | 102 | 110 | 136 | 186 |
| 28 | 272.729 | 51333 | 99 | 115 | 123 | 147 | 189 |
| 32 | 269.515 | 59366 | 115 | 132 | 140 | 165 | 192 |
| 36 | 267.4 | 67315 | 131 | 148 | 157 | 184 | 220 |
| 40 | 264.939 | 75489 | 147 | 164 | 173 | 200 | 235 |
### Serving with 20 Threads
| Concurrency | QPS | Total time (ms) | Avg latency (ms) | P80 latency (ms) | P90 latency (ms) | P99 latency (ms) | P99.9 latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 85.5615 | 23375 | 44 | 49 | 55 | 73 | 101 |
| 8 | 148.765 | 26888 | 50 | 61 | 69 | 84 | 97 |
| 12 | 196.11 | 30595 | 57 | 70 | 75 | 88 | 108 |
| 16 | 241.087 | 33183 | 63 | 76 | 82 | 98 | 115 |
| 20 | 291.24 | 34336 | 65 | 66 | 78 | 99 | 114 |
| 24 | 301.515 | 39799 | 76 | 90 | 97 | 122 | 194 |
| 28 | 314.303 | 44543 | 86 | 101 | 109 | 132 | 173 |
| 32 | 327.486 | 48857 | 94 | 109 | 118 | 143 | 196 |
| 36 | 320.422 | 56176 | 109 | 125 | 133 | 157 | 190 |
| 40 | 325.399 | 61463 | 120 | 137 | 145 | 174 | 216 |
### Serving with 24 Threads
| Concurrency | QPS | Total time (ms) | Avg latency (ms) | P80 latency (ms) | P90 latency (ms) | P99 latency (ms) | P99.9 latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 85.6568 | 23349 | 45 | 50 | 57 | 72 | 110 |
| 8 | 154.919 | 25820 | 48 | 57 | 66 | 81 | 95 |
| 12 | 221.992 | 27028 | 51 | 61 | 69 | 85 | 100 |
| 16 | 272.889 | 29316 | 55 | 68 | 74 | 89 | 101 |
| 20 | 300.906 | 33233 | 63 | 75 | 81 | 95 | 108 |
| 24 | 326.735 | 36727 | 69 | 82 | 87 | 102 | 114 |
| 28 | 339.057 | 41291 | 78 | 92 | 99 | 119 | 137 |
| 32 | 346.868 | 46127 | 88 | 103 | 110 | 130 | 155 |
| 36 | 338.429 | 53187 | 102 | 117 | 124 | 146 | 170 |
| 40 | 320.919 | 62321 | 119 | 135 | 144 | 176 | 226 |
### Serving with 28 Threads
| Concurrency | QPS | Total time (ms) | Avg latency (ms) | P80 latency (ms) | P90 latency (ms) | P99 latency (ms) | P99.9 latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 87.8773 | 22759 | 43 | 48 | 52 | 76 | 112 |
| 8 | 154.524 | 25886 | 49 | 58 | 66 | 82 | 100 |
| 12 | 192.709 | 31135 | 59 | 72 | 78 | 93 | 112 |
| 16 | 253.59 | 31547 | 59 | 72 | 79 | 95 | 129 |
| 20 | 288.367 | 34678 | 65 | 78 | 84 | 100 | 122 |
| 24 | 307.653 | 39005 | 73 | 84 | 92 | 116 | 313 |
| 28 | 334.105 | 41903 | 78 | 90 | 97 | 119 | 140 |
| 32 | 348.25 | 45944 | 86 | 99 | 107 | 132 | 164 |
| 36 | 355.661 | 50610 | 96 | 110 | 118 | 143 | 166 |
| 40 | 350.957 | 56987 | 109 | 124 | 133 | 165 | 221 |
### Serving with 32 Threads
| Concurrency | QPS | Total time (ms) | Avg latency (ms) | P80 latency (ms) | P90 latency (ms) | P99 latency (ms) | P99.9 latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 87.4088 | 22881 | 43 | 48 | 52 | 70 | 86 |
| 8 | 150.733 | 26537 | 50 | 60 | 68 | 85 | 102 |
| 12 | 197.433 | 30390 | 57 | 70 | 75 | 90 | 106 |
| 16 | 250.917 | 31883 | 60 | 73 | 78 | 94 | 121 |
| 20 | 286.369 | 34920 | 66 | 78 | 84 | 102 | 131 |
| 24 | 306.029 | 39212 | 74 | 85 | 92 | 110 | 134 |
| 28 | 323.902 | 43223 | 81 | 93 | 100 | 122 | 143 |
| 32 | 341.559 | 46844 | 89 | 102 | 111 | 136 | 161 |
| 36 | 341.077 | 52774 | 98 | 113 | 124 | 158 | 193 |
| 40 | 357.814 | 55895 | 107 | 122 | 133 | 166 | 196 |
### Serving with 36 Threads
| Concurrency | QPS | Total time (ms) | Avg latency (ms) | P80 latency (ms) | P90 latency (ms) | P99 latency (ms) | P99.9 latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 86.9036 | 23014 | 44 | 49 | 53 | 72 | 112 |
| 8 | 158.964 | 25163 | 48 | 55 | 63 | 79 | 91 |
| 12 | 205.086 | 29256 | 55 | 68 | 75 | 91 | 168 |
| 16 | 238.173 | 33589 | 61 | 73 | 79 | 100 | 158 |
| 20 | 279.705 | 35752 | 67 | 79 | 86 | 106 | 129 |
| 24 | 318.294 | 37701 | 71 | 82 | 89 | 108 | 129 |
| 28 | 336.296 | 41630 | 78 | 89 | 97 | 119 | 194 |
| 32 | 360.295 | 44408 | 84 | 97 | 105 | 130 | 154 |
| 36 | 353.08 | 50980 | 96 | 113 | 123 | 152 | 179 |
| 40 | 362.286 | 55205 | 105 | 122 | 134 | 171 | 247 |
### Serving with 40 Threads
| Concurrency | QPS | Total time (ms) | Avg latency (ms) | P80 latency (ms) | P90 latency (ms) | P99 latency (ms) | P99.9 latency (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 87.7347 | 22796 | 44 | 48 | 54 | 73 | 114 |
| 8 | 150.483 | 26581 | 50 | 59 | 67 | 85 | 149 |
| 12 | 202.088 | 29690 | 56 | 69 | 75 | 90 | 102 |
| 16 | 250.485 | 31938 | 60 | 74 | 79 | 93 | 113 |
| 20 | 289.62 | 34528 | 65 | 77 | 83 | 102 | 132 |
| 24 | 314.408 | 38167 | 72 | 83 | 90 | 110 | 125 |
| 28 | 321.728 | 43515 | 83 | 95 | 104 | 132 | 159 |
| 32 | 335.022 | 47758 | 90 | 104 | 114 | 141 | 166 |
| 36 | 341.452 | 52716 | 101 | 117 | 129 | 170 | 231 |
| 40 | 347.953 | 57479 | 109 | 130 | 143 | 182 | 216 |
The figure below shows how Paddle Serving's QPS on the CNN model changes as the number of serving-side threads increases. As the thread count grows, Serving QPS shows fairly clear linear growth. The chart can be read as follows: for example, with 16 threads, maximum QPS is reached at about 20 concurrent requests, after which QPS stays roughly stable as concurrency increases; with 24 threads, maximum QPS is reached at about 28 concurrent requests, after which QPS again stays roughly stable.
![](../images/qps-threads-cnn.png)
(Right-click to view the full-size image in a new window)
# Hot Loading of Models in Paddle Serving
(简体中文|[English](./Hot_Loading_EN.md))
## Background
......
# Hot Loading in Paddle Serving
([简体中文](./Hot_Loading_CN.md)|English)
## Background
......
...@@ -7,19 +7,17 @@ Paddle Serving服务端目前提供了支持Http直接访问的功能,本文
The BRPC server supports access over HTTP. Every language has libraries for making HTTP requests, so languages with incomplete BRPC support, such as Java/Python/Go, can access the server directly over HTTP for prediction.
### HTTP Mode
Basic flow and principle: the client packs the data into the body of an HTTP request in the format agreed by the proto file (see [`core/general-server/proto/general_model_service.proto`](../../core/general-server/proto/general_model_service.proto)).
The BRPC server then tries to deserialize proto-format data from the JSON string for subsequent processing.
### HTTP + protobuf Mode
All languages provide ProtoBuf support. If you are familiar with it, you can serialize the data with ProtoBuf first, put the serialized data into the HTTP request body, and then specify Content-Type: application/proto, thereby accessing the service with an http/h2 + protobuf binary payload.
Measurements show that as the data volume grows, both the payload size and the deserialization cost of JSON-based HTTP increase sharply. When your data volume is large, the HTTP + protobuf mode is recommended; it is already supported in the Java and Python clients.
**In theory, serialization/deserialization performance from highest to lowest is: protobuf > http/h2 + protobuf > http.**
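As an illustration of the HTTP + protobuf mode, a minimal Python sketch. It assumes `req` is a `Request` message generated from general_model_service.proto and already populated (code generation via protoc is assumed); the endpoint follows the curl example later in this document:

```python
import requests

# 'req' is a populated Request message generated from
# general_model_service.proto (generation via protoc is assumed here).
payload = req.SerializeToString()  # standard protobuf binary serialization
resp = requests.post(
    "http://127.0.0.1:9393/GeneralModelService/inference",
    data=payload,
    headers={"Content-Type": "application/proto"},  # binary proto, not JSON
)
```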
## Example
We take examples/C++/fit_a_line as an example to explain how to access the server over HTTP.
### Get the Model
...@@ -42,13 +40,13 @@ python3.6 -m paddle_serving_server.serve --model uci_housing_model --thread 10 -
To make it easy to request the server-side prediction service over HTTP, we have wrapped the common HTTP request-body packing, compression and request-encryption functionality into an HttpClient class for users.
Using HttpClient takes only four steps: 1. create an HttpClient object; 2. load the client-side prototxt configuration file (in this example, uci_housing_client/serving_client_conf.prototxt under the examples/C++/fit_a_line directory); 3. call the connect function; 4. call the Predict function to request the prediction service over HTTP.
In addition, you can configure the server IP, port and service name as needed (the service name must match the Service name and rpc method name in [`core/general-server/proto/general_model_service.proto`](../../core/general-server/proto/general_model_service.proto), i.e. the `GeneralModelService` and `inference` fields), enable Request-body compression, enable compressed Responses, use encrypted model prediction (the server must be configured for model encryption), set the response timeout, and so on.
For a Python HttpClient usage example, see [`examples/C++/fit_a_line/test_httpclient.py`](../../examples/C++/fit_a_line/test_httpclient.py); for the interface, see [`python/paddle_serving_client/httpclient.py`](../../python/paddle_serving_client/httpclient.py).
For a Java HttpClient usage example, see [`java/examples/src/main/java/PaddleServingClientExample.java`](../../java/examples/src/main/java/PaddleServingClientExample.java); for the interface, see [`java/src/main/java/io/paddle/serving/client/Client.java`](../../java/src/main/java/io/paddle/serving/client/Client.java).
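A minimal sketch of the four steps above, assuming the class exported by `python/paddle_serving_client/httpclient.py` is `HttpClient` with `load_client_config`/`connect`/`predict` methods; the names follow the description above, so check test_httpclient.py for the exact interface:

```python
from paddle_serving_client.httpclient import HttpClient

# Step 1: create the client object.
client = HttpClient()
# Step 2: load the client-side prototxt configuration.
client.load_client_config("uci_housing_client/serving_client_conf.prototxt")
# Step 3: connect to the server.
client.connect(["127.0.0.1:9393"])
# Step 4: request the prediction service over HTTP.
# x_data is one 13-feature fit_a_line sample prepared beforehand.
fetch_map = client.predict(feed={"x": x_data}, fetch=["price"])
```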
If this does not meet your needs, you can also add features on top of it.
...@@ -64,7 +62,7 @@ curl -XPOST http://0.0.0.0:9393/GeneralModelService/inference -d ' {"tensor":[{"
```
Here `127.0.0.1:9393` is the IP and port; set them according to the IP and port your server was started with.
The `GeneralModelService` and `inference` fields are the Service name and rpc method name in the proto file; see [`core/general-server/proto/general_model_service.proto`](../../core/general-server/proto/general_model_service.proto) for details.
The request body follows -d. The JSON must contain the required fields of the proto above; otherwise the conversion fails and the request is rejected.
......
# Inference Protocols
C++ Serving builds its service on BRPC and supports BRPC, GRPC and RESTful requests. The request data is in protobuf format; see `core/general-server/proto/general_model_service.proto` for details. This article describes how to build requests and parse results.
## Tensor
A Tensor can carry data of many types and is the basic unit of a Request and a Response. Tensor is defined as follows:
```protobuf
message Tensor {
// VarType: INT64
repeated int64 int64_data = 1;
// VarType: FP32
repeated float float_data = 2;
// VarType: INT32
repeated int32 int_data = 3;
// VarType: FP64
repeated double float64_data = 4;
// VarType: UINT32
repeated uint32 uint32_data = 5;
// VarType: BOOL
repeated bool bool_data = 6;
// (No support)VarType: COMPLEX64, 2x represents the real part, 2x+1
// represents the imaginary part
repeated float complex64_data = 7;
// (No support)VarType: COMPLEX128, 2x represents the real part, 2x+1
// represents the imaginary part
repeated double complex128_data = 8;
// VarType: STRING
repeated string data = 9;
// Element types:
// 0 => INT64
// 1 => FP32
// 2 => INT32
// 3 => FP64
// 4 => INT16
// 5 => FP16
// 6 => BF16
// 7 => UINT8
// 8 => INT8
// 9 => BOOL
// 10 => COMPLEX64
// 11 => COMPLEX128
// 20 => STRING
int32 elem_type = 10;
// Shape of the tensor, including batch dimensions.
repeated int32 shape = 11;
// Level of data(LOD), support variable length data, only for fetch tensor
// currently.
repeated int32 lod = 12;
// Correspond to the variable 'name' in the model description prototxt.
string name = 13;
// Correspond to the variable 'alias_name' in the model description prototxt.
string alias_name = 14; // get from the Model prototxt
// VarType: FP16, INT16, INT8, BF16, UINT8
bytes tensor_content = 15;
};
```
- elem_type: data type; FLOAT32, INT64, INT32, UINT8, INT8 and FLOAT16 are currently supported
|elem_type|Type|
|---------|----|
|0|INT64|
|1|FLOAT32|
|2|INT32|
|3|FP64|
|4|INT16|
|5|FP16|
|6|BF16|
|7|UINT8|
|8|INT8|
- shape: dimensions of the data
- lod: LoD information. A LoD (Level-of-Detail) Tensor is an advanced Paddle feature that extends Tensor to support more flexible, variable-length data input; see [LOD](../LOD_CN.md) for details
- name/alias_name: name and alias name, corresponding to the variables in the model description prototxt
### Building a FLOAT32 Tensor
```C
// Raw input data
std::vector<float> float_data;
Tensor *tensor = new Tensor;
// Set the shape; multiple dimensions may be set
for (uint32_t j = 0; j < float_shape.size(); ++j) {
  tensor->add_shape(float_shape[j]);
}
// Set the LoD information
for (uint32_t j = 0; j < float_lod.size(); ++j) {
  tensor->add_lod(float_lod[j]);
}
// Set the element type, name and alias name
tensor->set_elem_type(1);
tensor->set_name(name);
tensor->set_alias_name(alias_name);
// Copy the data
int total_number = float_data.size();
tensor->mutable_float_data()->Resize(total_number, 0);
memcpy(tensor->mutable_float_data()->mutable_data(), float_data.data(), total_number * sizeof(float));
```
### Building an INT8 Tensor
```C
// Raw input data (the INT8 payload is carried as a byte string)
std::string string_data;
Tensor *tensor = new Tensor;
for (uint32_t j = 0; j < string_shape.size(); ++j) {
  tensor->add_shape(string_shape[j]);
}
for (uint32_t j = 0; j < string_lod.size(); ++j) {
  tensor->add_lod(string_lod[j]);
}
tensor->set_elem_type(8);
tensor->set_name(name);
tensor->set_alias_name(alias_name);
tensor->set_tensor_content(string_data);
```
## Request
A Request is the data a client sends to the server; it uses Tensor as its basic data unit and carries additional request information. It is defined as follows:
```protobuf
message Request {
repeated Tensor tensor = 1;
repeated string fetch_var_names = 2;
bool profile_server = 3;
uint64 log_id = 4;
};
```
- fetch_var_names: names of the outputs to fetch; GeneralResponseOP filters the outputs by this list. See the `alias_name` entries under the `fetch_var` field in the model file serving_client_conf.prototxt
- profile_server: debug switch; when enabled, performance information is returned
- log_id: request ID
### Building a Request
When sending requests over BRPC or GRPC, protobuf-format data is used and built as follows:
```C
Request req;
req.set_log_id(log_id);
for (auto &name : fetch_name) {
  req.add_fetch_var_names(name);
}
// Add a Tensor to the request
Tensor *tensor = req.add_tensor();
...
```
When using RESTful requests, JSON-format data can be used, in the following format:
```JSON
{"tensor":[{"float_data":[0.0137,-0.1136,0.2553,-0.0692,0.0582,-0.0727,-0.1583,-0.0584,0.6283,0.4919,0.1856,0.0795,-0.0332],"elem_type":1,"name":"x","alias_name":"x","shape":[1,13]}],"fetch_var_names":["price"],"log_id":0}
```
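A minimal sketch of sending the JSON above with Python's `requests` library; the server address and the `GeneralModelService/inference` path follow the RESTful conventions described in this document:

```python
import requests

request_body = {
    "tensor": [{
        "float_data": [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727,
                       -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332],
        "elem_type": 1, "name": "x", "alias_name": "x", "shape": [1, 13],
    }],
    "fetch_var_names": ["price"],
    "log_id": 0,
}
# GeneralModelService/inference matches the Service and rpc names in the proto.
resp = requests.post("http://127.0.0.1:9393/GeneralModelService/inference",
                     json=request_body)
print(resp.json())
```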
## Response
A Response is the result the server returns to the client, containing Tensor data, an error code, an error message, etc. It is defined as follows:
```protobuf
message Response {
repeated ModelOutput outputs = 1;
repeated int64 profile_time = 2;
// Error code
int32 err_no = 3;
// Error messages
string err_msg = 4;
};
message ModelOutput {
repeated Tensor tensor = 1;
string engine_name = 2;
}
```
- profile_time: performance information, returned when request->set_profile_server(true) is set
- err_no: error code; see `core/predictor/common/constant.h` for details
- err_msg: error message; see `core/predictor/common/constant.h` for details
- engine_name: name of the output node
|err_no|err_msg|
|---------|----|
|0|OK|
|-5000|"Paddle Serving Framework Internal Error."|
|-5001|"Paddle Serving Memory Alloc Error."|
|-5002|"Paddle Serving Array Overflow Error."|
|-5100|"Paddle Serving Op Inference Error."|
### Reading Response Data
```C
uint32_t model_num = res.outputs_size();
for (uint32_t m_idx = 0; m_idx < model_num; ++m_idx) {
  // Output of the m_idx-th model; shape/lod/float_data/string_data below
  // are buffers declared by the caller.
  const ModelOutput &output = res.outputs(m_idx);
  std::string engine_name = output.engine_name();
  int idx = 0;
  // Read the tensor shape
  int shape_size = output.tensor(idx).shape_size();
  for (int i = 0; i < shape_size; ++i) {
    shape[i] = output.tensor(idx).shape(i);
  }
  // Read the LoD information
  int lod_size = output.tensor(idx).lod_size();
  if (lod_size > 0) {
    lod.resize(lod_size);
    for (int i = 0; i < lod_size; ++i) {
      lod[i] = output.tensor(idx).lod(i);
    }
  }
  // Read float data
  int size = output.tensor(idx).float_data_size();
  float_data = std::vector<float>(
      output.tensor(idx).float_data().begin(),
      output.tensor(idx).float_data().begin() + size);
  // Read int8 data
  string_data = output.tensor(idx).tensor_content();
}
```
# A Brief Introduction to C++ Serving
## Applicable Scenarios
C++ Serving focuses on performance. If you want to build an industrial-grade, high-performance online inference service with requirements on high concurrency and low latency, the C++ Serving framework may suit you better. Currently, whether using synchronous or asynchronous mode, [C++ Serving holds the performance advantage over TensorFlow Serving](./Benchmark_CN.md).
The C++ Serving network framework uses brpc, and its core execution engine is written in C/C++. It provides strong industrial-grade capabilities, including model hot loading, encrypted model deployment, A/B testing, multi-model composition, synchronous/asynchronous modes, and clients in multiple languages and protocols.
## 1. Network Framework (BRPC)
C++ Serving uses the [brpc framework](https://github.com/apache/incubator-brpc) for client/server communication. brpc is an RPC network framework open-sourced by Baidu, featuring high concurrency and low latency; it already backs more than a million online inference instances and thousands of online inference services at Baidu, and is stable and reliable. Compared with the gRPC framework, it has lower latency and higher concurrency, and its underlying layer supports multiple protocols such as <mark>**brpc/grpc/http+json/http+proto**</mark>; its drawback is weaker cross-OS portability. For detailed framework overhead, see [C++ Serving Framework Performance Test](./Frame_Performance_CN.md).
## 2. Core Execution Engine
The core execution engine of C++ Serving is a directed acyclic graph (also called a [DAG](./DAG_CN.md)). Each node in the DAG (borrowing the concept of an operator from models, a DAG node is also called an [OP](./OP_CN.md)) represents one stage of the inference service. The DAG supports composing multiple OPs in series and in parallel, so that one service can integrate predictions from multiple models into a final result. The overall framework is shown in the figure below and is divided into the Client Side and the Server Side.
<p align="center">
<br>
<img src='../images/design_doc.png'>
<br>
<p>
### 2.1 Client Side
As shown in the figure, the client serializes the Request according to the ProtoBuf protocol via the Pybind API and sends it to the server through the BRPC client. The client then waits for the data returned by the server, deserializes it into normal data, and hands the result back to the caller.
### 2.2 Server Side
After receiving the serialized Request, the server deserializes it into normal data and enters the graph execution engine, executing each OP stage according to the defined DAG structure (the processing in each OP stage is user-defined: it can be pure data processing, or it can call the inference engine to run different models on the input data). When all OP stages in the DAG have finished, the result data is serialized and returned to the client.
### 2.3 Communication Data Format: ProtoBuf
Protocol Buffers (Protobuf for short) is a serialization framework from Google that is language- and platform-independent and highly extensible. Like all serialization frameworks, Protobuf can be used for data storage and communication protocols. Protobuf supports code generation for Java, Python, C++, Go, JavaNano, Ruby and C#. Protobuf's serialized output is much smaller than XML or JSON, and much faster to process.
C++ Serving defines the ProtoBuf used for communication between the Client Side and the Server Side; for a detailed description of the fields, see [Introduction to the C++ Serving ProtoBuf](./Inference_Protocols_CN.md).
## 3. Server-Side Features
### 3.1 Starting the Server
The core of the server side is a binary executable named serving produced by compiling the project code. Starting serving requires the user to specify some parameters (<mark>**e.g., network IP and port, number of brpc threads, which GPU to use, model file path, whether the model enables TRT or XPU inference, model precision settings, and so on**</mark>). Some parameters are passed directly on the command line, and others are written in the specified configuration file.
To help users start the C++ Serving server quickly, besides editing the configuration file and running the serving binary with command-line arguments yourself, we also provide a Python launcher script. Launching via the Python script still runs the serving binary underneath, but the script automatically does two things: 1. generates the configuration file; 2. builds a command line from the parameters to configure and runs the serving binary with it.
For more details and examples, see [Detailed Description of C++ Serving Parameter Configuration and Startup](../Serving_Configure_CN.md).
### 3.2 Synchronous/Asynchronous Mode
Synchronous mode is simple and direct, suitable when the model inference time is short and the batch of a single Request is already fairly large.
In synchronous mode, the number of server-side threads N = the number of inference engine instances N = the number of Requests processed simultaneously N; extra Requests have to wait until a current thread finishes before they are handled.
<p align="center">
<img src='../images/syn_mode.png' width = "350" height = "300">
<p>
Asynchronous mode is mainly suitable when the model supports multi-batch input (the maximum batch M can be specified via a configuration option), the batch of a single Request is small (batch << M), and a single inference takes a long time.
In asynchronous mode, the N server-side threads are only responsible for receiving Requests; the inference engine is actually invoked in the thread pool of the asynchronous framework, whose thread count can be specified via a configuration option. For ease of understanding, assume each Request has batch 1: the asynchronous framework takes as many Requests as possible, n (n ≤ M), from the request pool, merges them into 1 Request (batch = n), calls the inference engine once to get 1 Response (batch = n), and then splits it into n Responses as the returned results.
<p align="center">
<img src='../images/asyn_mode.png'>
<p>
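The merge-and-split behavior can be illustrated with a toy, single-threaded simulation. This only mirrors the description above (take up to M requests, run one merged inference, split the responses) and is not the framework's actual scheduler:

```python
from collections import deque

M = 32  # maximum batch the model accepts

def async_step(pending, infer_fn):
    """Drain up to M batch-1 requests, run one merged inference, split results."""
    taken = [pending.popleft() for _ in range(min(M, len(pending)))]
    merged = [sample for req in taken for sample in req]   # batch = n
    outputs = infer_fn(merged)                             # one engine call
    responses, i = [], 0
    for req in taken:                                      # split back per request
        responses.append(outputs[i:i + len(req)])
        i += len(req)
    return responses

# Example: 5 pending batch-1 requests, identity-like "model".
pending = deque([[x] for x in range(5)])
print(async_step(pending, infer_fn=lambda batch: [v * 10 for v in batch]))
```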
For more on mode parameter configuration and performance tuning, see [C++ Serving Performance Tuning](./Performance_Tuning_CN.md).
### 3.3 Multi-Model Composition
When a user needs to expose the combined results of multiple models as one service interface, the usual solution is to build two layers of services: an inner layer that runs model inference and an outer layer that chains the models and performs pre/post-processing. When the amount of transferred data is small, the performance cost of this approach is modest, but when the output data volume is large, the overhead of network transfer cannot be ignored (measured: a single transfer of 40 MB of data takes 160-170 ms of RPC time).
<p align="center">
<br>
<img src='../images/multi_model.png'>
<br>
<p>
The C++ Serving framework supports [custom DAGs](./Model_Ensemble_CN.md) to express serial/parallel composition among multiple models, and also supports [developing custom OP nodes in C++](./OP_CN.md). Compared with the two-layer-service approach to multi-model composition, handling multiple models inside one service saves one RPC network transfer and therefore improves performance, especially when the data transferred over RPC is large.
### 3.4 Model Management and Hot Loading
The C++ Serving engine supports model management, including multiple models and multiple model versions. To keep the inference service available while a model is being replaced, the model must be hot-loaded without interrupting the service. C++ Serving supports this feature and provides a tool that monitors newly produced models and updates the local model; for a concrete example, see [Hot Loading of Models in C++ Serving](./Hot_Loading_CN.md).
### 3.5 Model Encryption and Decryption
C++ Serving encrypts the model with a symmetric encryption algorithm and decrypts it in memory while the service loads the model. For now this provides basic model security and does not guarantee absolute security; users can build on our design to achieve a higher security level. See [Encrypted Model Inference with C++ Serving](./Encryption_CN.md).
## 4. Client-Side Features
### 4.1 A/B Test
After a model has been thoroughly evaluated offline, an online A/B test is usually needed to decide whether to launch the service at scale. The figure below shows the basic structure of an A/B test with Paddle Serving: once the client is properly configured, it automatically distributes traffic to different servers, completing the A/B test. For a concrete example, see [How to Use Paddle Serving for an A/B Test](./ABTest_CN.md).
<p align="center">
<br>
<img src='../images/abtest.png' width = "345" height = "230">
<br>
<p>
### 4.2 Multi-Language, Multi-Protocol Clients
The BRPC network framework supports [multiple underlying communication protocols](#1-network-framework-brpc). With the current C++ Serving server, clients in any language, and even curl, can pack and send data according to the protocols above (for the exact protocols supported, see the [brpc site](https://github.com/apache/incubator-brpc)), and the server can receive, process, and return the results.
For the supported protocols, we provide some Client SDK examples for reference and use. Users can develop new Client SDKs as needed, and we also welcome contributions of Client SDKs in other languages/protocols (e.g. GRPC-Go, GRPC-C++, HTTP2-Go, HTTP2-Java) to our repository for other developers to learn from.
| Protocol | Speed | Supported | Client SDK provided |
|-------------|-----|---------|-------------------|
| BRPC | Fastest | Yes | [C++](../../core/general-client/README_CN.md), [Python (via Pybind)](../../examples/C++/fit_a_line/README_CN.md) |
| HTTP2+Proto | Fast | Yes | coming soon |
| GRPC | Fast | Yes | [Java](../../java/README_CN.md), [Python](../../examples/C++/fit_a_line/README_CN.md) |
| HTTP1+Proto | Medium | Yes | [Java](../../java/README_CN.md), [Python](../../examples/C++/fit_a_line/README_CN.md) |
| HTTP1+Json | Slow | Yes | [Java](../../java/README_CN.md), [Python](../../examples/C++/fit_a_line/README_CN.md), [Curl](Http_Service_CN.md) |
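As an example of the first row, a minimal Python (Pybind) client sketch in the style of the fit_a_line example, using the `paddle_serving_client` interface shown elsewhere in these docs; the feed/fetch names and sample values are the fit_a_line ones:

```python
import numpy as np
from paddle_serving_client import Client

client = Client()
client.load_client_config("uci_housing_client/serving_client_conf.prototxt")
client.connect(["127.0.0.1:9393"])

# One normalized 13-feature sample, as in the fit_a_line example.
x = np.array([[0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583,
               -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]],
             dtype=np.float32)
fetch_map = client.predict(feed={"x": x}, fetch=["price"], batch=True)
print(fetch_map["price"])
```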
# Model Ensemble in Paddle Serving
(简体中文|[English](./Model_Ensemble_EN.md))
In some scenarios, multiple models with the same input may be run in parallel and their predictions ensembled for a better prediction result; Paddle Serving provides this capability.
...@@ -14,7 +14,7 @@
Note that currently only multiple models with the same input and output formats are supported within one service. In this example, the input and output formats of the CNN model and the BOW model are the same.
The code used in the example is stored under the `examples/C++/imdb` path:
```shell
.
......
# Model Ensemble in Paddle Serving
([简体中文](Model_Ensemble_CN.md)|English)
In some scenarios, multiple models with the same input may be used to predict in parallel and integrate the predicted results for a better prediction effect. Paddle Serving also supports this feature.
...@@ -14,7 +14,7 @@ In this example (see the figure below), the server side predict the bow and CNN
It should be noted that at present, only multiple models with the same format input and output in the same service are supported. In this example, the input and output formats of the CNN and BOW models are the same.
The code used in the example is saved in the `examples/C++/imdb` path:
```shell
.
......
# How to Develop a New General Op?
(简体中文|[English](./OP_EN.md))
In this document, we focus on how to develop a new server-side operator for Paddle Serving. Before writing a new operator, let's look at some sample code to get the basic idea of writing a new operator for the server. We assume you already know the basic computation logic of the Paddle Serving server side. The code below can be found in the `core/general-server/op` directory of the Serving repository.
......
# How to write a general operator?
([简体中文](./OP_CN.md)|English)
In this document, we mainly focus on how to develop a new server-side operator for PaddleServing. Before we start to write a new operator, let's look at some sample code to get the basic idea of writing a new operator for the server. We assume you have known the basic computation logic on the server side of PaddleServing; please refer to []() if you do not know much about it. The following code can be visited at `core/general-server/op` of the Serving repo.
......
# C++ Serving Performance Analysis and Optimization
# 1. Background
1) First, make sure you know the common [features of C++ Serving](./Introduction_CN.md) and the [detailed description of C++ Serving parameter configuration and startup](../Serving_Configure_CN.md).
2) For a performance analysis and introduction of the C++ Serving framework itself, see [C++ Serving Framework Performance Test](./Frame_Performance_CN.md).
3) You need some knowledge of your model, machine environment, and the business to be deployed: e.g., whether you run inference on CPU or GPU; whether TRT acceleration can be enabled; how many cores your machine's CPU has; how many models your business involves; what processing each model's inputs and outputs need; your business's maximum online traffic; the maximum input batch your model supports; and so on.
# 2. Number of Server Threads
First, a larger server-side thread count N is not always better. As is well known, thread switching involves switching between user space and kernel space and carries a certain cost; with 1 core and 100,000 threads, frequent thread switching brings non-negligible overhead.
In the BRPC framework, the number of user-space coroutine workers M >> the number of threads N, and a coroutine worker can run on any thread. When an RPC network I/O operation yields the CPU, BRPC switches coroutine workers to improve the concurrency of the RPC framework. So in the extreme case where your code has no blocking I/O or network operations other than RPC communication, your thread count can simply equal the number of machine cores, and you need not worry about low CPU utilization from all N threads doing RPC network I/O.
Setting the <mark>**number of server-side threads N**</mark> requires considering three factors together:
## 2.1 Maximum Concurrent Requests M
When setting the server-side thread count N by the maximum number of concurrent requests, the data in [C++ Serving Framework Performance Test](./Frame_Performance_CN.md) suggests that <mark>**N should equal or be slightly smaller than the maximum concurrency M**</mark>, which minimizes average latency.
This is easy to understand. As an extreme example, if you only ever have 1 request at a time, setting the server thread count to 1 is optimal, because there is no thread-switching overhead at all; any thread count greater than 1 necessarily introduces switching overhead.
## 2.2 Number of Machine Cores C
When setting the server-side thread count N by the number of machine cores: threads are the smallest unit scheduled onto CPU cores, so to fully use all cores within one process, <mark>**the thread count should be at least >= the number of machine cores C**</mark>. The exact ratio N/C depends on the proportion of network, I/O, memory and computation in your code; generally you can tune it by measuring CPU utilization under different thread counts.
## 2.3 Model Inference Time T
When you run inference on CPU, the computation of the inference stage is done by the CPU; in that case, set the thread count according to the two factors above.
When you run inference on GPU, the situation differs: the inference-stage computation is done by the GPU while the CPU is idle, and the inference call blocks the thread (similar to a Sleep). If the thread count == the number of cores, there are no other threads to switch to, so some cores will inevitably sit idle. Concretely, when the model inference time is short (<10 ms), the server thread count should not be too large (1-10x the core count), otherwise the thread-switching overhead becomes non-negligible; when the model inference time is long, the server thread count should be somewhat larger (4-200x the core count).
# 3. Asynchronous Mode
When <mark>**most users' Request batch sizes << the model's maximum supported batch size**</mark>, the benefit of asynchronous mode is clear.
The principle of asynchronous mode is to decouple the model inference stage from the RPC threads: the model gets its own thread pool with a configurable thread count. When an RPC Request arrives, its data is placed into the Task queue of the model's thread pool; threads in the pool take data from Tasks, merge it into a batch, and run inference, thereby improving QPS. For a more detailed introduction, see [Introduction to C++ Serving](./Introduction_CN.md); for data comparing synchronous and asynchronous modes, see [C++ Serving vs TensorFlow Serving Performance Comparison](./Benchmark_CN.md). Under the test conditions above, asynchronous mode is 50% faster than synchronous mode.
Asynchronous mode can be enabled in the following two ways.
## 3.1 Starting the C++ Server via the Python Helper
With `python3 -m paddle_serving_server.serve`, add `--runtime_thread_num 2` to enable asynchronous mode for the model, where 2 is the number of threads in the model's asynchronous thread pool. The default value is 0, which means asynchronous mode is not used. Set the exact value of `--runtime_thread_num` according to the model, the data, and the available GPU memory.
Add `--batch_infer_size 32` to set the model's maximum allowed input batch to 32. This parameter only takes effect when asynchronous mode is enabled.
## 3.2 Starting the C++ Server via Command Line + Configuration File
The same effect can be achieved by modifying the `runtime_thread_num` and `batch_infer_size` fields in `model_toolkit.prototxt`.
# 4. Multi-Model Composition
When <mark>**your business needs to call multiple models for inference**</mark> and you pursue ultimate performance, consider using C++ Serving [custom OPs](./OP_CN.md) and [custom DAGs](./DAG_CN.md) to implement this.
## 4.1 Advantages
Combining the models within one service saves network I/O time and serialization/deserialization time; the benefit is significant especially when the data volume is large (measured: a single transfer of 40 MB of data takes 160-170 ms of RPC time).
## 4.2 Disadvantages
1) You need to use C++ to write custom OPs and a custom DAG to define the composition relationships between the models.
2) If pre/post-processing is needed between models, you also need to write that code in C++ between the OPs.
3) The server-side code must be recompiled.
## 4.3 Example
See the `C++ OCR Service` section of [examples/C++/PaddleOCR/ocr/README_CN.md](../../examples/C++/PaddleOCR/ocr/README_CN.md) and the example in [Model Ensemble in Paddle Serving](./Model_Ensemble_CN.md).
# How to compile PaddleServing
([简体中文](./COMPILE_CN.md)|English)
## Compilation environment requirements
| module | version |
| :--------------------------: | :-------------------------------: |
| OS | Ubuntu16 and 18/CentOS 7 |
| gcc | 5.4.0(Cuda 10.1) and 8.2.0 |
| gcc-c++ | 5.4.0(Cuda 10.1) and 8.2.0 |
| cmake | 3.2.0 and later |
| Python | 3.6.0 and later |
| Go | 1.9.2 and later |
| git | 2.17.1 and later |
| glibc-static | 2.17 |
| openssl-devel | 1.0.2k |
| bzip2-devel | 1.0.6 and later |
| python3-devel | 3.6.0 and later |
| sqlite-devel | 3.7.17 and later |
| patchelf | 0.9 |
| libXext | 1.3.3 |
| libSM | 1.2.2 |
| libXrender | 0.9.10 |
It is recommended to use Docker for compilation. We have prepared the Paddle Serving compilation environment for you; see [this document](DOCKER_IMAGES.md).
## Get Code
``` shell
git clone https://github.com/PaddlePaddle/Serving
cd Serving && git submodule update --init --recursive
```
## PYTHONROOT settings
```shell
# For example, the path of python is /usr/bin/python, you can set PYTHONROOT
export PYTHONROOT=/usr
```
If you are using a Docker development image, please follow the instructions below to determine the Python version to be compiled, and set the corresponding environment variables.
```
#Python3.6
export PYTHONROOT=/usr/local/
export PYTHON_INCLUDE_DIR=$PYTHONROOT/include/python3.6m
export PYTHON_LIBRARIES=$PYTHONROOT/lib/libpython3.6m.so
export PYTHON_EXECUTABLE=$PYTHONROOT/bin/python3.6
#Python3.7
export PYTHONROOT=/usr/local/
export PYTHON_INCLUDE_DIR=$PYTHONROOT/include/python3.7m
export PYTHON_LIBRARIES=$PYTHONROOT/lib/libpython3.7m.so
export PYTHON_EXECUTABLE=$PYTHONROOT/bin/python3.7
#Python3.8
export PYTHONROOT=/usr/local/
export PYTHON_INCLUDE_DIR=$PYTHONROOT/include/python3.8
export PYTHON_LIBRARIES=$PYTHONROOT/lib/libpython3.8.so
export PYTHON_EXECUTABLE=$PYTHONROOT/bin/python3.8
```
## Install Python dependencies
```shell
pip install -r python/requirements.txt -i https://mirror.baidu.com/pypi/simple
```
If you use other Python version, please use the right `pip` accordingly.
## GOPATH Setting
The default GOPATH is set to `$HOME/go`, you can also set it to other values. **If it is the Docker environment provided by Serving, you do not need to set up.**
```shell
export GOPATH=$HOME/go
export PATH=$PATH:$GOPATH/bin
```
## Get go packages
```shell
go env -w GO111MODULE=on
go env -w GOPROXY=https://goproxy.cn,direct
go get -u github.com/grpc-ecosystem/grpc-gateway/protoc-gen-grpc-gateway@v1.15.2
go get -u github.com/grpc-ecosystem/grpc-gateway/protoc-gen-swagger@v1.15.2
go get -u github.com/golang/protobuf/protoc-gen-go@v1.4.3
go get -u google.golang.org/grpc@v1.33.0
go env -w GO111MODULE=auto
```
## Compile Server
### Integrated CPU version paddle inference library
``` shell
mkdir server-build-cpu && cd server-build-cpu
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
-DSERVER=ON ..
make -j10
```
You can execute `make install` to put the targets under the `./output` directory; to do so, add `-DCMAKE_INSTALL_PREFIX=./output` to the cmake command shown above.
### Integrated GPU version paddle inference library
Compared with the CPU environment, the GPU environment additionally requires the variables in the following table.
**It should be noted that the following table is a reference for non-Docker compilation environments. The Docker compilation environment has already been configured with the relevant parameters, so they do not need to be specified in the cmake process.**
| cmake environment variable | meaning | GPU environment considerations | whether Docker environment is needed |
|-----------------------|-------------------------------------|-------------------------------|--------------------|
| CUDA_TOOLKIT_ROOT_DIR | cuda installation path, usually /usr/local/cuda | Required for all environments | No (/usr/local/cuda) |
| CUDNN_LIBRARY | The directory where libcudnn.so.* is located, usually /usr/local/cuda/lib64/ | Required for all environments | No (/usr/local/cuda/lib64/) |
| CUDA_CUDART_LIBRARY | The directory where libcudart.so.* is located, usually /usr/local/cuda/lib64/ | Required for all environments | No (/usr/local/cuda/lib64/) |
| TENSORRT_ROOT | The upper level directory of the directory where libnvinfer.so.* is located, depends on the TensorRT installation directory | Cuda 9.0/10.0 does not need, other needs | No (/usr) |
If you are not in a Docker environment, you can refer to the following approach. The specific paths depend on the current environment, and the code is for reference only. TENSORRT_LIBRARY_PATH is related to the TensorRT version and should be set according to the actual situation. For example, in the CUDA 10.1 environment the TensorRT version is 6.0 (/usr/local/TensorRT6-cuda10.1-cudnn7/targets/x86_64-linux-gnu/); in the CUDA 10.2 and CUDA 11.0 environments the TensorRT version is 7.1 (/usr/local/TensorRT-7.1.3.4/targets/x86_64-linux-gnu/).
``` shell
export CUDA_PATH='/usr/local/cuda'
export CUDNN_LIBRARY='/usr/local/cuda/lib64/'
export CUDA_CUDART_LIBRARY="/usr/local/cuda/lib64/"
export TENSORRT_LIBRARY_PATH="/usr/local/TensorRT6-cuda10.1-cudnn7/targets/x86_64-linux-gnu/"
mkdir server-build-gpu && cd server-build-gpu
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
-DCUDA_TOOLKIT_ROOT_DIR=${CUDA_PATH} \
-DCUDNN_LIBRARY=${CUDNN_LIBRARY} \
-DCUDA_CUDART_LIBRARY=${CUDA_CUDART_LIBRARY} \
-DTENSORRT_ROOT=${TENSORRT_LIBRARY_PATH} \
-DSERVER=ON \
-DWITH_GPU=ON ..
make -j10
```
Execute `make install` to put the target output in the `./output` directory.
### Compile C++ Server under the condition of WITH_OPENCV=ON
**Note:** This is only needed when you want to re-develop the Paddle Serving C++ part and the new code depends on the OpenCV library.
First of all, the OpenCV library should be installed; if it is not, please refer to the `Compile and install OpenCV` section later in this article.
In the compile command, add `-DOPENCV_DIR=${OPENCV_DIR}` and `-DWITH_OPENCV=ON`, for example:
``` shell
OPENCV_DIR=your_opencv_dir  # your_opencv_dir is the installation path of the OpenCV library
mkdir server-build-cpu && cd server-build-cpu
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR/ \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
-DOPENCV_DIR=${OPENCV_DIR} \
-DWITH_OPENCV=ON \
-DSERVER=ON ..
make -j10
```
**Note:** After the compilation is successful, you need to set the `SERVING_BIN` path, see the following [Notes](https://github.com/PaddlePaddle/Serving/blob/develop/doc/COMPILE.md#Notes).
## Compile Client
``` shell
mkdir client-build && cd client-build
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
-DCLIENT=ON ..
make -j10
```
Execute `make install` to put the targets under the `./output` directory.
## Compile the App
```bash
mkdir app-build && cd app-build
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
-DAPP=ON ..
make
```
## Install the wheel packages

Whether for the Client, Server or App part, after compiling, install the whl package under `python/dist/` in the temporary build directory of the compilation process (`server-build-cpu`, `server-build-gpu`, `client-build`, `app-build`).

For example: `cd server-build-cpu/python/dist && pip install -U xxxxx.whl`
## Note
When running the python server, the `SERVING_BIN` environment variable is checked. If you want to use a binary file you compiled yourself, set the environment variable to the path of the corresponding binary, usually `export SERVING_BIN=${BUILD_DIR}/core/general-server/serving`,
where BUILD_DIR is the absolute path of `server-build-cpu` or `server-build-gpu`.
For example: `cd server-build-cpu && export SERVING_BIN=${PWD}/core/general-server/serving`
## Verify
Please use the example under `python/examples` to verify.
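For instance, a minimal smoke test along these lines (the `fit_a_line` demo path and the model name are assumptions based on the examples referenced later in this document):

``` shell
cd python/examples/fit_a_line
sh get_data.sh   # fetch the uci_housing demo model and data
python -m paddle_serving_server.serve --model uci_housing_model --port 9393
```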
## CMake Option Description
| Compile Options | Description | Default |
| :--------------: | :----------------------------------------: | :--: |
| WITH_AVX | Compile Paddle Serving with AVX intrinsics | OFF |
| WITH_MKL | Compile Paddle Serving with MKL support | OFF |
| WITH_GPU | Compile Paddle Serving with NVIDIA GPU | OFF |
| WITH_OPENCV | Compile Paddle Serving with OPENCV | OFF |
| CUDNN_LIBRARY | Define CuDNN library and header path | |
| CUDA_TOOLKIT_ROOT_DIR | Define CUDA PATH | |
| TENSORRT_ROOT | Define TensorRT PATH | |
| CLIENT | Compile Paddle Serving Client | OFF |
| SERVER | Compile Paddle Serving Server | OFF |
| APP | Compile Paddle Serving App package | OFF |
| PACK | Compile for whl | OFF |
### WITH_GPU Option
Paddle Serving supports prediction on the GPU through the PaddlePaddle inference library. The WITH_GPU option is used to detect basic libraries such as CUDA/CUDNN on the system. If an appropriate version is detected, the GPU Kernel will be compiled when PaddlePaddle is compiled.
To compile the Paddle Serving GPU version on bare metal, you need to install these basic libraries:
- CUDA
- CuDNN
To compile the TensorRT version, you need to install the TensorRT library.
Note the following:

1. The basic library versions such as CUDA/CUDNN installed on the system where Serving is compiled need to be compatible with the actual GPU device. For example, the Tesla V100 card requires at least CUDA 9.0. If the version of a basic library such as CUDA used during compilation is too low, the generated GPU code will not be compatible with the actual hardware device, which will cause the Serving process to fail to start or lead to serious problems such as a coredump.
2. Install a CUDA driver compatible with the actual GPU device on the system running Paddle Serving, and install basic libraries compatible with the CUDA/CuDNN versions used during compilation. If the CUDA/CuDNN versions installed on the system running Paddle Serving are lower than the versions used at compile time, it may cause cuda function call failures and other problems.

For reference, the following are the basic library versions matched by each PaddlePaddle release:
| | CUDA | CuDNN | TensorRT |
| :----: | :-----: | :----------: | :----: |
| post101 | 10.1 | CuDNN 7.6.5 | 6.0.1 |
| post102 | 10.2 | CuDNN 8.0.5 | 7.1.3 |
| post11 | 11.0 | CuDNN 8.0.4 | 7.1.3 |
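These tags match the suffixes used by the released pip packages, so a hedged install example (the version number here is hypothetical) looks like:

``` shell
# Illustrative only: the version is hypothetical; the .postXXX suffix selects the CUDA/CuDNN/TensorRT stack above
pip install paddle-serving-server-gpu==0.6.0.post102
```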
### How to make the compiler detect the CuDNN library
Download the corresponding CuDNN version from the NVIDIA developer website and decompress it, then add `-DCUDNN_LIBRARY` to the cmake command to specify the path of CuDNN.
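A hedged sketch (the unpacked CuDNN location is illustrative):

``` shell
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
    -DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
    -DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
    -DCUDNN_LIBRARY=/path/to/unpacked/cudnn/lib64 \
    -DSERVER=ON \
    -DWITH_GPU=ON ..
```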
## Compile and install OpenCV
**Note:** You need to do this only if you need to use the OpenCV library in your C++ code.

* First, you need to download the OpenCV source package for the Linux environment from the OpenCV official website. Taking OpenCV3.4.7 as an example, the download command is as follows.
```
wget https://github.com/opencv/opencv/archive/3.4.7.tar.gz
tar -xf 3.4.7.tar.gz
```
After decompression, you can see the folder `opencv-3.4.7/` in the current directory.
* Compile OpenCV, the OpenCV source path (`root_path`) and installation path (`install_path`) should be set by yourself. Enter the OpenCV source code path and compile it in the following way.
```shell
root_path=your_opencv_root_path
install_path=${root_path}/opencv3
rm -rf build
mkdir build
cd build
cmake .. \
-DCMAKE_INSTALL_PREFIX=${install_path} \
-DCMAKE_BUILD_TYPE=Release \
-DBUILD_SHARED_LIBS=OFF \
-DWITH_IPP=OFF \
-DBUILD_IPP_IW=OFF \
-DWITH_LAPACK=OFF \
-DWITH_EIGEN=OFF \
-DCMAKE_INSTALL_LIBDIR=lib64 \
-DWITH_ZLIB=ON \
-DBUILD_ZLIB=ON \
-DWITH_JPEG=ON \
-DBUILD_JPEG=ON \
-DWITH_PNG=ON \
-DBUILD_PNG=ON \
-DWITH_TIFF=ON \
-DBUILD_TIFF=ON
make -j
make install
```
Among them, `root_path` is the path of the downloaded OpenCV source code, and `install_path` is the installation path of OpenCV. After `make install` completes, the OpenCV header files and library files are generated in this folder and are used for the later source-code compilation.
The final file structure under the OpenCV installation path is as follows.
```
opencv3/
|-- bin
|-- include
|-- lib
|-- lib64
|-- share
```
# How to compile PaddleServing

(简体中文|[English](./Compile_EN.md))

## Overview

Compiling Paddle Serving consists of the following steps:

- Compilation environment preparation: choose the most suitable image according to the model and the runtime environment
- Download the code repo: download the Serving code repo and run the initialization steps as needed
- Environment variable preparation: determine the Python environment variables required by the runtime environment; a GPU environment additionally requires environment variables such as Cuda, Cudnn and TensorRT
- Compilation: compile the `paddle-serving-server`, `paddle-serving-client` and `paddle-serving-app` whl packages
- Install the whl packages: install the three compiled whl packages and set the SERVING_BIN environment variable

In addition, for some C++ secondary-development scenarios, we also provide a build scheme linked with OPENCV.

## Compilation environment preparation
## 编译环境准备
| module | version |
| :--------------------------: | :-------------------------------: |
| OS | Ubuntu16 and 18/CentOS 7 |
| gcc | 5.4.0(Cuda 10.1) and 8.2.0 |
| gcc-c++ | 5.4.0(Cuda 10.1) and 8.2.0 |
| cmake | 3.2.0 and later |
| Python | 3.6.0 and later |
| Go | 1.17.2 and later |
| git | 2.17.1 and later |
| glibc-static | 2.17 |
| openssl-devel | 1.0.2k |
| bzip2-devel | 1.0.6 and later |
| python3-devel | 3.6.0 and later |
| sqlite-devel | 3.7.17 and later |
| patchelf | 0.9 |
| libXext | 1.3.3 |
| libSM | 1.2.2 |
| libXrender | 0.9.10 |
Compiling with Docker is recommended. We have prepared a Paddle Serving compilation environment for you, with all of the above compilation dependencies configured; see [this document](Docker_Images_CN.md) for details.

We provide development images for five environments: CPU, Cuda10.1+Cudnn7, Cuda10.2+Cudnn7, Cuda10.2+Cudnn8 and Cuda11.2+Cudnn8. The Serving development images cover all of these environments. At the same time, we also support the Paddle development images.

The Serving image name is **paddlepaddle/serving:${Serving dev image tag}** (if the network is poor, you can visit **registry.baidubce.com/paddlepaddle/serving:${Serving dev image tag}**), and the Paddle development image name is **paddlepaddle/paddle:${Paddle dev image tag}**. To keep users from confusing the two sets of images, we explain the origin of each.

The Serving development image is provided by the Serving suite to compile and debug prediction services in each supported prediction environment; the Paddle development image is the image released on the Paddle official website for compiling, developing and training models, so that Paddle developers can use Serving directly in the same container. Developers who already used Serving in the previous release will find the Serving development image familiar, while developers from the Paddle training-framework ecosystem will be more familiar with the existing Paddle development images. To suit the habits of all users, we fully support both sets of images.

| Environment | Serving dev image tag | OS | Paddle dev image tag | OS |
| :--------------------------: | :-------------------------------: | :-------------: | :-------------------: | :----------------: |
| CPU | 0.7.0-devel | Ubuntu 16.04 | 2.2.0 | Ubuntu 18.04 |
| Cuda10.1+Cudnn7 | 0.7.0-cuda10.1-cudnn7-devel | Ubuntu 16.04 | None | None |
| Cuda10.2+Cudnn7 | 0.7.0-cuda10.2-cudnn7-devel | Ubuntu 16.04 | 2.2.0-cuda10.2-cudnn7 | Ubuntu 16.04 |
| Cuda10.2+Cudnn8 | 0.7.0-cuda10.2-cudnn8-devel | Ubuntu 16.04 | None | None |
| Cuda11.2+Cudnn8 | 0.7.0-cuda11.2-cudnn8-devel | Ubuntu 16.04 | 2.2.0-cuda11.2-cudnn8 | Ubuntu 18.04 |

First pull the image for the environment you need. In the **Environment** column above, everything except CPU (the Cuda**+Cudnn** rows) is a GPU environment.

You can use the Serving development image:
```
docker pull paddlepaddle/serving:${Serving dev image tag}

# For GPU images
nvidia-docker run --rm -it paddlepaddle/serving:${Serving dev image tag} bash
# For CPU images
docker run --rm -it paddlepaddle/serving:${Serving dev image tag} bash
```

You can also use the Paddle development image:
```
docker pull paddlepaddle/paddle:${Paddle dev image tag}

# For GPU images, nvidia-docker is required
nvidia-docker run --rm -it paddlepaddle/paddle:${Paddle dev image tag} bash
# For CPU images
docker run --rm -it paddlepaddle/paddle:${Paddle dev image tag} bash
```
## Download the code repo

**Note: if you are using the Paddle development image, you need to manually run `bash tools/paddle_env_install.sh` after downloading the repo (as shown in the last line of the code box below)**
```
git clone https://github.com/PaddlePaddle/Serving
cd Serving && git submodule update --init --recursive

# The Paddle development image needs to run the following command; the Serving development image does not
bash tools/paddle_env_install.sh
```
## Environment variable preparation

**Set the PYTHON environment variables**

If you are using the Serving development image, determine the Python version you want to compile for and set the corresponding environment variables as follows. Three variables need to be set in total: `PYTHON_INCLUDE_DIR`, `PYTHON_LIBRARIES` and `PYTHON_EXECUTABLE`. Below we take python 3.7 as an example.

1) Set `PYTHON_INCLUDE_DIR`

Search for the directory containing Python.h
```
find / -name Python.h
```
Usually something like `**/include/python3.7/Python.h` appears; we only need its directory. For example, if you find `/usr/include/python3.7/Python.h`, then just `export PYTHON_INCLUDE_DIR=/usr/include/python3.7/`.
If nothing is found, it means either 1) the development version of Python is not installed and needs to be reinstalled, or 2) you do not have sufficient permissions to view the relevant system directories.

2) Set `PYTHON_LIBRARIES`

Search for libpython3.7.so
```
find / -name libpython3.7.so
```
Usually something like `**/lib/libpython3.7.so` or `**/lib/x86_64-linux-gnu/libpython3.7.so` appears; we only need its directory. For example, if you find `/usr/local/lib/libpython3.7.so`, then just `export PYTHON_LIBRARIES=/usr/local/lib`.
If nothing is found, it means either 1) Python was compiled statically and a dynamically compiled Python needs to be installed instead, or 2) you do not have sufficient permissions to view the relevant system directories.

3) Set `PYTHON_EXECUTABLE`

Check the python3.7 path directly
```
which python3.7
```
If the result is `/usr/local/bin/python3.7`, simply set `export PYTHON_EXECUTABLE=/usr/local/bin/python3.7`.
Setting these three environment variables correctly is essential. Once they are set, we can run the following commands (the values below match the PYTHON environment of the Paddle Cuda 11.2 development image; for other images, change `PYTHON_INCLUDE_DIR`, `PYTHON_LIBRARIES` and `PYTHON_EXECUTABLE` accordingly).

```
# The following three environment variables match the Paddle Cuda11.2 development image; other images may need changes
export PYTHON_INCLUDE_DIR=/usr/include/python3.7m/
export PYTHON_LIBRARIES=/usr/lib/x86_64-linux-gnu/libpython3.7m.so
export PYTHON_EXECUTABLE=/usr/bin/python3.7

export GOPATH=$HOME/go
export PATH=$PATH:$GOPATH/bin

python -m pip install -r python/requirements.txt

go env -w GO111MODULE=on
go env -w GOPROXY=https://goproxy.cn,direct
go install github.com/grpc-ecosystem/grpc-gateway/protoc-gen-grpc-gateway@v1.15.2
go install github.com/grpc-ecosystem/grpc-gateway/protoc-gen-swagger@v1.15.2
go install github.com/golang/protobuf/protoc-gen-go@v1.4.3
go install google.golang.org/grpc@v1.33.0
go env -w GO111MODULE=auto
```

If you are a GPU user, you additionally need to set `CUDA_PATH`, `CUDNN_LIBRARY`, `CUDA_CUDART_LIBRARY` and `TENSORRT_LIBRARY_PATH`:
```
export CUDA_PATH='/usr/local/cuda'
export CUDNN_LIBRARY='/usr/local/cuda/lib64/'
export CUDA_CUDART_LIBRARY="/usr/local/cuda/lib64/"
export TENSORRT_LIBRARY_PATH="/usr/"
```

The meaning of these environment variables is shown in the table below.

| cmake environment variable | meaning | GPU environment notes | needed in Docker environment |
|-----------------------|-------------------------------------|-------------------------------|--------------------|
| CUDA_TOOLKIT_ROOT_DIR | cuda installation path, usually /usr/local/cuda | required for all GPU environments | No (/usr/local/cuda) |
| CUDNN_LIBRARY | directory of libcudnn.so.*, usually /usr/local/cuda/lib64/ | required for all GPU environments | No (/usr/local/cuda/lib64/) |
| CUDA_CUDART_LIBRARY | directory of libcudart.so.*, usually /usr/local/cuda/lib64/ | required for all GPU environments | No (/usr/local/cuda/lib64/) |
| TENSORRT_ROOT | parent directory of the directory of libnvinfer.so.*, depends on the TensorRT installation directory | required for all GPU environments | No (/usr) |
## Compilation

We need to compile three targets in total: `paddle-serving-server`, `paddle-serving-client` and `paddle-serving-app`; `paddle-serving-server` comes in separate CPU and GPU versions.

### Compile paddle-serving-server

For the CPU version, run
```
mkdir build_server
cd build_server
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
    -DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
    -DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
    -DSERVER=ON \
    -DWITH_GPU=OFF ..
make -j20
cd ..
```
For the GPU version, run
```
mkdir build_server
cd build_server
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
    -DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
    -DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
    -DCUDA_TOOLKIT_ROOT_DIR=${CUDA_PATH} \
    -DCUDNN_LIBRARY=${CUDNN_LIBRARY} \
    -DCUDA_CUDART_LIBRARY=${CUDA_CUDART_LIBRARY} \
    -DTENSORRT_ROOT=${TENSORRT_LIBRARY_PATH} \
    -DSERVER=ON \
    -DWITH_GPU=ON ..
make -j20
cd ..
```
### Compile paddle-serving-client and paddle-serving-app

Next, we just continue with the client and app. The compilation commands for these two packages are the same on all platforms and do not distinguish between CPU and GPU versions.
```
# Compile paddle-serving-client
mkdir build_client
cd build_client
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
    -DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
    -DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
    -DCLIENT=ON ..
make -j10
cd ..

# Compile paddle-serving-app
mkdir build_app
cd build_app
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
    -DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
    -DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
    -DAPP=ON ..
make -j10
cd ..
```
## Install the whl packages
```
pip3.7 install build_server/python/dist/*.whl
pip3.7 install build_client/python/dist/*.whl
pip3.7 install build_app/python/dist/*.whl
export SERVING_BIN=${PWD}/build_server/core/general-server/serving
```
## Notes

Note the last line `export SERVING_BIN` in the previous section. When running the python server, the `SERVING_BIN` environment variable is checked; if you want to use a binary you compiled yourself, set this environment variable to the path of the corresponding binary, usually `export SERVING_BIN=${BUILD_DIR}/core/general-server/serving`,
where BUILD_DIR is the absolute path of `build_server`.
You can also cd into the build_server directory and execute `export SERVING_BIN=${PWD}/core/general-server/serving`.
## Compile the C++ Server with the WITH_OPENCV option enabled

**Note:** You only need to do this when you do secondary development on the Paddle Serving C++ part and the new code depends on the OpenCV library.

When compiling the Serving C++ Server part with the WITH_OPENCV option enabled, an installed OpenCV library is required. If it has not been installed yet, you can refer to the instructions later in this document to compile and install it.

Taking the compilation of the CPU version Paddle Inference Library with WITH_OPENCV enabled as an example, add the `-DOPENCV_DIR=${OPENCV_DIR}` and `-DWITH_OPENCV=ON` options on top of the compilation commands above.
``` shell
OPENCV_DIR=your_opencv_dir # `your_opencv_dir` is the installation path of the OpenCV library.
mkdir build_server && cd build_server
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR/ \
    -DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
    -DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
    -DOPENCV_DIR=${OPENCV_DIR} \
    -DWITH_OPENCV=ON \
    -DSERVER=ON ..
make -j10
```

**Note:** After the compilation succeeds, you need to set the `SERVING_BIN` path; see the [Notes](https://github.com/PaddlePaddle/Serving/blob/develop/doc/COMPILE_CN.md#注意事项) above for details.
## Appendix: CMake option description

| Compile option | Description | Default |
| :--------------: | :----------------------------------------: | :--: |
...@@ -252,11 +302,11 @@ Paddle Serving supports prediction on the GPU through the PaddlePaddle inference library. The WITH_GPU option
| post102 | 10.2 | CuDNN 8.0.5 | 7.1.3 |
| post11 | 11.0 | CuDNN 8.0.4 | 7.1.3 |

### Appendix: How to make the Paddle Serving build system detect the CuDNN library

After downloading the corresponding version of CuDNN from the NVIDIA developer website and decompressing it locally, add the `-DCUDNN_LIBRARY` parameter to the cmake compilation command to specify the path of the CuDNN library.

## Appendix: Compile and install the OpenCV library

**Note:** You only need to do this when you need to use the OpenCV library in your C++ code.

* First, download the OpenCV source package for the Linux environment from the OpenCV official website. Taking OpenCV3.4.7 as an example, the download command is as follows.
......
# How to compile PaddleServing
([简体中文](./Compile_CN.md)|English)
## Overview
Compiling Paddle Serving is divided into the following steps
- Compilation Environment Preparation: According to the needs of the model and operating environment, select the most suitable image
- Download the Serving Code Repo: Download the Serving code library, and perform initialization operations as needed
- Environment Variable Preparation: According to the needs of the running environment, determine the various environment variables of Python. For example, the GPU environment also needs to determine the environment variables such as Cuda, Cudnn, TensorRT and so on.
- Compilation: Compile `paddle-serving-server`, `paddle-serving-client`, `paddle-serving-app` related whl packages
- Install Related Whl Packages: install the three compiled whl packages, and set the SERVING_BIN environment variable
In addition, for some C++ secondary development scenarios, we also provide OPENCV binding solutions.
## Compilation Environment Requirements
| module | version |
| :--------------------------: | :-------------------------------: |
| OS | Ubuntu16 and 18/CentOS 7 |
| gcc | 5.4.0(Cuda 10.1) and 8.2.0 |
| gcc-c++ | 5.4.0(Cuda 10.1) and 8.2.0 |
| cmake | 3.2.0 and later |
| Python | 3.6.0 and later |
| Go | 1.17.2 and later |
| git | 2.17.1 and later |
| glibc-static | 2.17 |
| openssl-devel | 1.0.2k |
| bzip2-devel | 1.0.6 and later |
| python3-devel | 3.6.0 and later |
| sqlite-devel | 3.7.17 and later |
| patchelf | 0.9 |
| libXext | 1.3.3 |
| libSM | 1.2.2 |
| libXrender | 0.9.10 |
Docker compilation is recommended. We have prepared the Paddle Serving compilation environment for you and configured the above compilation dependencies. For details, please refer to [this document](Docker_Images_EN.md).

We provide development images for five environments, namely CPU, Cuda10.1+Cudnn7, Cuda10.2+Cudnn7, Cuda10.2+Cudnn8 and Cuda11.2+Cudnn8. The Serving development images cover all of the above environments. At the same time, we also support the Paddle development images.

The Serving image name is **paddlepaddle/serving:${Serving development image Tag}** (if the network is not good, you can visit **registry.baidubce.com/paddlepaddle/serving:${Serving development image Tag}**), and the name of the Paddle development image is **paddlepaddle/paddle:${Paddle Development Image Tag}**. To prevent users from confusing the two sets of images, we explain the origin of each.

The Serving development image is the image provided by the Serving suite for compiling and debugging prediction services in each supported prediction environment; the Paddle development image is the image released by Paddle on the official website for compiling, developing and training models, which allows Paddle developers to use Serving directly in the same container. Developers who already used Serving in the previous version will be familiar with the Serving development image, while developers familiar with the Paddle training-framework ecosystem will be more familiar with the existing Paddle development images. To adapt to the different habits of all users, we fully support both sets of images.
| Environment | Serving Dev Image Tag | OS | Paddle Dev Image Tag | OS |
| :--------------------------: | :-------------------------------: | :-------------: | :-------------------: | :----------------: |
| CPU | 0.7.0-devel | Ubuntu 16.04 | 2.2.0 | Ubuntu 18.04 |
| Cuda10.1+Cudnn7 | 0.7.0-cuda10.1-cudnn7-devel | Ubuntu 16.04 | None | None |
| Cuda10.2+Cudnn7 | 0.7.0-cuda10.2-cudnn7-devel | Ubuntu 16.04 | 2.2.0-cuda10.2-cudnn7 | Ubuntu 16.04 |
| Cuda10.2+Cudnn8 | 0.7.0-cuda10.2-cudnn8-devel | Ubuntu 16.04 | None | None |
| Cuda11.2+Cudnn8 | 0.7.0-cuda11.2-cudnn8-devel | Ubuntu 16.04 | 2.2.0-cuda11.2-cudnn8 | Ubuntu 18.04 |
We first need to pull the image for the environment we need. Under the **Environment** column in the table above, everything except CPU (the Cuda**+Cudnn** rows) belongs to the GPU environment.
You can use Serving Dev Images.
```
docker pull paddlepaddle/serving:${Serving Dev Image Tag}
# For GPU Image
nvidia-docker run --rm -it paddlepaddle/serving:${Serving Dev Image Tag} bash
# For CPU Image
docker run --rm -it paddlepaddle/serving:${Serving Dev Image Tag} bash
```
You can also use Paddle Dev Images.
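The commands parallel the Serving image workflow above; note that the GPU image requires nvidia-docker:

```
docker pull paddlepaddle/paddle:${Paddle Dev Image Tag}

# For GPU Image, nvidia-docker is required
nvidia-docker run --rm -it paddlepaddle/paddle:${Paddle Dev Image Tag} bash
# For CPU Image
docker run --rm -it paddlepaddle/paddle:${Paddle Dev Image Tag} bash
```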
## Download the Serving Code Repo
**Note: If you are using the Paddle development image, you need to manually run `bash tools/paddle_env_install.sh` after downloading the repo (as shown in the last line of the code box below)**
```
git clone https://github.com/PaddlePaddle/Serving
cd Serving && git submodule update --init --recursive
# The Paddle development image needs to run the following command; the Serving development image does not
bash tools/paddle_env_install.sh
```
## Environment Variables Preparation
**Set PYTHON environment variable**
If you are using a Serving development image, follow the steps below to determine the Python version to compile for and set the corresponding environment variables. A total of three environment variables need to be set, namely `PYTHON_INCLUDE_DIR`, `PYTHON_LIBRARIES` and `PYTHON_EXECUTABLE`. Below we take python 3.7 as an example to show how to set them.
1) Set `PYTHON_INCLUDE_DIR`
Search the directory where Python.h is located
```
find / -name Python.h
```
Usually there will be something like `**/include/python3.7/Python.h`; we only need its directory. For example, if you find `/usr/include/python3.7/Python.h`, then we only need `export PYTHON_INCLUDE_DIR=/usr/include/python3.7/`.
If it is not found, it means either 1) the development version of Python is not installed and needs to be reinstalled, or 2) you do not have sufficient permissions to view the relevant system directories.
2) Set `PYTHON_LIBRARIES`
Search for libpython3.7.so
```
find / -name libpython3.7.so
```
Usually there will be something similar to `**/lib/libpython3.7.so` or `**/lib/x86_64-linux-gnu/libpython3.7.so`; we only need its directory. For example, if you find `/usr/local/lib/libpython3.7.so`, then we only need `export PYTHON_LIBRARIES=/usr/local/lib`.
If it is not found, it means either 1) Python was compiled statically, and a dynamically compiled Python needs to be installed instead, or 2) you do not have sufficient permissions to view the relevant system directories.
3) Set `PYTHON_EXECUTABLE`
View the python3.7 path directly
```
which python3.7
```
If the result is `/usr/local/bin/python3.7`, then directly set `export PYTHON_EXECUTABLE=/usr/local/bin/python3.7`.
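As an alternative to the three manual searches, the interpreter itself can report these paths; a hedged convenience sketch (python3.7 assumed, `sysconfig` is in the standard library, and `PYTHON_LIBRARIES` follows the directory convention of step 2):

```
export PYTHON_EXECUTABLE=$(which python3.7)
export PYTHON_INCLUDE_DIR=$($PYTHON_EXECUTABLE -c "import sysconfig; print(sysconfig.get_paths()['include'])")
export PYTHON_LIBRARIES=$($PYTHON_EXECUTABLE -c "import sysconfig; print(sysconfig.get_config_var('LIBDIR'))")
```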
It is very important to set these three environment variables. After the settings are completed, we can perform the following operations (the following reflects the PYTHON environment of the Paddle Cuda 11.2 development image; for other images, change the corresponding `PYTHON_INCLUDE_DIR`, `PYTHON_LIBRARIES` and `PYTHON_EXECUTABLE`).
```
# The following three environment variables match the Paddle Cuda11.2 development image; other images may need changes
export PYTHON_INCLUDE_DIR=/usr/include/python3.7m/
export PYTHON_LIBRARIES=/usr/lib/x86_64-linux-gnu/libpython3.7m.so
export PYTHON_EXECUTABLE=/usr/bin/python3.7
export GOPATH=$HOME/go
export PATH=$PATH:$GOPATH/bin
python -m pip install -r python/requirements.txt
go env -w GO111MODULE=on
go env -w GOPROXY=https://goproxy.cn,direct
go install github.com/grpc-ecosystem/grpc-gateway/protoc-gen-grpc-gateway@v1.15.2
go install github.com/grpc-ecosystem/grpc-gateway/protoc-gen-swagger@v1.15.2
go install github.com/golang/protobuf/protoc-gen-go@v1.4.3
go install google.golang.org/grpc@v1.33.0
go env -w GO111MODULE=auto
```
If you are a GPU user, you additionally need to set `CUDA_PATH`, `CUDNN_LIBRARY`, `CUDA_CUDART_LIBRARY` and `TENSORRT_LIBRARY_PATH`.
```
export CUDA_PATH='/usr/local/cuda'
export CUDNN_LIBRARY='/usr/local/cuda/lib64/'
export CUDA_CUDART_LIBRARY="/usr/local/cuda/lib64/"
export TENSORRT_LIBRARY_PATH="/usr/"
```
The meaning of environment variables is shown in the table below.
| cmake environment variable | meaning | GPU environment considerations | whether Docker environment is needed |
|-----------------------|-------------------------------------|-------------------------------|--------------------|
| CUDA_TOOLKIT_ROOT_DIR | cuda installation path, usually /usr/local/cuda | Required for all GPU environments | No (/usr/local/cuda) |
| CUDNN_LIBRARY | The directory where libcudnn.so.* is located, usually /usr/local/cuda/lib64/ | Required for all GPU environments | No (/usr/local/cuda/lib64/) |
| CUDA_CUDART_LIBRARY | The directory where libcudart.so.* is located, usually /usr/local/cuda/lib64/ | Required for all GPU environments | No (/usr/local/cuda/lib64/) |
| TENSORRT_ROOT | The parent directory of the directory where libnvinfer.so.* is located, depending on the TensorRT installation directory | Required for all GPU environments | No (/usr) |
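Because a wrong path only surfaces late in the build, a quick pre-check can save a compile cycle; a minimal sketch, assuming the variables exported above:

```
# Hedged sanity check: each path should list at least one file
ls ${CUDNN_LIBRARY}/libcudnn.so*
ls ${CUDA_CUDART_LIBRARY}/libcudart.so*
ls ${TENSORRT_LIBRARY_PATH}   # should contain the TensorRT libraries
```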
## Compilation
We need to compile three targets in total, namely `paddle-serving-server`, `paddle-serving-client`, and `paddle-serving-app`, among which `paddle-serving-server` needs to distinguish between the CPU and GPU version. For the CPU version, run:
### Compile paddle-serving-server
```
mkdir build_server
cd build_server
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
-DSERVER=ON \
-DWITH_GPU=OFF ..
make -j20
cd ..
```
For the GPU version, run:
```
mkdir build_server
cd build_server
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
-DCUDA_TOOLKIT_ROOT_DIR=${CUDA_PATH} \
-DCUDNN_LIBRARY=${CUDNN_LIBRARY} \
-DCUDA_CUDART_LIBRARY=${CUDA_CUDART_LIBRARY} \
-DTENSORRT_ROOT=${TENSORRT_LIBRARY_PATH} \
-DSERVER=ON \
-DWITH_GPU=ON ..
make -j20
cd ..
```
### Compile paddle-serving-client and paddle-serving-app
Next, we can continue to compile the client and app. The compilation commands for these two packages are common on all platforms, and do not distinguish between CPU and GPU versions.
```
# Compile paddle-serving-client
mkdir build_client
cd build_client
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
-DCLIENT=ON ..
make -j10
cd ..
# Compile paddle-serving-app
mkdir build_app
cd build_app
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
-DAPP=ON ..
make -j10
cd ..
```
## Install Related Whl Packages
```
pip3.7 install build_server/python/dist/*.whl
pip3.7 install build_client/python/dist/*.whl
pip3.7 install build_app/python/dist/*.whl
export SERVING_BIN=${PWD}/build_server/core/general-server/serving
```
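A quick way to confirm the installation took effect is to import the three modules and check the binary path; a hedged sketch (module names assume the wheels built above):

```
python3.7 -c "import paddle_serving_server, paddle_serving_client, paddle_serving_app"
ls ${SERVING_BIN}   # should point at the compiled serving binary
```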
## Precautions
Note the last line `export SERVING_BIN` in the previous section. When running the python server, the `SERVING_BIN` environment variable is checked. If you want to use a binary file you compiled yourself, set the environment variable to the path of the corresponding binary, usually `export SERVING_BIN=${BUILD_DIR}/core/general-server/serving`,
where BUILD_DIR is the absolute path of `build_server`.
You can also `cd build_server` and execute `export SERVING_BIN=${PWD}/core/general-server/serving`
## Enable WITH_OPENCV option to compile C++ Server
**Note:** You only need to do this when you need to do secondary development on the Paddle Serving C++ part and the newly added code depends on the OpenCV library.
To compile the Serving C++ Server part, when the WITH_OPENCV option is turned on, the installed OpenCV library is required. If it has not been installed, you can refer to the instructions at the back of this document to compile and install the OpenCV library.
Taking the compilation of the CPU version Paddle Inference Library with WITH_OPENCV enabled as an example, add the `-DOPENCV_DIR=${OPENCV_DIR}` and `-DWITH_OPENCV=ON` options on top of the compilation commands above.
``` shell
OPENCV_DIR=your_opencv_dir #`your_opencv_dir` is the installation path of the opencv library.
mkdir build_server && cd build_server
cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR/ \
-DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
-DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
-DOPENCV_DIR=${OPENCV_DIR} \
-DWITH_OPENCV=ON \
-DSERVER=ON ..
make -j10
```
**Note:** After the compilation succeeds, you need to set the `SERVING_BIN` path; see the following [Notes](https://github.com/PaddlePaddle/Serving/blob/develop/doc/COMPILE_CN.md#Notes).
## Appendix: CMake option description
| Compilation Options | Description | Default |
| :--------------: | :----------------------------------------: | :--: |
| WITH_AVX | Compile Paddle Serving with AVX intrinsics | OFF |
| WITH_MKL | Compile Paddle Serving with MKL support | OFF |
| WITH_GPU | Compile Paddle Serving with NVIDIA GPU | OFF |
| WITH_TRT | Compile Paddle Serving with TensorRT | OFF |
| WITH_OPENCV | Compile Paddle Serving with OPENCV | OFF |
| CUDNN_LIBRARY | Define CuDNN library and header path | |
| CUDA_TOOLKIT_ROOT_DIR | Define CUDA PATH | |
| TENSORRT_ROOT | Define TensorRT PATH | |
| CLIENT | Compile Paddle Serving Client | OFF |
| SERVER | Compile Paddle Serving Server | OFF |
| APP | Compile Paddle Serving App package | OFF |
| PACK | Compile for whl | OFF |
### WITH_GPU option

Paddle Serving supports prediction on the GPU through the PaddlePaddle prediction library. The WITH_GPU option is used to detect basic libraries such as CUDA/CUDNN on the system. If a suitable version is detected, the GPU version of the OP Kernel will be compiled when PaddlePaddle is compiled.
To compile the Paddle Serving GPU version on bare metal, you need to install these basic libraries:
- CUDA
- CuDNN
To compile the TensorRT version, you need to install the TensorRT library.
The things to note here are:

1. The basic library versions such as CUDA/CUDNN installed on the system where Serving is compiled need to be compatible with the actual GPU device. For example, the Tesla V100 card requires at least CUDA 9.0. If the version of a basic library such as CUDA used during compilation is too low, the generated GPU code will not be compatible with the actual hardware device; the Serving process will fail to start, or serious problems such as a coredump may occur.
2. Install a CUDA driver compatible with the actual GPU device on the system running Paddle Serving, and install basic libraries compatible with the CUDA/CuDNN versions used during compilation. If the CUDA/CuDNN versions installed on the system running Paddle Serving are lower than the versions used during compilation, it may cause strange cuda function call failures and other problems.

The following is the matching relationship of Cuda, Cudnn and TensorRT in the PaddleServing images, for reference:
| | CUDA | CuDNN | TensorRT |
| :----: | :-----: | :----------: | :----: |
| post101 | 10.1 | CuDNN 7.6.5 | 6.0.1 |
| post102 | 10.2 | CuDNN 8.0.5 | 7.1.3 |
| post11 | 11.0 | CuDNN 8.0.4 | 7.1.3 |
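Both notes boil down to version alignment between compile time and run time. A hedged runtime check (paths assume the default CUDA layout):

``` shell
nvidia-smi   # shows the installed driver and the highest CUDA version it supports
# the runtime libraries the server will load should exist and match the compile-time versions
ls /usr/local/cuda/lib64/libcudart.so.* /usr/local/cuda/lib64/libcudnn.so.*
```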
### Appendix: How to make the Paddle Serving compilation system detect the CuDNN library
After downloading the corresponding version of CuDNN from the official website of NVIDIA developer and decompressing it locally, add the `-DCUDNN_LIBRARY` parameter to the cmake compilation command and specify the path of the CuDNN library.
## Appendix: Compile and install the OpenCV library
**Note:** You only need to do this when you need to include the OpenCV library in your C++ code.
* First, you need to download the OpenCV source package for the Linux environment from the OpenCV official website. Take OpenCV 3.4.7 as an example. The download command is as follows.
```
wget https://github.com/opencv/opencv/archive/3.4.7.tar.gz
tar -xf 3.4.7.tar.gz
```
Finally, you can see the folder `opencv-3.4.7/` in the current directory.
* Compile OpenCV, set the OpenCV source path (`root_path`) and installation path (`install_path`). Enter the OpenCV source code path and compile in the following way.
```shell
root_path=your_opencv_root_path
install_path=${root_path}/opencv3
rm -rf build
mkdir build
cd build
cmake .. \
-DCMAKE_INSTALL_PREFIX=${install_path} \
-DCMAKE_BUILD_TYPE=Release \
-DBUILD_SHARED_LIBS=OFF \
-DWITH_IPP=OFF \
-DBUILD_IPP_IW=OFF \
-DWITH_LAPACK=OFF \
-DWITH_EIGEN=OFF \
-DCMAKE_INSTALL_LIBDIR=lib64 \
-DWITH_ZLIB=ON \
-DBUILD_ZLIB=ON \
-DWITH_JPEG=ON \
-DBUILD_JPEG=ON \
-DWITH_PNG=ON \
-DBUILD_PNG=ON \
-DWITH_TIFF=ON \
-DBUILD_TIFF=ON
make -j
make install
```
Among them, `root_path` is the path of the downloaded OpenCV source code, and `install_path` is the installation path of OpenCV. After `make install` completes, the OpenCV header files and library files are generated in this folder; they are used to compile the code that references the OpenCV library.
The final file structure under the installation path is as follows.
```
opencv3/
|-- bin
|-- include
|-- lib
|-- lib64
|-- share
```
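When you later rebuild the server with `-DWITH_OPENCV=ON` (see the section above), `OPENCV_DIR` should point at this installation path; a hedged one-liner:

```
export OPENCV_DIR=${install_path}   # e.g. your_opencv_root_path/opencv3
```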
...@@ -68,7 +68,7 @@ Paddle Serving uses this [Git branching model](http://nvie.com/posts/a-successful-git-branching-model/)

1. Build and test

Users can build Paddle Serving natively on Linux, see the [BUILD steps](Compile_EN.md).

1. Keep pulling
......
# Cube: Sparse Parameter Indexing Service (Local Mode) User Guide

(简体中文|[English](./Cube_Local_EN.md))

## Introduction

Under python/examples there are two CTR examples, criteo_ctr and criteo_ctr_with_cube. The former saves the whole model during training, including the sparse parameters. The latter cuts the sparse parameters out and saves the model in two parts, one being the sparse parameters and the other the dense parameters. In industrial scenarios the scale of the sparse parameters is very large, reaching the order of 10^9, so launching large-scale sparse parameter prediction on a single machine is impractical. We therefore introduce Cube, Baidu's industrial-grade product built on many years of work in sparse parameter indexing, to provide a distributed sparse parameter service.

<!--The local mode of Cube is a weakened version of distributed Cube, intended for developers to run experiments and demos. If you need a distributed sparse parameter service, please continue with the Cube user guide (Cube_Local_CN.md) (under construction) after reading this document.-->

All models used in this document are original models without any compression; if you need to deploy a quantized model, please read the [Guide to Quantized Storage for Cube Sparse Parameter Indexing](./Cube_Quant_CN.md)

## Example

Run the following under Serving/examples/C++/PaddleRec/criteo_ctr_with_cube/
```
python local_train.py # train the model
cp ../../../build_server/core/predictor/seq_generator seq_generator # copy the Sequence File model generation tool
```
...@@ -96,7 +96,7 @@ cd cube

## Note: configuration file

An example cube.conf is under Serving/examples/C++/PaddleRec/criteo_ctr_with_cube/cube/conf; this file is used by the cube-cli mentioned above. Local-mode users can use it directly without studying this part; it matters more in distributed deployment.
```
[{
......
# Cube: Sparse Parameter Indexing Service (Local Mode)

([简体中文](./Cube_Local_CN.md)|English)

## Overview

There are two CTR examples under python/examples, criteo_ctr and criteo_ctr_with_cube. The former saves the entire model during training, including the sparse parameters. The latter cuts out the sparse parameters and saves the model in two parts, one being the sparse parameters and the other the dense parameters. Because the scale of sparse parameters is very large in industrial cases, reaching the order of 10^9, it is not practical to start large-scale sparse parameter prediction on one machine. We therefore introduced Cube, Baidu's industrial-grade sparse parameter indexing product refined over many years, to provide distributed sparse parameter services.

The local mode of Cube is different from distributed Cube; it is designed to be convenient for developers to use in experiments and demos.

<!--If there is a demand for a distributed sparse parameter service, please continue reading [Quantization Storage on Cube Sparse Parameter Indexing](./Cube_Quant_EN.md) after reading this document (still developing).-->

This document uses the original model without any compression algorithm. If you need to bring a quantized model online, please read the [Quantization Storage on Cube Sparse Parameter Indexing](./Cube_Quant_EN.md)

## Example

In directory Serving/examples/C++/PaddleRec/criteo_ctr_with_cube, run
```
python local_train.py # train model
```
...@@ -95,7 +95,7 @@ If you see that each key has a corresponding value output, it means that the del

## Appendix: Configuration

The config file cube.conf is located in Serving/examples/C++/PaddleRec/criteo_ctr_with_cube/cube/conf; it is used by cube-cli. Cube Local Mode users do not need to understand it and can just use it, but it is quite important in Cube Distributed Mode.
```
[{
......
# Guide to Quantized Storage for Cube Sparse Parameter Indexing

(简体中文|[English](./Cube_Quant_EN.md))

## Overview

...@@ -9,7 +9,7 @@

## Preconditions

Please first read the [Cube: Sparse Parameter Indexing Service (Local Mode) User Guide](./Cube_Local_CN.md)

## Components

...@@ -22,7 +22,7 @@

From the Serving main directory, go into the criteo_ctr_with_cube directory and train the model
```
cd Serving/examples/C++/PaddleRec/criteo_ctr_with_cube
python local_train.py # generate the model
```
Next, Sequence Files for Cube sparse parameter indexing can be generated in either a quantized or a non-quantized way.
...@@ -34,11 +34,11 @@ seq_generator ctr_serving_model/SparseFeatFactors ./cube_model/feature 8 # quantization

## Launch Serving with the quantized model

In Serving, the quantized model is used when the general_dist_kv_quant_infer op performs prediction; see Serving/examples/C++/PaddleRec/criteo_ctr_with_cube/test_server_quant.py for details. No changes are needed on the client side.

To make the demo easier for users, here is how to launch quantized-model Serving from scratch.
```
cd Serving/examples/C++/PaddleRec/criteo_ctr_with_cube
python local_train.py
cp ../../../build_server/core/predictor/seq_generator seq_generator
cp ../../../build_server/output/bin/cube* ./cube/
......
# Quantization Storage on Cube Sparse Parameter Indexing

([简体中文](./Cube_Quant_CN.md)|English)

## Overview

...@@ -8,7 +8,7 @@ In our previous article, we know that the sparse parameter is a series of floati

## Precondition

Please read [Cube: Sparse Parameter Indexing Service (Local Mode)](./Cube_Local_EN.md)

## Components

...@@ -21,7 +21,7 @@ This tool is used to convert the Paddle model into a Sequence File. Here, two mo

In the Serving directory, train the model in the criteo_ctr_with_cube directory
```
cd Serving/examples/C++/PaddleRec/criteo_ctr_with_cube
python local_train.py # save model
```
Next, you can use quantization or non-quantization to generate a Sequence File for Cube sparse parameter indexing.

...@@ -34,11 +34,11 @@ This command will convert the sparse parameter file SparseFeatFactors in the ctr

## Launch Serving by Quantized Model

In Serving, a quantized model is used when using the general_dist_kv_quant_infer op to make predictions. See Serving/examples/C++/PaddleRec/criteo_ctr_with_cube/test_server_quant.py for details. No changes are required on the client side.

In order to make the demo easier for users, the following script trains the quantized criteo ctr model and launches serving with it.
```
cd Serving/examples/C++/PaddleRec/criteo_ctr_with_cube
python local_train.py
cp ../../../build_server/core/predictor/seq_generator seq_generator
cp ../../../build_server/output/bin/cube* ./cube/
......
...@@ -2,7 +2,7 @@

### Background

Recommender systems need large-scale sparse parameter indexing to support distributed deployment; you can learn about recommendation models under `Serving/examples/C++/PaddleRec/criteo_ctr_with_cube` or at [PaddleRec](https://github.com/paddlepaddle/paddlerec).

The model format for sparse parameter indexing is SequenceFile, a key-value file format from the Hadoop ecosystem.

...@@ -13,7 +13,7 @@

### Prerequisites

- You need to be able to compile Paddle Serving; see the [compilation document](./Compile_EN.md)

### Usage
......
# Docker Images

(简体中文|[English](Docker_Images_EN.md))

This document maintains the list of images provided by Paddle Serving.

...@@ -39,7 +39,7 @@
| GPU (cuda10.1-cudnn7-tensorRT6-gcc54) development | Ubuntu16 | latest-cuda10.1-cudnn7-gcc54-devel (not ready) | [Dockerfile.cuda10.1-cudnn7-gcc54.devel](../tools/Dockerfile.cuda10.1-cudnn7-gcc54.devel) |
| GPU (cuda10.1-cudnn7-tensorRT6) development | Ubuntu16 | latest-cuda10.1-cudnn7-devel | [Dockerfile.cuda10.1-cudnn7.devel](../tools/Dockerfile.cuda10.1-cudnn7.devel) |
| GPU (cuda10.2-cudnn8-tensorRT7) development | Ubuntu16 | latest-cuda10.2-cudnn8-devel | [Dockerfile.cuda10.2-cudnn8.devel](../tools/Dockerfile.cuda10.2-cudnn8.devel) |
| GPU (cuda11.2-cudnn8-tensorRT7) development | Ubuntu18 | latest-cuda11.2-cudnn8-devel | [Dockerfile.cuda11.2-cudnn8.devel](../tools/Dockerfile.cuda11.2-cudnn8.devel) |

**Java image:**

...@@ -80,7 +80,7 @@ registry.baidubce.com/paddlepaddle/serving:xpu-x86 # for x86 xpu user

Running images:

Running images are more lightweight than development images: they provide the serving whl and bin, but leave out development tools such as cmake to keep the runtime image small. For more information, please check the document [Paddle Serving on Kubernetes](./Run_On_Kubernetes_CN.md).

| ENV | Python Version | Tag |
|------------------------------------------|----------------|-----------------------------|
......
# Docker Images

([简体中文](Docker_Images_CN.md)|English)

This document maintains a list of docker images provided by Paddle Serving.

...@@ -37,7 +37,7 @@ If you want to customize your Serving based on source code, use the version with
| GPU (cuda10.1-cudnn7-tensorRT6-gcc54) development | Ubuntu16 | latest-cuda10.1-cudnn7-gcc54-devel(not ready) | [Dockerfile.cuda10.1-cudnn7-gcc54.devel](../tools/Dockerfile.cuda10.1-cudnn7-gcc54.devel) |
| GPU (cuda10.1-cudnn7-tensorRT6) development | Ubuntu16 | latest-cuda10.1-cudnn7-devel | [Dockerfile.cuda10.1-cudnn7.devel](../tools/Dockerfile.cuda10.1-cudnn7.devel) |
| GPU (cuda10.2-cudnn8-tensorRT7) development | Ubuntu16 | latest-cuda10.2-cudnn8-devel | [Dockerfile.cuda10.2-cudnn8.devel](../tools/Dockerfile.cuda10.2-cudnn8.devel) |
| GPU (cuda11.2-cudnn8-tensorRT7) development | Ubuntu18 | latest-cuda11.2-cudnn8-devel | [Dockerfile.cuda11.2-cudnn8.devel](../tools/Dockerfile.cuda11.2-cudnn8.devel) |

**Java Client:**

...@@ -76,7 +76,8 @@ Develop Images:

Running Images:

Running Images are lighter than Develop Images; they consist of the serving whl and bin, but leave out development tools like cmake to keep the image size low. If you want to know more about them, please check the document [Paddle Serving on Kubernetes](./Run_On_Kubernetes_CN.md).

| ENV | Python Version | Tag |
|------------------------------------------|----------------|-----------------------------|
......
...@@ -142,7 +142,7 @@ make: *** [all] Error 2

#### Q: A CXXABI error appears during use.

This problem occurs because the gcc version used to build Python does not match the gcc version Serving requires. For Docker users, we recommend the [Docker container](./Run_In_Docker_CN.md): the Python inside the Docker container has been adapted to Serving before release, so this kind of error does not occur. For other development environments, first make sure GCC 8.2 is available; if not, refer to the following installation method

```bash
wget -q https://paddle-ci.gz.bcebos.com/gcc-8.2.0.tar.xz
```
...@@ -198,7 +198,7 @@ wget https://paddle-serving.bj.bcebos.com/others/centos_ssl.tar && \

(1) Cuda GPU driver: the file name is usually `libcuda.so.$DRIVER_VERSION`; for example, for driver version 440.10.15 the file name is `libcuda.so.440.10.15`.

(2) Cuda and Cudnn dynamic libraries: the file names are usually `libcudart.so.$CUDA_VERSION` and `libcudnn.so.$CUDNN_VERSION`; for example, Cuda9 is `libcudart.so.9.0` and Cudnn7 is `libcudnn.so.7`. For the version matching between Cuda/Cudnn and Serving, see the [list of all Serving images](Docker_Images_CN.md#%E9%99%84%E5%BD%95%E6%89%80%E6%9C%89%E9%95%9C%E5%83%8F%E5%88%97%E8%A1%A8).

(3) Cuda10.1 and later also require TensorRT. For a script that installs the TensorRT files, see [install_trt.sh](../tools/dockerfiles/build_scripts/install_trt.sh).

...@@ -232,15 +232,15 @@ InvalidArgumentError: Device id must be less than GPU count, but received id is:

#### Q: Which image environments does Paddle Serving currently support?

**A:** Currently (0.4.0) only CentOS is supported; see the full list [here](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Docker_Images_CN.md)

#### Q: The GCC version Python was built with does not match Serving's GCC version

**A:** 1) Use the [GPU docker](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Run_In_Docker_CN.md#gpunvidia-docker) to solve the environment problem; 2) change the gcc version of the python installed in the anaconda virtual environment, see [changing python's GCC build environment](https://www.jianshu.com/p/c498b3d86f77)

#### Q: Does paddle-serving support local offline installation?

**A:** Offline deployment is supported; the relevant [dependency packages](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Compile_CN.md) need to be prepared and installed in advance

#### Q: The difference between binding the server to 127.0.0.1 and 0.0.0.0 when starting inside Docker

**A:** You must make the container's main process bind to the special 0.0.0.0 "all interfaces" address, otherwise it cannot be reached from outside the container. Inside Docker, 127.0.0.1 means "this container", not "this machine". If you make an outbound connection to 127.0.0.1 from inside the container, it comes back to the same container; if you bind the server to 127.0.0.1, it cannot receive connections from outside.

...@@ -280,7 +280,7 @@ client.connect(["127.0.0.1:9393"])

#### Q: How to use the multi-language client

**A:** The multi-language client must be used together with the multi-language server. In the current version (0.4.0), the server needs to change Server to MultiLangServer (if started from the command line, just add the --use_multilang flag), and the Python client needs to change Client to MultiLangClient and remove the load_client_config step. [Java client reference document](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Java_SDK_CN.md)

#### Q: How to use Paddle Serving on Windows
......
# Introduction to the gRPC Interface

- [1. Comparison with the bRPC interface](#1-comparison-with-the-brpc-interface)
  - [1.1 Server side](#11-server-side)
  - [1.2 Client side](#12-client-side)
  - [1.3 Others](#13-others)
- [2. Example: linear regression prediction service](#2-example-linear-regression-prediction-service)
  - [Getting the data](#getting-the-data)
  - [Starting the gRPC server](#starting-the-grpc-server)
  - [Client-side prediction](#client-side-prediction)
    - [Synchronous prediction](#synchronous-prediction)
    - [Asynchronous prediction](#asynchronous-prediction)
    - [Batch prediction](#batch-prediction)
    - [Generic pb prediction](#generic-pb-prediction)
    - [Prediction timeout](#prediction-timeout)
    - [List input](#list-input)
- [3. More examples](#3-more-examples)

With the gRPC interface, the Client can be invoked in different languages on Win/Linux/MacOS. The gRPC interface is implemented with the following structure:

![](images/grpc_impl.png)

## 1. Comparison with the bRPC interface

#### 1.1 Server side

* Since the gRPC Server actually embeds a bRPC Client, the bRPC Client is initialized inside the gRPC Server. The gRPC Server's `load_model_config` function therefore adds a `client_config_path` parameter that specifies the path of the data-format configuration file used when initializing the bRPC Client (when `client_config_path` is not given it defaults to None, in which case `load_model_config` falls back to `<server_config_path>/serving_server_conf.prototxt`, i.e. the bRPC Client and the bRPC Server use the same data-format configuration file).

```
def load_model_config(self, server_config_paths, client_config_path=None)
```

In some examples the bRPC Server and bRPC Client configuration files differ (e.g. with cube local, the Client's data first goes to cube and is handed to the inference library only after cube has processed it), in which case the gRPC Server needs to set the client-side configuration `client_config_path` manually.
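As a sketch of the server-side call described above (the `MultiLangServer` name comes from the multi-language server mentioned in the FAQ earlier; the import path and the config paths here are assumptions for illustration):

```python
from paddle_serving_server import MultiLangServer  # import path assumed

server = MultiLangServer()
# With client_config_path omitted, the embedded bRPC client reuses
# <server_config_path>/serving_server_conf.prototxt.
# In a cube-local style setup, point client_config_path at the
# client-side prototxt explicitly (paths here are hypothetical):
server.load_model_config("serving_server",
                         client_config_path="cube_conf/serving_client_conf.prototxt")
```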
#### 1.2 Client side

* The gRPC Client drops the `load_client_config` step:

  The `connect` step fetches the corresponding prototxt over RPC (from any one of the endpoints).

* The gRPC Client sets the timeout over RPC (the call form stays the same as the bRPC Client's):

  Because the bRPC Client cannot change its timeout after `connect`, when the gRPC Server receives a timeout-change request it recreates the bRPC Client instance to apply the new bRPC Client timeout; the gRPC Client also sets gRPC's deadline.

  **Note: the timeout-setting interface and the Inference interface must not be called concurrently (not thread-safe); for performance reasons no lock is added for now.**

* The gRPC Client's `predict` function adds the `asyn` and `is_python` parameters:

```
def predict(self, feed, fetch, batch=True, need_variant_tag=False, asyn=False, is_python=True, log_id=0)
```

1. `asyn` selects asynchronous calling. With `asyn=True` the call is asynchronous and returns a `MultiLangPredictFuture` object; block on `MultiLangPredictFuture.result()` to obtain the prediction. With `asyn=False` the call is synchronous.
2. `is_python` selects the proto format. With `is_python=True`, data is transferred in numpy bytes format, currently Python-only; with `is_python=False`, data is transferred in the plain format, which is more general. Transfer in numpy bytes format costs far less time than the plain format (see [#654](https://github.com/PaddlePaddle/Serving/pull/654)).
3. `batch` selects whether the data needs an extra dimension. With `batch=True`, the feed data is kept as-is; with `batch=False`, a dimension is added, e.g. a feed of shape [2,2] is reshaped to [1,2,2].
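Putting these parameters together, a minimal client sketch might look as follows; the endpoint and the `x`/`price` feed and fetch names are assumptions borrowed from the fit_a_line example referenced below, not prescribed by this interface:

```python
import numpy as np
from paddle_serving_client import MultiLangClient

client = MultiLangClient()
client.connect(["127.0.0.1:9393"])  # prototxt is fetched over RPC, no load_client_config

x = np.random.rand(1, 13).astype("float32")  # one sample with 13 features (assumed shape)

# synchronous call (asyn defaults to False)
fetch_map = client.predict(feed={"x": x}, fetch=["price"], batch=True)

# asynchronous call: a future is returned; block on result() for the prediction
future = client.predict(feed={"x": x}, fetch=["price"], batch=True, asyn=True)
fetch_map = future.result()
print(fetch_map)
```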
#### 1.3 Others

* Exception handling: when the bRPC Client inside the gRPC Server fails to predict (returns `None`), the gRPC Client also returns None. Other gRPC exceptions are caught inside the Client, which adds a "status_code" field to the returned fetch_map to indicate whether the prediction succeeded (see the timeout example).

* Since gRPC only supports the pick_first and round_robin load-balancing policies, the ABTEST feature is not fully supported yet.

* System compatibility:
  * [x] CentOS
  * [x] macOS
  * [x] Windows

* Supported client languages:
  - Python
  - Java
  - Go

## 2. Example: linear regression prediction service

The following is a linear regression prediction example implemented over gRPC; the full code is at this [link](../python/examples/grpc_impl_example/fit_a_line).

#### Getting the data

```shell
sh get_data.sh
```

#### Starting the gRPC server

``` shell
python test_server.py uci_housing_model/
```

The default gRPC service can also be started with the single line below:

```shell
python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9393 --use_multilang
```

Note: the --use_multilang flag enables the multi-language client.

### Client-side prediction

#### Synchronous prediction

``` shell
python test_sync_client.py
```

#### Asynchronous prediction

``` shell
python test_asyn_client.py
```

#### Batch prediction

``` shell
python test_batch_client.py
```

#### Prediction timeout

``` shell
python test_timeout_client.py
```

## 3. More examples

See the example files under [`python/examples/grpc_impl_example`](../python/examples/grpc_impl_example).
# How to use the Go Client with Paddle Serving
(简体中文|[English](./Imdb_GO_Client_EN.md))
This document explains how to use Go as the client language. For the Go client of Paddle Serving, a simple client package is provided at https://github.com/PaddlePaddle/Serving/tree/develop/go/serving_client, which users can import as needed. What follows is a simple sentiment analysis example based on the IMDB dataset.
......
# How to use Go Client of Paddle Serving
([简体中文](./Imdb_GO_Client_CN.md)|English)
This document shows how to use Go as your client language. For the Go client in Paddle Serving, a simple client package is provided at https://github.com/PaddlePaddle/Serving/tree/develop/go/serving_client, which a user can import as needed. Here is a simple example of a sentiment analysis task based on the IMDB dataset.
......
# Install Paddle Serving with Docker
(简体中文|[English](./Install_EN.md))

**Strongly recommended**: build Paddle Serving **inside Docker**; see [How to run PaddleServing in Docker](Run_In_Docker_CN.md). For more images, see the [Docker image list](Docker_Images_CN.md).

**Tip-1**: This project only supports <mark>**Python 3.6/3.7/3.8**</mark>; all Python/Pip related operations below must use the matching Python version.

**Tip-2**: The GPU environments in the examples below are all cuda10.2-cudnn7. If you deploy with Python Pipeline and need Nvidia TensorRT to optimize inference performance, see [Supported image environments and notes](#4-supported-image-environments-and-notes) to choose another version.

## 1. Start the development image

<mark>**Both the Serving image and the Paddle image are supported; pick one of sections 1.1 and 1.2.**</mark>

### 1.1 Serving development image (choose CPU or GPU)

**CPU:**
```
# Start the CPU Docker container
docker pull paddlepaddle/serving:0.7.0-devel
docker run -p 9292:9292 --name test -dit paddlepaddle/serving:0.7.0-devel bash
docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
```

**GPU:**
```
# Start the GPU Docker container
docker pull paddlepaddle/serving:0.7.0-cuda10.2-cudnn7-devel
nvidia-docker run -p 9292:9292 --name test -dit paddlepaddle/serving:0.7.0-cuda10.2-cudnn7-devel bash
nvidia-docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
```

### 1.2 Paddle development image (choose CPU or GPU)

**CPU:**
```
# Start the CPU Docker container
docker pull paddlepaddle/paddle:2.2.0
docker run -p 9292:9292 --name test -dit paddlepaddle/paddle:2.2.0 bash
docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
# The Paddle development image needs the following script to add the dependencies Serving requires
bash Serving/tools/paddle_env_install.sh
```

**GPU:**
```
# Start the GPU Docker container
docker pull paddlepaddle/paddle:2.2.0-cuda10.2-cudnn7
nvidia-docker run -p 9292:9292 --name test -dit paddlepaddle/paddle:2.2.0-cuda10.2-cudnn7 bash
nvidia-docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
# The Paddle development image needs the following script to add the dependencies Serving requires
bash Serving/tools/paddle_env_install.sh
```

## 2. Install the Paddle Serving Python packages

Install the required pip dependencies
```
cd Serving
pip3 install -r python/requirements.txt
```

```shell
pip3 install paddle-serving-client==0.7.0
pip3 install paddle-serving-server==0.7.0 # CPU
pip3 install paddle-serving-app==0.7.0
pip3 install paddle-serving-server-gpu==0.7.0.post102 # GPU with CUDA10.2 + TensorRT6
# For other GPU environments, confirm your environment before choosing one of the following
pip3 install paddle-serving-server-gpu==0.7.0.post101 # GPU with CUDA10.1 + TensorRT6
pip3 install paddle-serving-server-gpu==0.7.0.post112 # GPU with CUDA11.2 + TensorRT8
```

You may need a domestic mirror (e.g. the Tsinghua mirror: add `-i https://pypi.tuna.tsinghua.edu.cn/simple` to the pip command) to speed up downloads.

If you need packages built from the develop branch, get the download address from the [latest package list](./Latest_Packages_CN.md) and install it with `pip install`. To build from source, see the [Paddle Serving compilation document](./Compile_CN.md).

The paddle-serving-server and paddle-serving-server-gpu packages support CentOS 6/7, Ubuntu 16/18 and Windows 10.

The paddle-serving-client and paddle-serving-app packages support Linux and Windows; paddle-serving-client only supports Python 3.6/3.7/3.8.

## 3. Install the Paddle Python packages

**Only needed when you use the `paddle_serving_client.convert` command or the `Python Pipeline framework`.**

```
# For CPU environments
pip3 install paddlepaddle==2.2.0
# For GPU CUDA 10.2 environments
pip3 install paddlepaddle-gpu==2.2.0
```

**Note**: If your CUDA version is not 10.2, do not run the commands above as-is; refer to [Paddle-Inference official docs - download and install the Linux inference library](https://paddleinference.paddlepaddle.org.cn/master/user_guides/download_lib.html#python) to pick the URL for the matching GPU environment and install that.

For example, a Python 3.6 user on CUDA 10.1 should pick the URL for `cp36-cp36m` and `linux-cuda10.1-cudnn7.6-trt6-gcc8.2` in the table, copy it and run

```
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.2.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.1_cudnn7.6.5_trt6.0.1.5/paddlepaddle_gpu-2.2.0.post101-cp36-cp36m-linux_x86_64.whl
```

## 4. Supported image environments and notes

| Environment | Serving dev image tag | OS | Paddle dev image tag | OS |
| :--------------------------: | :-------------------------------: | :-------------: | :-------------------: | :----------------: |
| CPU | 0.7.0-devel | Ubuntu 16.04 | 2.2.0 | Ubuntu 18.04 |
| Cuda10.1+Cudnn7 | 0.7.0-cuda10.1-cudnn7-devel | Ubuntu 16.04 | None | None |
| Cuda10.2+Cudnn7 | 0.7.0-cuda10.2-cudnn7-devel | Ubuntu 16.04 | 2.2.0-cuda10.2-cudnn7 | Ubuntu 16.04 |
| Cuda10.2+Cudnn8 | 0.7.0-cuda10.2-cudnn8-devel | Ubuntu 16.04 | None | None |
| Cuda11.2+Cudnn8 | 0.7.0-cuda11.2-cudnn8-devel | Ubuntu 16.04 | 2.2.0-cuda11.2-cudnn8 | Ubuntu 18.04 |

For **Windows 10 users**, see the [Paddle Serving guide for the Windows platform](Windows_Tutorial_CN.md).
# Install Paddle Serving with Docker
([简体中文](./Install_CN.md)|English)
**Strongly recommend** you build **Paddle Serving** in Docker, please check [How to run PaddleServing in Docker](Run_In_Docker_CN.md). For more images, please refer to [Docker Image List](Docker_Images_CN.md).
**Tip-1**: This project only supports <mark>**Python3.6/3.7/3.8**</mark>, all subsequent operations related to Python/Pip need to select the correct Python version.
**Tip-2**: The GPU environments in the following examples are all cuda10.2-cudnn7. If you use Python Pipeline to deploy and need Nvidia TensorRT to optimize prediction performance, please refer to [Supported Docker Images and Instruction](#4-supported-docker-images-and-instruction) to choose other versions.
## 1. Start the Docker Container
<mark>**Both the Serving dev image and the Paddle dev image are supported; choose one of sections 1.1 and 1.2.**</mark>
### 1.1 Serving Dev Images (choose CPU or GPU)
**CPU:**
```
# Start CPU Docker Container
docker pull paddlepaddle/serving:0.7.0-devel
docker run -p 9292:9292 --name test -dit paddlepaddle/serving:0.7.0-devel bash
docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
```
**GPU:**
```
# Start GPU Docker Container
docker pull paddlepaddle/serving:0.7.0-cuda10.2-cudnn7-devel
nvidia-docker run -p 9292:9292 --name test -dit paddlepaddle/serving:0.7.0-cuda10.2-cudnn7-devel bash
nvidia-docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
```
### 1.2 Paddle Dev Images (choose CPU or GPU)
**CPU:**
```
# Start CPU Docker Container
docker pull paddlepaddle/paddle:2.2.0
docker run -p 9292:9292 --name test -dit paddlepaddle/paddle:2.2.0 bash
docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
# Paddle dev image needs to run the following script to increase the dependencies required by Serving
bash Serving/tools/paddle_env_install.sh
```
**GPU:**
```
# Start GPU Docker
docker pull paddlepaddle/paddle:2.2.0-cuda10.2-cudnn7
nvidia-docker run -p 9292:9292 --name test -dit paddlepaddle/paddle:2.2.0-cuda10.2-cudnn7 bash
nvidia-docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
# Paddle development image needs to execute the following script to increase the dependencies required by Serving
bash Serving/tools/paddle_env_install.sh
```
## 2. Install Paddle Serving related whl Packages
Install the required pip dependencies
```
cd Serving
pip3 install -r python/requirements.txt
```
```shell
pip3 install paddle-serving-client==0.7.0
pip3 install paddle-serving-server==0.7.0 # CPU
pip3 install paddle-serving-app==0.7.0
pip3 install paddle-serving-server-gpu==0.7.0.post102 #GPU with CUDA10.2 + TensorRT6
# For other GPU environments, confirm the environment before choosing which one to execute
pip3 install paddle-serving-server-gpu==0.7.0.post101 # GPU with CUDA10.1 + TensorRT6
pip3 install paddle-serving-server-gpu==0.7.0.post112 # GPU with CUDA11.2 + TensorRT8
```
If you are in China, you may need a Chinese mirror source (such as the Tsinghua source: add `-i https://pypi.tuna.tsinghua.edu.cn/simple` to the pip command) to speed up the download.
If you need the installation packages compiled from the develop branch, get the download address from the [latest installation package list](./Latest_Packages_CN.md) and install it with the `pip install` command. If you want to compile by yourself, please refer to the [Paddle Serving Compilation Document](./Compile_CN.md).
The paddle-serving-server and paddle-serving-server-gpu installation packages support Centos 6/7, Ubuntu 16/18 and Windows 10.
The paddle-serving-client and paddle-serving-app installation packages support Linux and Windows, and paddle-serving-client only supports python3.6/3.7/3.8.
## 3. Install Paddle related Python libraries
**You only need to install it when you use the `paddle_serving_client.convert` command or the `Python Pipeline framework`.**
```
# For CPU environments
pip3 install paddlepaddle==2.2.0
# For GPU CUDA 10.2 environments
pip3 install paddlepaddle-gpu==2.2.0
```
**Note**: If your CUDA version is not 10.2, do not execute the above commands directly; refer to [Paddle-Inference official document - download and install the Linux prediction library](https://paddleinference.paddlepaddle.org.cn/master/user_guides/download_lib.html#python) and select the URL for the matching GPU environment to install.
For example, for Python 3.6 users on CUDA 10.1, please select the URL corresponding to `cp36-cp36m` and `linux-cuda10.1-cudnn7.6-trt6-gcc8.2` in the table, copy it and execute
```
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.2.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.1_cudnn7.6.5_trt6.0.1.5/paddlepaddle_gpu-2.2.0.post101-cp36-cp36m-linux_x86_64.whl
```
## 4. Supported Docker Images and Instruction
| Environment | Serving Development Image Tag | Operating System | Paddle Development Image Tag | Operating System |
| :--------------------------: | :-------------------------------: | :-------------: | :-------------------: | :----------------: |
| CPU | 0.7.0-devel | Ubuntu 16.04 | 2.2.0 | Ubuntu 18.04 |
| Cuda10.1+Cudnn7 | 0.7.0-cuda10.1-cudnn7-devel | Ubuntu 16.04 | None | None |
| Cuda10.2+Cudnn7 | 0.7.0-cuda10.2-cudnn7-devel | Ubuntu 16.04 | 2.2.0-cuda10.2-cudnn7 | Ubuntu 16.04 |
| Cuda10.2+Cudnn8 | 0.7.0-cuda10.2-cudnn8-devel | Ubuntu 16.04 | None | None |
| Cuda11.2+Cudnn8 | 0.7.0-cuda11.2-cudnn8-devel | Ubuntu 16.04 | 2.2.0-cuda11.2-cudnn8 | Ubuntu 18.04 |
For **Windows 10 users**, please refer to the document [Paddle Serving Guide for Windows Platform](Windows_Tutorial_CN.md).
# Paddle Serving Client Java SDK
(简体中文|[English](Java_SDK_EN.md))
Paddle Serving provides a Java SDK that supports running prediction on the Client side in Java; this document explains how to use the Java SDK.
......
# Paddle Serving Client Java SDK
([简体中文](Java_SDK_CN.md)|English)
Paddle Serving provides a Java SDK, which supports prediction on the Client side in the Java language. This document shows how to use the Java SDK.
......
# Notes on the Lod field
(简体中文|[English](LOD_EN.md))
## Concept
......
...@@ -41,7 +41,7 @@ https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_app-0.0.0-py3-n
```
## Baidu Kunlun users
Kunlun users on arm-xpu or x86-xpu can download the wheel packages below. Kunlun users should use the xpu-beta docker images ([DOCKER IMAGES](./Docker_Images_CN.md)).
**We only support Python 3.6 for Kunlun Users.**
### Wheel Package Links
......
## Low-Precision Deployment for Paddle Serving
(简体中文|[English](./Low_Precision_EN.md))
For low-precision deployment, Intel CPU supports int8 and bfloat16 models, and Nvidia TensorRT supports int8 and float16 models.
### Generating a low-precision model with PaddleSlim quantization
See [PaddleSlim quantization](https://paddleslim.readthedocs.io/zh_CN/latest/tutorials/quant/overview.html) for details.
### Deploying a PaddleSlim int8 quantized model with TensorRT int8
First download the Resnet50 [PaddleSlim quantized model](https://paddle-inference-dist.bj.bcebos.com/inference_demo/python/resnet50/ResNet50_quant.tar.gz) and convert it to the deployment-model format supported by Paddle Serving.
```
wget https://paddle-inference-dist.bj.bcebos.com/inference_demo/python/resnet50/ResNet50_quant.tar.gz
...@@ -40,7 +41,7 @@ fetch_map = client.predict(feed={"image": img}, fetch=["score"])
print(fetch_map["score"].reshape(-1))
```
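A rough sketch of the conversion and TensorRT int8 launch might look as follows; the directory names are hypothetical, and the `--use_trt`/`--precision` flags are assumptions about the serve command's options based on the `paddle_serving_client.convert` tool mentioned elsewhere in these docs, so verify them against your installed version:

```shell
tar xzf ResNet50_quant.tar.gz
# convert the quantized inference model into Serving's deployment format (paths hypothetical)
python3 -m paddle_serving_client.convert --dirname ResNet50_quant \
    --serving_server serving_server --serving_client serving_client
# launch with TensorRT enabled in int8 precision (flags assumed)
python3 -m paddle_serving_server.serve --model serving_server \
    --port 9393 --gpu_ids 0 --use_trt --precision int8
```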
### References
* [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim)
* Paddle Inference: deploying quantized models on Intel CPU ([doc](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_x86_cpu_int8.html))
* Paddle Inference: deploying quantized models on NV GPU ([doc](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_trt.html))
## Low-Precision Deployment for Paddle Serving
(English|[简体中文](./Low_Precision_CN.md))
Intel CPU supports int8 and bfloat16 models; NVIDIA TensorRT supports int8 and float16 models.
### Obtain the quantized model through the PaddleSlim tool
To train low-precision models, please refer to [PaddleSlim](https://paddleslim.readthedocs.io/zh_CN/latest/tutorials/quant/overview.html).
### Deploy the quantized model from PaddleSlim using Paddle Serving with Nvidia TensorRT int8 mode
Firstly, download the [Resnet50 int8 model](https://paddle-inference-dist.bj.bcebos.com/inference_demo/python/resnet50/ResNet50_quant.tar.gz) and convert it to Paddle Serving's saved model.
```
...@@ -41,7 +42,7 @@ fetch_map = client.predict(feed={"image": img}, fetch=["save_infer_model/scale_0
print(fetch_map["save_infer_model/scale_0.tmp_0"].reshape(-1))
```
### Reference
* [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim)
* [Deploy the quantized model Using Paddle Inference on Intel CPU](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_x86_cpu_int8.html)
* [Deploy the quantized model Using Paddle Inference on Nvidia GPU](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_trt.html)
# Model Zoo
([English](./Model_Zoo_EN.md)|简体中文)

This page lists the pre-trained models currently supported by Paddle Serving, with download links.

To contribute a new model to Paddle Serving, submit a PR via [pull request](https://github.com/PaddlePaddle/Serving/pulls).

Special thanks to [Paddle wholechain](https://www.paddlepaddle.org.cn/wholechain) and [PaddleHub](https://www.paddlepaddle.org.cn/hub) for some of the pre-trained models provided to Paddle Serving.

| Model | Type | Framework used in the example | Download |
| --- | --- | --- | ---- |
| resnet_v2_50_imagenet | PaddleClas | [C++ Serving](../examples/C++/PaddleClas/resnet_v2_50)</br>[Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet_V2_50) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageClassification/resnet_v2_50_imagenet.tar.gz) |
| mobilenet_v2_imagenet | PaddleClas | [C++ Serving](../examples/C++/PaddleClas/mobilenet) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageClassification/mobilenet_v2_imagenet.tar.gz) |
| resnet50_vd | PaddleClas | [C++ Serving](../examples/C++/PaddleClas/imagenet)</br>[Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet50_vd) | [.tar.gz](https://paddle-serving.bj.bcebos.com/model/ResNet50_vd.tar) |
| ResNet50_vd_KL | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet50_vd_KL) | [.tar](https://paddle-serving.bj.bcebos.com/model/ResNet50_vd_KL.tar) |
| ResNet50_vd_FPGM | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet50_vd_FPGM) | [.tar](https://paddle-serving.bj.bcebos.com/model/ResNet50_vd_FPGM.tar) |
| ResNet50_vd_PACT | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet50_vd_PACT) | [.tar](https://paddle-serving.bj.bcebos.com/model/ResNet50_vd_PACT.tar) |
| ResNeXt101_vd_64x4d | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ResNeXt101_vd_64x4d) | [.tar](https://paddle-serving.bj.bcebos.com/model/ResNeXt101_vd_64x4d.tar) |
| DarkNet53 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/DarkNet53) | [.tar](https://paddle-serving.bj.bcebos.com/model/DarkNet53.tar) |
| MobileNetV1 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/MobileNetV1) | [.tar](https://paddle-serving.bj.bcebos.com/model/MobileNetV1.tar) |
| MobileNetV2 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/MobileNetV2) | [.tar](https://paddle-serving.bj.bcebos.com/model/MobileNetV2.tar) |
| MobileNetV3_large_x1_0 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/MobileNetV3_large_x1_0) | [.tar](https://paddle-serving.bj.bcebos.com/model/MobileNetV3_large_x1_0.tar) |
| HRNet_W18_C | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/HRNet_W18_C) | [.tar](https://paddle-serving.bj.bcebos.com/model/HRNet_W18_C.tar) |
| ShuffleNetV2_x1_0 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ShuffleNetV2_x1_0) | [.tar](https://paddle-serving.bj.bcebos.com/model/ShuffleNetV2_x1_0.tar) |
| bert_chinese_L-12_H-768_A-12 | PaddleNLP | [C++ Serving](../examples/C++/PaddleNLP/bert)</br>[Pipeline Serving](../examples/Pipeline/PaddleNLP/bert) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SemanticModel/bert_chinese_L-12_H-768_A-12.tar.gz) |
| senta_bilstm | PaddleNLP | [C++ Serving](../examples/C++/PaddleNLP/senta) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SentimentAnalysis/senta_bilstm.tar.gz) |
| lac | PaddleNLP | [C++ Serving](../examples/C++/PaddleNLP/lac) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/LexicalAnalysis/lac.tar.gz) |
| transformer | PaddleNLP | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/machine_translation/transformer/deploy/serving/README.md) | [model](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/machine_translation/transformer) |
| criteo_ctr | PaddleRec | [C++ Serving](../examples/C++/PaddleRec/criteo_ctr) | [.tar.gz](https://paddle-serving.bj.bcebos.com/criteo_ctr_example/criteo_ctr_demo_model.tar.gz) |
| criteo_ctr_with_cube | PaddleRec | [C++ Serving](../examples/C++/PaddleRec/criteo_ctr_with_cube) | [.tar.gz](https://paddle-serving.bj.bcebos.com/unittest/ctr_cube_unittest.tar.gz) |
| wide&deep | PaddleRec | [C++ Serving](https://github.com/PaddlePaddle/PaddleRec/blob/release/2.1.0/doc/serving.md) | [model](https://github.com/PaddlePaddle/PaddleRec/blob/release/2.1.0/models/rank/wide_deep/README.md) |
| blazeface | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/blazeface) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ObjectDetection/blazeface.tar.gz) |
| cascade_mask_rcnn_r50_vd_fpn_ssld_2x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/cascade_rcnn) | [.tar.gz](https://paddle-serving.bj.bcebos.com/pddet_demo/cascade_mask_rcnn_r50_vd_fpn_ssld_2x_coco_serving.tar.gz) |
| yolov4 | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/yolov4) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ObjectDetection/yolov4.tar.gz) |
| faster_rcnn_hrnetv2p_w18_1x | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/faster_rcnn_hrnetv2p_w18_1x) | [.tar.gz](https://paddle-serving.bj.bcebos.com/pddet_demo/faster_rcnn_hrnetv2p_w18_1x.tar.gz) |
| fcos_dcn_r50_fpn_1x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/fcos_dcn_r50_fpn_1x_coco) | [.tar.gz](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/fcos_dcn_r50_fpn_1x_coco.tar) |
| ssd_vgg16_300_240e_voc | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/ssd_vgg16_300_240e_voc) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/ssd_vgg16_300_240e_voc.tar) |
| yolov3_darknet53_270e_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/yolov3_darknet53_270e_coco)</br>[Pipeline Serving](../examples/Pipeline/PaddleDetection/yolov3) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/yolov3_darknet53_270e_coco.tar) |
| faster_rcnn_r50_fpn_1x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/faster_rcnn_r50_fpn_1x_coco)</br>[Pipeline Serving](../examples/Pipeline/PaddleDetection/faster_rcnn) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/faster_rcnn_r50_fpn_1x_coco.tar) |
| ppyolo_r50vd_dcn_1x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/ppyolo_r50vd_dcn_1x_coco) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/ppyolo_r50vd_dcn_1x_coco.tar) |
| ppyolo_mbv3_large_coco | PaddleDetection | [Pipeline Serving](../examples/Pipeline/PaddleDetection/ppyolo_mbv3) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/ppyolo_mbv3_large_coco.tar) |
| ttfnet_darknet53_1x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/ttfnet_darknet53_1x_coco) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/ttfnet_darknet53_1x_coco.tar) |
| YOLOv3-DarkNet | PaddleDetection | [C++ Serving](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.3/deploy/serving) | [.pdparams](https://paddledet.bj.bcebos.com/models/yolov3_darknet53_270e_coco.pdparams)</br>[.yml](https://github.com/PaddlePaddle/PaddleDetection/blob/develop/configs/yolov3/yolov3_darknet53_270e_coco.yml) |
| ocr_rec | PaddleOCR | [C++ Serving](../examples/C++/PaddleOCR/ocr)</br>[Pipeline Serving](../examples/Pipeline/PaddleOCR/ocr) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/OCR/ocr_rec.tar.gz) |
| ocr_det | PaddleOCR | [C++ Serving](../examples/C++/PaddleOCR/ocr)</br>[Pipeline Serving](../examples/Pipeline/PaddleOCR/ocr) | [.tar.gz](https://paddle-serving.bj.bcebos.com/ocr/ocr_det.tar.gz) |
| ch_ppocr_mobile_v2.0_det | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_det_infer.tar)</br>[.yml](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/configs/det/ch_ppocr_v2.0/ch_det_mv3_db_v2.0.yml) |
| ch_ppocr_server_v2.0_det | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_server_v2.0_det_infer.tar)</br>[.yml](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/configs/det/ch_ppocr_v2.0/ch_det_res18_db_v2.0.yml) |
| ch_ppocr_mobile_v2.0_rec | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_rec_infer.tar)</br>[.yml](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/configs/rec/ch_ppocr_v2.0/rec_chinese_lite_train_v2.0.yml) |
| ch_ppocr_server_v2.0_rec | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_server_v2.0_rec_infer.tar)</br>[.yml](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/configs/rec/ch_ppocr_v2.0/rec_chinese_common_train_v2.0.yml) |
| ch_ppocr_mobile_v2.0 | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://github.com/PaddlePaddle/PaddleOCR) |
| ch_ppocr_server_v2.0 | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://github.com/PaddlePaddle/PaddleOCR) |
| deeplabv3 | PaddleSeg | [C++ Serving](../examples/C++/PaddleSeg/deeplabv3) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageSegmentation/deeplabv3.tar.gz) |
| unet | PaddleSeg | [C++ Serving](../examples/C++/PaddleSeg/unet_for_image_seg) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageSegmentation/unet.tar.gz) |

- See [example](../examples) for details
- For more models deployable with Paddle Serving, see [wholechain](https://www.paddlepaddle.org.cn/wholechain)
- For the latest models, refer to
- [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PaddleClas](https://github.com/PaddlePaddle/PaddleClas)
- [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP)
- [PaddleRec](https://github.com/PaddlePaddle/PaddleRec)
- [PaddleSeg](https://github.com/PaddlePaddle/PaddleSeg)
- [PaddleGAN](https://github.com/PaddlePaddle/PaddleGAN)
# Model Zoo
(English|[简体中文](./Model_Zoo_CN.md))
This page lists model archives that are pre-trained and pre-packaged, ready to be served for inference with PaddleServing.
To propose a model for inclusion, please submit a [pull request](https://github.com/PaddlePaddle/Serving/pulls).
Special thanks to [Paddle wholechain](https://www.paddlepaddle.org.cn/wholechain) and [PaddleHub](https://www.paddlepaddle.org.cn/hub), whose Model Zoo and Model Examples were used in generating these model archives.
| Model | Type | Framework | Download |
| --- | --- | --- | ---- |
| resnet_v2_50_imagenet | PaddleClas | [C++ Serving](../examples/C++/PaddleClas/resnet_v2_50)</br>[Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet_V2_50) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageClassification/resnet_v2_50_imagenet.tar.gz) |
| mobilenet_v2_imagenet | PaddleClas | [C++ Serving](../examples/C++/PaddleClas/mobilenet) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageClassification/mobilenet_v2_imagenet.tar.gz) |
| resnet50_vd | PaddleClas | [C++ Serving](../examples/C++/PaddleClas/imagenet)</br>[Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet50_vd) | [.tar.gz](https://paddle-serving.bj.bcebos.com/model/ResNet50_vd.tar) |
| ResNet50_vd_KL | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet50_vd_KL) | [.tar](https://paddle-serving.bj.bcebos.com/model/ResNet50_vd_KL.tar) |
| ResNet50_vd_FPGM | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet50_vd_FPGM) | [.tar](https://paddle-serving.bj.bcebos.com/model/ResNet50_vd_FPGM.tar) |
| ResNet50_vd_PACT | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet50_vd_PACT) | [.tar](https://paddle-serving.bj.bcebos.com/model/ResNet50_vd_PACT.tar) |
| ResNeXt101_vd_64x4d | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ResNeXt101_vd_64x4d) | [.tar](https://paddle-serving.bj.bcebos.com/model/ResNeXt101_vd_64x4d.tar) |
| DarkNet53 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/DarkNet53) | [.tar](https://paddle-serving.bj.bcebos.com/model/DarkNet53.tar) |
| MobileNetV1 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/MobileNetV1) | [.tar](https://paddle-serving.bj.bcebos.com/model/MobileNetV1.tar) |
| MobileNetV2 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/MobileNetV2) | [.tar](https://paddle-serving.bj.bcebos.com/model/MobileNetV2.tar) |
| MobileNetV3_large_x1_0 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/MobileNetV3_large_x1_0) | [.tar](https://paddle-serving.bj.bcebos.com/model/MobileNetV3_large_x1_0.tar) |
| HRNet_W18_C | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/HRNet_W18_C) | [.tar](https://paddle-serving.bj.bcebos.com/model/HRNet_W18_C.tar) |
| ShuffleNetV2_x1_0 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ShuffleNetV2_x1_0) | [.tar](https://paddle-serving.bj.bcebos.com/model/ShuffleNetV2_x1_0.tar) |
| bert_chinese_L-12_H-768_A-12 | PaddleNLP | [C++ Serving](../examples/C++/PaddleNLP/bert)</br>[Pipeline Serving](../examples/Pipeline/PaddleNLP/bert) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SemanticModel/bert_chinese_L-12_H-768_A-12.tar.gz) |
| senta_bilstm | PaddleNLP | [C++ Serving](../examples/C++/PaddleNLP/senta) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SentimentAnalysis/senta_bilstm.tar.gz) |
| lac | PaddleNLP | [C++ Serving](../examples/C++/PaddleNLP/lac) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/LexicalAnalysis/lac.tar.gz) |
| transformer | PaddleNLP | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/machine_translation/transformer/deploy/serving/README.md) | [model](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/machine_translation/transformer) |
| criteo_ctr | PaddleRec | [C++ Serving](../examples/C++/PaddleRec/criteo_ctr) | [.tar.gz](https://paddle-serving.bj.bcebos.com/criteo_ctr_example/criteo_ctr_demo_model.tar.gz) |
| criteo_ctr_with_cube | PaddleRec | [C++ Serving](../examples/C++/PaddleRec/criteo_ctr_with_cube) | [.tar.gz](https://paddle-serving.bj.bcebos.com/unittest/ctr_cube_unittest.tar.gz) |
| wide&deep | PaddleRec | [C++ Serving](https://github.com/PaddlePaddle/PaddleRec/blob/release/2.1.0/doc/serving.md) | [model](https://github.com/PaddlePaddle/PaddleRec/blob/release/2.1.0/models/rank/wide_deep/README.md) |
| blazeface | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/blazeface) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ObjectDetection/blazeface.tar.gz) |
| cascade_mask_rcnn_r50_vd_fpn_ssld_2x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/cascade_rcnn) | [.tar.gz](https://paddle-serving.bj.bcebos.com/pddet_demo/cascade_mask_rcnn_r50_vd_fpn_ssld_2x_coco_serving.tar.gz) |
| yolov4 | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/yolov4) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ObjectDetection/yolov4.tar.gz) |
| faster_rcnn_hrnetv2p_w18_1x | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/faster_rcnn_hrnetv2p_w18_1x) | [.tar.gz](https://paddle-serving.bj.bcebos.com/pddet_demo/faster_rcnn_hrnetv2p_w18_1x.tar.gz) |
| fcos_dcn_r50_fpn_1x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/fcos_dcn_r50_fpn_1x_coco) | [.tar.gz](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/fcos_dcn_r50_fpn_1x_coco.tar) |
| ssd_vgg16_300_240e_voc | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/ssd_vgg16_300_240e_voc) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/ssd_vgg16_300_240e_voc.tar) |
| yolov3_darknet53_270e_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/yolov3_darknet53_270e_coco)</br>[Pipeline Serving](../examples/Pipeline/PaddleDetection/yolov3) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/yolov3_darknet53_270e_coco.tar) |
| faster_rcnn_r50_fpn_1x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/faster_rcnn_r50_fpn_1x_coco)</br>[Pipeline Serving](../examples/Pipeline/PaddleDetection/faster_rcnn) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/faster_rcnn_r50_fpn_1x_coco.tar) |
| ppyolo_r50vd_dcn_1x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/ppyolo_r50vd_dcn_1x_coco) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/ppyolo_r50vd_dcn_1x_coco.tar) |
| ppyolo_mbv3_large_coco | PaddleDetection | [Pipeline Serving](../examples/Pipeline/PaddleDetection/ppyolo_mbv3) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/ppyolo_mbv3_large_coco.tar) |
| ttfnet_darknet53_1x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/ttfnet_darknet53_1x_coco) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/ttfnet_darknet53_1x_coco.tar) |
| YOLOv3-DarkNet | PaddleDetection | [C++ Serving](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.3/deploy/serving) | [.pdparams](https://paddledet.bj.bcebos.com/models/yolov3_darknet53_270e_coco.pdparams)</br>[.yml](https://github.com/PaddlePaddle/PaddleDetection/blob/develop/configs/yolov3/yolov3_darknet53_270e_coco.yml) |
| ocr_rec | PaddleOCR | [C++ Serving](../examples/C++/PaddleOCR/ocr)</br>[Pipeline Serving](../examples/Pipeline/PaddleOCR/ocr) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/OCR/ocr_rec.tar.gz) |
| ocr_det | PaddleOCR | [C++ Serving](../examples/C++/PaddleOCR/ocr)</br>[Pipeline Serving](../examples/Pipeline/PaddleOCR/ocr) | [.tar.gz](https://paddle-serving.bj.bcebos.com/ocr/ocr_det.tar.gz) |
| ch_ppocr_mobile_v2.0_det | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_det_infer.tar)</br>[.yml](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/configs/det/ch_ppocr_v2.0/ch_det_mv3_db_v2.0.yml) |
| ch_ppocr_server_v2.0_det | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_server_v2.0_det_infer.tar)</br>[.yml](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/configs/det/ch_ppocr_v2.0/ch_det_res18_db_v2.0.yml) |
| ch_ppocr_mobile_v2.0_rec | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_rec_infer.tar)</br>[.yml](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/configs/rec/ch_ppocr_v2.0/rec_chinese_lite_train_v2.0.yml) |
| ch_ppocr_server_v2.0_rec | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_server_v2.0_rec_infer.tar)</br>[.yml](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/configs/rec/ch_ppocr_v2.0/rec_chinese_common_train_v2.0.yml) |
| ch_ppocr_mobile_v2.0 | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://github.com/PaddlePaddle/PaddleOCR) |
| ch_ppocr_server_v2.0 | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://github.com/PaddlePaddle/PaddleOCR) |
| deeplabv3 | PaddleSeg | [C++ Serving](../examples/C++/PaddleSeg/deeplabv3) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageSegmentation/deeplabv3.tar.gz) |
| unet | PaddleSeg | [C++ Serving](../examples/C++/PaddleSeg/unet_for_image_seg) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageSegmentation/unet.tar.gz) |
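As a rough sketch of how one row of the table above is typically used (the name of the extracted model folder is an assumption; check the contents of the tarball you actually download):

```shell
# fetch one of the archives listed above
wget https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageClassification/resnet_v2_50_imagenet.tar.gz
tar xzf resnet_v2_50_imagenet.tar.gz
# serve the extracted server-side model folder (folder name assumed)
python3 -m paddle_serving_server.serve --model resnet_v2_50_serving_model --port 9393
```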
- Refer to [example](../examples) for more details on the above models.
- Refer to [wholechain](https://www.paddlepaddle.org.cn/wholechain) for more pre-trained models supported by Paddle Serving.
- For the latest models, refer to
- [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PaddleClas](https://github.com/PaddlePaddle/PaddleClas)
- [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP)
- [PaddleRec](https://github.com/PaddlePaddle/PaddleRec)
- [PaddleSeg](https://github.com/PaddlePaddle/PaddleSeg)
- [PaddleGAN](https://github.com/PaddlePaddle/PaddleGAN)
# Performance Optimization
([简体中文](./PERFORMANCE_OPTIM_CN.md)|English)
Because model structures differ, prediction services consume different amounts of computing resources when performing predictions. For online prediction services, a model that requires little computation spends a larger share of its time on communication; such services are called communication-intensive. A model that requires more computation spends more of its time on inference; such services are called computation-intensive.
For a prediction service, the easiest way to determine the type of service is to look at the time ratio. Paddle Serving provides [Timeline tool](../python/examples/util/README_CN.md), which can intuitively display the time spent in each stage of the prediction service.
For communication-intensive prediction services, requests can be aggregated, and within a limit that can tolerate delay, multiple prediction requests can be combined into a batch for prediction.
For computation-intensive prediction services, you can use GPU prediction services instead of CPU prediction services, or increase the number of graphics cards for GPU prediction services.
Under the same conditions, the communication time of the HTTP prediction service provided by Paddle Serving is longer than that of the RPC prediction service, so for communication-intensive services, please give priority to using RPC communication.
Parameters for performance optimization:
The memory/GPU-memory optimization option is enabled by default in Paddle Serving; it reduces memory/GPU-memory usage and usually does not affect performance. If you need to turn it off, use --mem_optim_off on the command line.
ir_optim can optimize the computation graph and increase the inference speed. It is turned off by default and is turned on by --ir_optim on the command line.
| Parameters | Type | Default | Description |
| ---------- | ---- | ------- | ------------------------------------------------------------ |
| mem_optim_off | - | - | Disable memory/GPU-memory optimization |
| ir_optim | - | - | Enable analysis and optimization of the computation graph, including OP fusion, etc. |
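In the command-line launch mode the two switches are passed as flags; a minimal sketch, reusing the uci_housing model from the examples in this document:

```shell
# graph optimization on, memory/GPU-memory optimization off
python3 -m paddle_serving_server.serve --model uci_housing_model --port 9393 \
    --ir_optim --mem_optim_off
```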
For services started from Python code, the APIs for the above two parameters are as follows:
RPC Service
```
from paddle_serving_server import Server
server = Server()
...
server.set_memory_optimize(mem_optim)
server.set_ir_optimize(ir_optim)
...
```
HTTP Service
```
from paddle_serving_server import WebService
class NewService(WebService):
...
new_service = NewService(name="new")
...
new_service.prepare_server(mem_optim=True, ir_optim=False)
...
```
# Performance Optimization
(简体中文|[English](./PERFORMANCE_OPTIM.md))
Because model structures differ, prediction services consume different amounts of computing resources. For online prediction services, a model that needs little computation spends a larger share of time on communication (a communication-intensive service), while a model that needs more computation spends more time on inference (a computation-intensive service). The two types can be optimized in different ways according to actual needs.
The simplest way to decide which type a prediction service is, is to look at the time breakdown. Paddle Serving provides a [Timeline tool](../python/examples/util/README_CN.md) that visualizes the time spent in each stage of the prediction service.
For communication-intensive prediction services, requests can be aggregated: within a tolerable latency budget, merge multiple prediction requests into one batch for prediction.
For computation-intensive prediction services, use a GPU prediction service instead of a CPU one, or increase the number of GPU cards of the GPU service.
Under the same conditions, the HTTP prediction service provided by Paddle Serving has longer communication time than the RPC one, so prefer RPC for communication-intensive services.
Parameters related to performance optimization:
Paddle Serving enables the memory/GPU-memory optimization option by default, which reduces memory/GPU-memory usage and normally does not affect performance; to turn it off, use --mem_optim_off when starting from the command line.
ir_optim optimizes the computation graph and speeds up inference; it is off by default and is enabled with --ir_optim on the command line.
| Parameter | Type | Default | Meaning |
| --------- | ---- | ------ | -------------------------------- |
| mem_optim_off | - | - | Disable memory/GPU-memory optimization |
| ir_optim | - | - | Enable computation-graph analysis and optimization, including OP fusion, etc. |
For services started from Python code, the APIs for these two parameters are as follows:
RPC service
```
from paddle_serving_server import Server
server = Server()
...
server.set_memory_optimize(mem_optim)
server.set_ir_optimize(ir_optim)
...
```
HTTP service
```
from paddle_serving_server import WebService
class NewService(WebService):
...
new_service = NewService(name="new")
...
new_service.prepare_server(mem_optim=True, ir_optim=False)
...
```
# Pipeline Serving Performance Optimization
([English](./Performance_Tuning_EN.md)|简体中文)
## 1. Performance analysis and optimization
### 1.1 How to optimize with the Timeline tool
To make performance optimization easier, PipelineServing provides a Timeline tool that timestamps each stage of the whole service.
### 1.2 Output profile information on the server side
The server side is controlled by the `use_profile` field in the yaml:
```yaml
dag:
    use_profile: true
```
Once this feature is enabled, the server prints the corresponding log information to standard output during prediction. To show the per-stage time consumption more intuitively, an Analyst module is provided for further analysis and processing of the log files.
First save the server output to a file, say `profile.txt`; the script below converts the timestamp information in the log into json format and saves it to a `trace` file, which can be visualized with the Chrome browser's tracing feature.
```python
from paddle_serving_server.pipeline import Analyst
import json
import sys
if __name__ == "__main__":
    log_filename = "profile.txt"
    trace_filename = "trace"
    analyst = Analyst(log_filename)
    analyst.save_trace(trace_filename)
```
Concretely: open the Chrome browser, enter `chrome://tracing/` in the address bar to reach the tracing page, click the load button, and open the saved `trace` file to visualize the per-stage timing of the prediction service.
### 1.3 Output profile information on the client side
Set `profile=True` in the client's `predict` interface to enable profiling.
Once this feature is enabled, the client prints the log information for that prediction to standard output during prediction; further analysis and processing are the same as on the server side.
### 1.4 Analysis method
Using the per-stage costs in the pipeline.tracer log, apply the formulas below step by step to work out which stage dominates the cost.
```
cost of a single OP:
    op_cost = process(pre + mid + post)
expected OP concurrency:
    op_concurrency = op_cost(s) * expected_QPS
service throughput:
    service_throughput = 1 / cost_of_slowest_op * concurrency
service average latency:
    service_avg_cost = ∑op_concurrency  (critical path)
channel accumulation:
    channel_acc_size = QPS(down - up) * time
average cost of batch prediction:
    avg_batch_cost = (N * pre + mid + post) / N
```
### 1.5 优化思路
根据长耗时在不同阶段,采用不同的优化方法.
- OP推理阶段(mid-process):
- 增加OP并发度
- 开启auto-batching(前提是多个请求的shape一致)
- 若批量数据中某条数据的shape很大,padding很大导致推理很慢,可使用mini-batch
- 开启TensorRT/MKL-DNN优化
- 开启低精度推理
- OP前处理阶段(pre-process):
- 增加OP并发度
- 优化前处理逻辑
- in/out耗时长(channel堆积>5)
- 检查channel传递的数据大小和延迟
- 优化传入数据,不传递数据或压缩后再传入
- 增加OP并发度
- 减少上游OP并发度
# Pipeline Serving Performance Optimization
(English|[简体中文](./Performance_Tuning_CN.md))
## 1. Performance analysis and optimization
### 1.1 How to optimize with the timeline tool
In order to better optimize performance, Pipeline Serving provides a Timeline tool that timestamps each stage of the whole service.
### 1.2 Output profile information on server side
Profiling on the server side is controlled by the `use_profile` field in the yaml file:
```yaml
dag:
    use_profile: true
```
Once this is enabled, the server prints the corresponding log information to standard output during prediction. To show the time consumed by each stage more intuitively, an Analyst module is provided for further analysis and processing of the log files.
First save the output of the server to a file, taking `profile.txt` as an example. The script converts the time monitoring information in the log into JSON format and saves it to the `trace` file. The `trace` file can be visualized through the tracing function of the Chrome browser.
```python
from paddle_serving_server.pipeline import Analyst
import json
import sys

if __name__ == "__main__":
    log_filename = "profile.txt"
    trace_filename = "trace"
    analyst = Analyst(log_filename)
    analyst.save_trace(trace_filename)
```
To view it, open the Chrome browser, enter `chrome://tracing/` in the address bar to reach the tracing page, click the load button, and open the saved `trace` file to visualize the timing of each stage of the prediction service.
### 1.3 Output profile information on client side
The profile function can be enabled by setting `profile=True` in the `predict` interface on the client side.
After the function is enabled, the client will print the log information corresponding to the prediction to the standard output during the prediction process, and the subsequent analysis and processing are the same as that of the server.
### 1.4 Analytical methods
Based on the per-stage time consumption in the pipeline.tracer log, use the following formulas to work out step by step which stage takes the most time.
```
cost of a single OP:
    op_cost = process(pre + mid + post)
expected OP concurrency:
    op_concurrency = op_cost(s) * qps_expected
service throughput:
    service_throughput = 1 / slowest_op_cost * op_concurrency
service average latency:
    service_avg_cost = ∑op_cost (over the critical path)
channel accumulation:
    channel_acc_size = QPS(down - up) * time
average cost of batch prediction:
    avg_batch_cost = (N * pre + mid + post) / N
```
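As a quick sanity check of how these formulas combine, here is a small worked example in Python; all the numbers (5/10/5 ms stage costs, 500 expected QPS, batches of 8) are made up purely for illustration.

```python
# Hypothetical numbers, purely to illustrate the formulas above.
pre, mid, post = 0.005, 0.010, 0.005      # per-stage cost of one OP, in seconds

op_cost = pre + mid + post                 # 0.02 s per request for this OP
expected_qps = 500
op_concurrency = op_cost * expected_qps    # 0.02 * 500 = 10 concurrent workers

slowest_op_cost = 0.02                     # assume this OP is the slowest one
service_throughput = op_concurrency / slowest_op_cost   # 10 / 0.02 = 500 QPS

# Batching amortizes mid/post over N samples while preprocessing stays per-sample:
N = 8
avg_batch_cost = (N * pre + mid + post) / N             # 0.055 / 8 ≈ 0.0069 s

print(op_concurrency, service_throughput, avg_batch_cost)
```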
### 1.5 Optimization ideas
According to which stage the long latency falls in, adopt different optimization methods.

- OP inference stage (mid-process):
  - Increase `concurrency`
  - Turn on `auto-batching` (ensure that the shapes of multiple requests are consistent)
  - Use `mini-batch` if one sample in the batch has a very large shape and padding slows down inference (see the sketch after this list)
  - Turn on TensorRT for GPU
  - Turn on MKL-DNN for CPU
  - Turn on low-precision inference
- OP preprocess or postprocess stage:
  - Increase `concurrency`
  - Optimize processing logic
- In/Out stage (channel accumulation > 5):
  - Check the size and delay of the data passed through the channel
  - Optimize the incoming data: do not pass the data, or compress it before passing it in
  - Increase `concurrency`
  - Decrease upstream `concurrency`
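The mini-batch idea above amounts to splitting an oversized batch into fixed-size chunks inside the OP before calling the predictor. The sketch below is a minimal illustration, not the framework's own implementation; `infer` stands for whatever inference callable the OP uses and is a hypothetical name.

```python
import numpy as np

def infer_in_mini_batches(infer, batch, mini_batch_size=8):
    # Run `infer` on fixed-size chunks of `batch` and concatenate the results.
    # This avoids padding every sample in a large batch up to the size of its
    # largest member, which is what makes oversized batches slow.
    outputs = []
    for start in range(0, len(batch), mini_batch_size):
        outputs.append(infer(batch[start:start + mini_batch_size]))
    return np.concatenate(outputs, axis=0)
```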
# Pipeline Serving
(简体中文|[English](Pipeline_Design_EN.md))

- [Architecture Design](Pipeline_Design_CN.md#1架构设计)
- [Detailed Design](Pipeline_Design_CN.md#2详细设计)
- [Classic Examples](Pipeline_Design_CN.md#3典型示例)
- [Advanced Usages](Pipeline_Design_CN.md#4高阶用法)
- [Log Tracing](Pipeline_Design_CN.md#5日志追踪)
- [Performance Analysis and Optimization](Pipeline_Design_CN.md#6性能分析与优化)

In many deep learning frameworks, Serving is usually used for one-click deployment of a single model. In the context of industrial-scale AI, end-to-end deep learning models cannot solve all problems at present; combining multiple deep learning models is still the usual way to solve real-world problems. However, multi-model applications are complicated to design. To reduce development and maintenance difficulty while guaranteeing service availability, serial or simple parallel approaches are usually adopted, but in such cases the throughput only reaches a usable level and GPU utilization is low.
...@@ -316,14 +316,14 @@ class ResponseOp(Op):
***
## 3. Classic Examples

All Pipeline examples are under the [examples/Pipeline/](../../examples/Pipeline) directory. There are currently 7 types of model examples:
- [PaddleClas](../../examples/Pipeline/PaddleClas)
- [Detection](../../examples/Pipeline/PaddleDetection)
- [bert](../../examples/Pipeline/bert)
- [imagenet](../../examples/Pipeline/PaddleClas/imagenet)
- [imdb_model_ensemble](../../examples/Pipeline/imdb_model_ensemble)
- [ocr](../../examples/Pipeline/PaddleOCR/ocr)
- [simple_web_service](../../examples/Pipeline/simple_web_service)

Taking imdb_model_ensemble as an example of how to use Pipeline Serving: the relevant code can be found in the `python/examples/pipeline/imdb_model_ensemble` folder, and the server-side structure of the example is shown in the following figure:
......
# Pipeline Serving
([简体中文](Pipeline_Design_CN.md)|English)

- [Architecture Design](Pipeline_Design_EN.md#1architecture-design)
- [Detailed Design](Pipeline_Design_EN.md#2detailed-design)
- [Classic Examples](Pipeline_Design_EN.md#3classic-examples)
- [Advanced Usages](Pipeline_Design_EN.md#4advanced-usages)
- [Log Tracing](Pipeline_Design_EN.md#5log-tracing)
- [Performance Analysis And Optimization](Pipeline_Design_EN.md#6performance-analysis-and-optimization)

In many deep learning frameworks, Serving is usually used for the deployment of a single model. But in the context of industrial AI, an end-to-end deep learning model cannot solve all problems at present; multiple deep learning models usually have to be combined to solve practical problems. However, the design of multi-model applications is complicated. In order to reduce the difficulty of development and maintenance while ensuring service availability, serial or simple parallel methods are usually used; in general, the throughput then only reaches a usable state and GPU utilization is low.
...@@ -311,14 +311,14 @@ The default implementation of **pack_response_package** is to convert the dictio
## 3.Classic Examples

All Pipeline examples are in the [examples/Pipeline/](../../examples/Pipeline) directory. There are currently 7 types of model examples:
- [PaddleClas](../../examples/Pipeline/PaddleClas)
- [Detection](../../examples/Pipeline/PaddleDetection)
- [bert](../../examples/Pipeline/bert)
- [imagenet](../../examples/Pipeline/PaddleClas/imagenet)
- [imdb_model_ensemble](../../examples/Pipeline/imdb_model_ensemble)
- [ocr](../../examples/Pipeline/PaddleOCR/ocr)
- [simple_web_service](../../examples/Pipeline/simple_web_service)

Here we build a simple imdb model ensemble example to show how to use Pipeline Serving. The relevant code can be found in the `python/examples/pipeline/imdb_model_ensemble` folder. The server-side structure of the example is shown in the following figure:
......
## Paddle Serving Quick Start Examples
([English](./Quick_Start_EN.md)|简体中文)

This quick start example is mainly prepared for users who already have a model to deploy, and we also provide a model that can be used directly. If you want to know how to go through the whole process from offline training to online serving, please refer to the AIStudio tutorial mentioned earlier.

<h3 align="center">Boston House Price Prediction</h3>

Enter the Serving git directory and change to the `fit_a_line` example:
``` shell
cd Serving/examples/C++/fit_a_line
sh get_data.sh
```
Paddle Serving provides HTTP- and RPC-based services for users.

<h3 align="center">RPC Service</h3>

A user can also start an RPC service with `paddle_serving_server.serve`. Although some development based on Paddle Serving's Python client API is required, the RPC service is usually faster than the HTTP service. Note that we do not specify `--name` here.
``` shell
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292
```
<center>
| Argument | Type | Default | Description |
| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
| `thread` | int | `2` | Number of brpc service threads |
| `op_num` | int[]| `0` | Thread number for each model in asynchronous mode |
| `op_max_batch` | int[]| `32` | Batch number for each model in asynchronous mode |
| `gpu_ids` | str[]| `"-1"` | Gpu card id for each model |
| `port` | int | `9292` | Exposed port of current service to users |
| `model` | str[]| `""` | Path of paddle model directory to be served |
| `mem_optim_off` | - | - | Disable memory / graphic memory optimization |
| `ir_optim` | bool | False | Enable analysis and optimization of calculation graph |
| `use_mkl` (Only for cpu version) | - | - | Run inference with MKL |
| `use_trt` (Only for trt version) | - | - | Run inference with TensorRT |
| `use_lite` (Only for Intel x86 CPU or ARM CPU) | - | - | Run PaddleLite inference |
| `use_xpu` | - | - | Run PaddleLite inference with Baidu Kunlun XPU |
| `precision` | str | FP32 | Precision Mode, support FP32, FP16, INT8 |
| `use_calib` | bool | False | Use TRT int8 calibration |
| `gpu_multi_stream` | bool | False | Enable GPU multi-stream to get larger QPS |
#### Notes on asynchronous mode

Asynchronous mode is suitable when (1) the number of requests is very large, or (2) multiple models are chained and you want to specify the concurrency of each model separately.

Asynchronous mode helps improve service throughput (QPS), but the latency of a single request increases slightly.

In asynchronous mode, each model starts the N threads you specify, and each thread holds one model instance; in other words, each model is equivalent to a thread pool with N threads that takes tasks from the pool's task queue and executes them.

In asynchronous mode, each RPC server thread is only responsible for putting requests into the task queue of a model's thread pool, and for taking completed tasks out of the queue after they are executed.

In the table above, --thread 10 specifies the number of RPC server threads, whose default value is 2. --op_num specifies the number of threads N in each model's thread pool; the default value is 0, which means asynchronous mode is not used.

--op_max_batch specifies the batch size of each model; the default value is 32. This parameter only takes effect when --op_num is not 0.

#### When one of your models is to be deployed on multiple GPU cards:
```
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --gpu_ids 0,1,2
```
#### When one service deploys two models:
```
python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292
```
#### When one service contains two models and each model is deployed on multiple GPU cards:
```
python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 --gpu_ids 0,1 1,2
```
#### When one service contains two models, each model uses multiple GPU cards, and asynchronous mode needs a different concurrency per model:
```
python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 --gpu_ids 0,1 1,2 --op_num 4 8
```
</center>
``` python
# A user can visit the RPC service through the paddle_serving_client API
from paddle_serving_client import Client
import numpy as np

client = Client()
client.load_client_config("uci_housing_client/serving_client_conf.prototxt")
client.connect(["127.0.0.1:9292"])
data = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727,
        -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]
fetch_map = client.predict(feed={"x": np.array(data).reshape(1,13,1)}, fetch=["price"])
print(fetch_map)
```
Here, the `client.predict` function has two arguments. `feed` is a Python dict mapping model input variable alias names to values. `fetch` lists the prediction variables to be returned from the server. In this example, the names `"x"` and `"price"` were assigned when the servable model was saved during training.
<h3 align="center">HTTP Service</h3>

Users can also put the data-format processing logic on the server side, so that the service can be accessed directly with curl. See the following case in the directory `Serving/examples/C++/fit_a_line`.
```
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --name uci
```
Client request:
```
curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"x": [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]}], "fetch":["price"]}' http://127.0.0.1:9292/uci/prediction
```
The response is:
```
{"result":{"price":[[18.901151657104492]]}}
```
<h3 align="center">Pipeline Service</h3>

Paddle Serving provides industry-leading multi-model pipeline services, which strongly support the real business scenarios of major companies. See the [OCR text recognition example](../examples/Pipeline/PaddleOCR/ocr/) in the directory `examples/Pipeline/PaddleOCR/ocr`.

First, download the two models:
```
python3 -m paddle_serving_app.package --get_model ocr_rec
tar -xzvf ocr_rec.tar.gz
python3 -m paddle_serving_app.package --get_model ocr_det
tar -xzvf ocr_det.tar.gz
```
Then start the server program, serving the two chained models as one service:
```
python3 web_service.py
```
Finally, send a request over HTTP:
```
python3 pipeline_http_client.py
```
RPC is also supported:
```
python3 pipeline_rpc_client.py
```
Output:
```
{'err_no': 0, 'err_msg': '', 'key': ['res'], 'value': ["['土地整治与土壤修复研究中心', '华南农业大学1素图']"]}
```
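For reference, `pipeline_rpc_client.py` boils down to a call through `PipelineClient`. A minimal sketch follows; it assumes the service exposes its RPC port at 18090 (the rpc_port used in the example's config.yaml) and that `imgs/1.jpg` is a local test image.

```python
import base64
from paddle_serving_server.pipeline import PipelineClient

client = PipelineClient()
client.connect(['127.0.0.1:18090'])  # rpc_port from the example's config.yaml

# The OCR pipeline expects a base64-encoded image under the key "image".
with open("imgs/1.jpg", "rb") as f:
    image = base64.b64encode(f.read()).decode("utf8")

ret = client.predict(feed_dict={"image": image}, fetch=["res"])
print(ret)
```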
## Paddle Serving Quick Start Examples
(English|[简体中文](./Quick_Start_CN.md))
This quick start example is mainly for those users who already have a model to deploy, and we also provide a model that can be used for deployment. If you want to know how to complete the whole process from offline training to online serving, please refer to the AIStudio tutorial above.
### Boston House Price Prediction model
Enter the Serving git directory and change to the `fit_a_line` example:
``` shell
cd Serving/examples/C++/fit_a_line
sh get_data.sh
```
Paddle Serving provides HTTP- and RPC-based services for users to access.

### RPC service

A user can also start an RPC service with `paddle_serving_server.serve`. The RPC service is usually faster than the HTTP service, although the user needs to do some coding based on Paddle Serving's Python client API. Note that we do not specify `--name` here.
``` shell
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292
```
<center>
| Argument | Type | Default | Description |
| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
| `thread` | int | `4` | Concurrency of current service |
| `port` | int | `9292` | Exposed port of current service to users |
| `model` | str | `""` | Path of paddle model directory to be served |
| `mem_optim_off` | - | - | Disable memory / graphic memory optimization |
| `ir_optim` | bool | False | Enable analysis and optimization of calculation graph |
| `use_mkl` (Only for cpu version) | - | - | Run inference with MKL |
| `use_trt` (Only for trt version) | - | - | Run inference with TensorRT |
| `use_lite` (Only for Intel x86 CPU or ARM CPU) | - | - | Run PaddleLite inference |
| `use_xpu` | - | - | Run PaddleLite inference with Baidu Kunlun XPU |
| `precision` | str | FP32 | Precision Mode, support FP32, FP16, INT8 |
| `use_calib` | bool | False | Only for deployment with TensorRT |
</center>
```python
# A user can visit rpc service through paddle_serving_client API
from paddle_serving_client import Client
import numpy as np
client = Client()
client.load_client_config("uci_housing_client/serving_client_conf.prototxt")
client.connect(["127.0.0.1:9292"])
data = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727,
-0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]
fetch_map = client.predict(feed={"x": np.array(data).reshape(1,13,1)}, fetch=["price"])
print(fetch_map)
```
Here, the `client.predict` function has two arguments. `feed` is a Python dict mapping model input variable alias names to values. `fetch` lists the prediction variables to be returned from the server. In this example, the names `"x"` and `"price"` were assigned when the servable model was saved during training.
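The returned `fetch_map` maps each fetched alias name to a numpy array, so the scalar prediction can be read back out like this (the exact number depends on the model):

```python
# fetch_map["price"] is a numpy array of shape (1, 1) in this example.
price = float(fetch_map["price"][0][0])
print(price)  # e.g. 18.901151657104492
```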
### WEB service
Users can also put the data format processing logic on the server side, so that they can directly use curl to access the service. Refer to the following case in `Serving/examples/C++/fit_a_line`:
```
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --name uci
```
For the client side:
```
curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"x": [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]}], "fetch":["price"]}' http://127.0.0.1:9292/uci/prediction
```
The response is:
```
{"result":{"price":[[18.901151657104492]]}}
```
<h3 align="center">Pipeline Service</h3>
Paddle Serving provides industry-leading multi-model pipeline services, which strongly support the real business scenarios of major companies. Please refer to [OCR word recognition](../examples/Pipeline/PaddleOCR/ocr/).

First, download the two models:
```
python3 -m paddle_serving_app.package --get_model ocr_rec
tar -xzvf ocr_rec.tar.gz
python3 -m paddle_serving_app.package --get_model ocr_det
tar -xzvf ocr_det.tar.gz
```
Then start the server side, launching the two chained models as one standalone web service:
```
python3 web_service.py
```
HTTP request:
```
python3 pipeline_http_client.py
```
gRPC request:
```
python3 pipeline_rpc_client.py
```
Output:
```
{'err_no': 0, 'err_msg': '', 'key': ['res'], 'value': ["['土地整治与土壤修复研究中心', '华南农业大学1素图']"]}
```
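`pipeline_http_client.py` essentially does the following. The sketch below uses the `requests` library and assumes the web service is named `ocr` and listens on HTTP port 9999, as in the example's config.yaml.

```python
import base64
import json
import requests

url = "http://127.0.0.1:9999/ocr/prediction"
with open("imgs/1.jpg", "rb") as f:
    image = base64.b64encode(f.read()).decode("utf8")

# Pipeline HTTP services take parallel "key" and "value" lists.
data = {"key": ["image"], "value": [image]}
resp = requests.post(url=url, data=json.dumps(data))
print(resp.json())
```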
# How to Run PaddleServing in Docker
(简体中文|[English](Run_In_Docker_EN.md))

One of the biggest benefits of Docker is portability: it can be deployed on many operating systems and mainstream cloud computing platforms. The Paddle Serving Docker image can be deployed on Linux, Mac and Windows.
...@@ -14,7 +14,7 @@ Docker (the GPU version requires nvidia-docker on a GPU machine)

### Get the image

Refer to [this document](Docker_Images_CN.md) to get the image.

Take the CPU compilation image as an example:
...@@ -59,9 +59,9 @@ docker exec -it test bash

### Install PaddleServing

Follow the instructions on the homepage to download the pip packages of the corresponding version. [Latest packages](Latest_Packages_CN.md)

## Notes

- The runtime image cannot be used for development or compilation. If you want to compile from source, see [How to compile PaddleServing](Compile_CN.md).
- Because the Cuda10 and Cuda9 environments are limited by the GCC version, they cannot also run the CPU version of `paddle_serving_server`. If you want to use the CPU version of `paddle_serving_server` in a GPU environment at the same time, please choose the Cuda10.1, Cuda10.2 or Cuda11 images.
# How to run PaddleServing in Docker
([简体中文](Run_In_Docker_CN.md)|English)

One of the biggest benefits of Docker is portability: it can be deployed on multiple operating systems and mainstream cloud computing platforms. The Paddle Serving Docker image can be deployed on Linux, Mac and Windows platforms.
...@@ -14,7 +14,7 @@ This document takes Python2 as an example to show how to run Paddle Serving in d

### Get docker image

Refer to [this document](Docker_Images_EN.md) for a docker image:

```shell
docker pull registry.baidubce.com/paddlepaddle/serving:latest-devel
```
...@@ -41,7 +41,7 @@ The GPU version is basically the same as the CPU version, with only some differe

### Get docker image

Refer to [this document](Docker_Images_EN.md) for a docker image; the following is an example of a `cuda10.2-cudnn8` image:

```shell
docker pull registry.baidubce.com/paddlepaddle/serving:latest-cuda10.2-cudnn8-devel
```
...@@ -67,9 +67,9 @@ The `-p` option is to map the `9292` port of the container to the `9292` port of

The image comes with `paddle_serving_server_gpu`, `paddle_serving_client`, and `paddle_serving_app` corresponding to the image tag version. If users don't need to change the version, they can use it directly, which is suitable for environments without external network access.

If you need to change the version, please refer to the instructions on the homepage to download the pip package of the corresponding version. [LATEST_PACKAGES](./Latest_Packages_CN.md)

## Precautions

- Runtime images cannot be used for compilation. If you want to compile from source, refer to [COMPILE](Compile_EN.md).
- If you use Cuda9 and Cuda10 docker images, you cannot use the `paddle_serving_server` CPU version at the same time, due to the limitation of the gcc version. If you want to use both in one docker image, please choose images of Cuda10.1, Cuda10.2 and Cuda11.
...@@ -47,12 +47,12 @@ bash tools/generate_runtime_docker.sh --help

#### Pipeline mode:

For pipeline mode, we need to make sure that the model, the program files, the configuration files and other dependencies can all run inside the image, so we can store our executable files under `/home/project`. Here we take `Serving/examples/Pipeline/PaddleOCR/ocr` as an example; this is an OCR text recognition task.

```bash
# Assume you already have a Serving runtime image named paddle_serving:cuda10.2-py36
docker run --rm -dit --name pipeline_serving_demo paddle_serving:cuda10.2-py36 bash
cd Serving/examples/Pipeline/PaddleOCR/ocr
# get models
python -m paddle_serving_app.package --get_model ocr_rec
tar -xzvf ocr_rec.tar.gz
```
...@@ -80,12 +80,12 @@ python3.6 web_service.py

#### WebService mode:

The web service mode is essentially similar to the pipeline mode, so we take `Serving/examples/C++/PaddleNLP/bert` as an example.

```bash
# Assume you already have a Serving runtime image named registry.baidubce.com/paddlepaddle/serving:0.6.0-cuda10.2-py36
docker run --rm -dit --name webservice_serving_demo registry.baidubce.com/paddlepaddle/serving:0.6.0-cpu-py36 bash
cd Serving/examples/C++/PaddleNLP/bert
### download model
wget https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SemanticModel/bert_chinese_L-12_H-768_A-12.tar.gz
tar -xzf bert_chinese_L-12_H-768_A-12.tar.gz
```
......
## Paddle Serving Deployment on Baidu Kunlun Chips
(简体中文|[English](./Run_On_XPU_EN.md))

Paddle Serving supports inference deployment on Baidu Kunlun chips. It currently supports deployment on ARM servers with Kunlun chips (such as Phytium FT-2000+/64) or on Intel CPU servers with Kunlun chips; deployment support for other heterogeneous hardware servers will be improved in the future.

## Compilation and Installation
The basic environment can be configured by referring to [this document](Compile_CN.md). The following uses a Phytium FT-2000+/64 machine as an example.

### Compilation
* Compile the server part
```
cd Serving
...@@ -50,23 +50,23 @@ cmake -DPYTHON_INCLUDE_DIR=/usr/include/python3.7m/ \
make -j10
```
### Install the wheel packages
After the compilation steps above, a whl package is generated under $build_dir/python/dist of each build directory; install each of them. For example, the server step generates a whl package under server-build-arm/python/dist, which can be installed with ```pip install -U xxx.whl```.

## Request Parameters
To deploy an arm+xpu service and use the Paddle-Lite acceleration capability, use the following parameters when launching the service.

| Parameter | Description | Notes |
| :------- | :-------------------------- | :--------------------------------------------------------------- |
| use_lite | Use the Paddle-Lite engine | Uses the Paddle-Lite CPU inference capability |
| use_xpu | Use Baidu Kunlun for inference | Must be used together with use_lite |
| ir_optim | Enable Paddle-Lite computation subgraph optimization | See [Paddle-Lite](https://github.com/PaddlePaddle/Paddle-Lite) for details |

## Deployment Example
### Download the model
```
wget --no-check-certificate https://paddle-serving.bj.bcebos.com/uci_housing.tar.gz
tar -xzf uci_housing.tar.gz
```
### Start the RPC service
There are three main launch configurations:
* deploy with cpu+xpu, using the Paddle-Lite xpu acceleration capability;
* deploy on cpu alone, using the Paddle-Lite acceleration capability;
...@@ -86,7 +86,7 @@ python3 -m paddle_serving_server.serve --model uci_housing_model --thread 6 --po
```
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 6 --port 9292
```
### Client invocation
```
from paddle_serving_client import Client
import numpy as np
...@@ -98,16 +98,17 @@ data = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727,
fetch_map = client.predict(feed={"x": np.array(data).reshape(1,13,1)}, fetch=["price"])
print(fetch_map)
```
## Other Notes
### Model examples
Some examples are provided below; other models can be adapted in the same way.

| Example | Link |
| :--------- | :---------------------------------------------------------- |
| fit_a_line | [fit_a_line_xpu](../examples/C++/xpu/fit_a_line_xpu) |
| resnet | [resnet_v2_50_xpu](../examples/C++/xpu/resnet_v2_50_xpu) |

Note: the list of models supported on Kunlun chips is available [here](https://paddlelite.paddlepaddle.org.cn/introduction/support_model_list.html). Adaptation differs between models and some cases may be unsupported. If you run into problems when deploying, please file a [Github issue](https://github.com/PaddlePaddle/Serving/issues) and we will follow up promptly.

### References for Kunlun chip support
* [Running PaddlePaddle on Kunlun XPU chips](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/xpu_docs/index_cn.html)
* [Deploying with PaddleLite on Baidu XPU](https://paddlelite.paddlepaddle.org.cn/demo_guides/baidu_xpu.html)
## Paddle Serving Using Baidu Kunlun Chips
(English|[简体中文](./Run_On_XPU_CN.md))

Paddle Serving supports deployment using Baidu Kunlun chips. Currently, it supports deployment on an ARM CPU server with Baidu Kunlun chips (such as Phytium FT-2000+/64), or an Intel CPU server with Baidu Kunlun chips. We will improve the deployment capability on various heterogeneous hardware servers in the future.

## Compilation and installation
Refer to the [compile](./Compile_EN.md) document to set up the compilation environment. The following is based on the Phytium FT-2000+/64 platform.

### Compilation
* Compile the Serving Server
```
cd Serving
...@@ -52,11 +53,11 @@ cmake -DPYTHON_INCLUDE_DIR=/usr/include/python3.7m/ \
make -j10
```
### Install the wheel package
After the compilation stages above, the whl package will be generated in ```python/dist/``` under the specific temporary directories. For example, after the Server compilation step, the whl package will be produced under the server-build-arm/python/dist directory, and you can run ```pip install -U python/dist/*.whl``` to install the package.

## Request parameters description
In order to deploy the serving service on an arm server with Baidu Kunlun xpu chips and use the acceleration capability of Paddle-Lite, please specify the following parameters during deployment.

| param | param description | about |
| :------- | :-------------------------- | :------------------------------------------------ |
| use_lite | using Paddle-Lite Engine | use the inference capability of Paddle-Lite |
| use_xpu | using Baidu Kunlun for inference | need to be used with the use_lite option |
| ir_optim | open the graph optimization | refer to [Paddle-Lite](https://github.com/PaddlePaddle/Paddle-Lite) |

## Deployment examples
### Download the model
```
wget --no-check-certificate https://paddle-serving.bj.bcebos.com/uci_housing.tar.gz
tar -xzf uci_housing.tar.gz
```
### Start RPC service
There are mainly three deployment methods:
* deploy on the cpu server with Baidu xpu, using the acceleration capability of Paddle-Lite and xpu;
* deploy on the cpu server standalone with Paddle-Lite;
...@@ -90,7 +91,7 @@ Start the rpc service, deploying on cpu server.
```
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 6 --port 9292
```
### Client invocation
```
from paddle_serving_client import Client
import numpy as np
...@@ -102,17 +103,17 @@ data = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727,
fetch_map = client.predict(feed={"x": np.array(data).reshape(1,13,1)}, fetch=["price"])
print(fetch_map)
```
## Others
### Model example and explanation
Some examples are provided below; other models can be modified with reference to these examples.

| sample name | sample links |
| :---------- | :---------------------------------------------------------- |
| fit_a_line | [fit_a_line_xpu](../examples/C++/xpu/fit_a_line_xpu) |
| resnet | [resnet_v2_50_xpu](../examples/C++/xpu/resnet_v2_50_xpu) |

Note: supported model lists refer to [doc](https://paddlelite.paddlepaddle.org.cn/introduction/support_model_list.html). There are differences in the adaptation of different models, and there may be some unsupported cases. If you have any problem, please submit a [Github issue](https://github.com/PaddlePaddle/Serving/issues), and we will follow up in real time.

### Kunlun chip related reference materials
* [PaddlePaddle on Baidu Kunlun xpu chips](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/xpu_docs/index_cn.html)
* [Deployment on Baidu Kunlun xpu chips using PaddleLite](https://paddlelite.paddlepaddle.org.cn/demo_guides/baidu_xpu.html)
# Serving Side Configuration

Paddle Serving's configuration files are protobuf files in plain-text format. Every field of a configuration file must be defined beforehand in the corresponding .proto definitions under the configure/proto/ directory before protobuf can read and parse it.

All server-side configuration is defined in the configure/proto/server_configure.proto file.

## 1. service.prototxt

The entry point of the server-side service configuration is service.prototxt, which configures the list of services mounted by a Paddle Serving instance. Its protobuf format is the `InferServiceConf` type in `configure/server_configure.protobuf` (the concrete file path on disk can be changed with the --inferservice_path and --inferservice_file command-line options). Example:
```JSON
port: 8010
services {
name: "ImageClassifyService"
workflows: "workflow1"
}
```
Fields:
- port: the listening port of this serving instance; the default is 8010. It can also be specified with the --port=8010 command-line argument.
- services: multiple services can be configured. Paddle Serving is designed so that a single serving instance can host multiple prediction services at the same time, distinguished by service name. For example, the following configures two prediction services:
```JSON
port: 8010
services {
name: "ImageClassifyService"
workflows: "workflow1"
}
services {
name: "BuiltinEchoService"
workflows: "workflow2"
}
```
- service.name: use the service name from the serving/proto/xx.proto file. For example, in serving/proto/image_class.proto the service name is declared as follows:
```JSON
service ImageClassifyService {
rpc inference(Request) returns (Response);
rpc debug(Request) returns (Response);
option (pds.options).generate_impl = true;
};
```
so the service name is `ImageClassifyService`.
- service.workflows: the list of workflows configured for this service; multiple workflows can be configured. In this example, one workflow, `workflow1`, is configured for `ImageClassifyService`; the concrete definition of `workflow1` is in workflow.prototxt.

## 2. workflow.prototxt

workflow.prototxt describes each concrete workflow. Its protobuf format is the `Workflow` type in `configure/server_configure.protobuf`. The concrete file path on disk can be specified with `--workflow_path` and `--workflow_file`. An example:
```JSON
workflows {
name: "workflow1"
workflow_type: "Sequence"
nodes {
name: "image_reader_op"
type: "ReaderOp"
}
nodes {
name: "image_classify_op"
type: "ClassifyOp"
dependencies {
name: "image_reader_op"
mode: "RO"
}
}
nodes {
name: "write_json_op"
type: "WriteJsonOp"
dependencies {
name: "image_classify_op"
mode: "RO"
}
}
}
workflows {
name: "workflow2"
workflow_type: "Sequence"
nodes {
name: "echo_op"
type: "CommonEchoOp"
}
}
```
The example above configures two workflows: `workflow1` and `workflow2`. Taking `workflow1` as an example:
- name: the workflow name, used to index from service.prototxt to the concrete workflow
- workflow_type: one of "Sequence" or "Parallel", indicating whether the OPs represented by the nodes in this workflow can run in parallel. **Currently only the Sequence type is supported; a workflow configured as Parallel will not be executed**
- nodes: all nodes chained into the workflow; multiple nodes can be configured, linked together through their dependencies
- node.name: arbitrary; it is recommended to use a name that represents the OP class the node executes
- node.type: the class name of the OP executed by this node, corresponding to the name of a concrete OP class under serving/op/
- node.dependencies: the list of upstream nodes this node depends on
- node.dependencies.name: must match the name of a node inside the workflow
- node.dependencies.mode: RO - Read Only, RW - Read Write
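Since these files are plain-text protobuf, they can also be inspected programmatically. The sketch below parses a workflow.prototxt with `google.protobuf.text_format`; it assumes a Python module compiled from configure/proto/server_configure.proto is available under the hypothetical name `server_configure_pb2`.

```python
from google.protobuf import text_format
import server_configure_pb2  # hypothetical: compiled from configure/proto/server_configure.proto

conf = server_configure_pb2.WorkflowConf()
with open("workflow.prototxt") as f:
    text_format.Parse(f.read(), conf)

for wf in conf.workflows:
    print(wf.name, wf.workflow_type, [node.name for node in wf.nodes])
# Expected output for the example above:
# workflow1 Sequence ['image_reader_op', 'image_classify_op', 'write_json_op']
# workflow2 Sequence ['echo_op']
```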
## 3. resource.prototxt

The entry point of the server-side resource configuration is resource.prototxt, used to configure model information. Its protobuf format is `ResourceConf` in `configure/proto/server_configure.proto`. The concrete file path on disk can be specified with `--resource_path` and `--resource_file`. Example:
```JSON
model_toolkit_path: "./conf"
model_toolkit_file: "model_toolkit.prototxt"
cube_config_file: "./conf/cube.conf"
```
Fields:
- model_toolkit_path: the directory containing model_toolkit.prototxt
- model_toolkit_file: the file name of model_toolkit.prototxt
- cube_config_file: the path and file name of the cube configuration file

Cube is the component in Paddle Serving for serving large-scale sparse parameters.

## 4. model_toolkit.prototxt

model_toolkit.prototxt configures the model information and the inference engine to use. Its protobuf format is ModelToolkitConf in `configure/proto/server_configure.proto`. The disk path of model_toolkit.prototxt cannot be overridden by command-line arguments. Example:
```JSON
engines {
name: "image_classification_resnet"
type: "FLUID_CPU_NATIVE_DIR"
reloadable_meta: "./data/model/paddle/fluid_time_file"
reloadable_type: "timestamp_ne"
model_data_path: "./data/model/paddle/fluid/SE_ResNeXt50_32x4d"
runtime_thread_num: 0
batch_infer_size: 0
enable_batch_align: 0
sparse_param_service_type: LOCAL
sparse_param_service_table_name: "local_kv"
enable_memory_optimization: true
static_optimization: false
force_update_static_cache: false
}
```
Fields:
- name: the model name. InferManager uses this name to find the model and inference engine to use; see the parameters of the InferManager::instance().infer() method in serving/op/classify_op.h and serving/op/classify_op.cpp.
- type: the type of the inference engine. The list of currently registered inference engines can be found in inferencer-fluid-cpu/src/fluid_cpu_engine.cpp

|Inference engine|Meaning|
|--------|----|
|FLUID_CPU_ANALYSIS|use the fluid Analysis API; all model parameters are saved in one file|
|FLUID_CPU_ANALYSIS_DIR|use the fluid Analysis API; model parameters are saved as separate files, and the whole model is placed in one directory|
|FLUID_CPU_NATIVE|use the fluid Native API; all model parameters are saved in one file|
|FLUID_CPU_NATIVE_DIR|use the fluid Native API; model parameters are saved as separate files, and the whole model is placed in one directory|
|FLUID_GPU_ANALYSIS|GPU inference with the fluid Analysis API; all model parameters are saved in one file|
|FLUID_GPU_ANALYSIS_DIR|GPU inference with the fluid Analysis API; model parameters are saved as separate files, and the whole model is placed in one directory|
|FLUID_GPU_NATIVE|GPU inference with the fluid Native API; all model parameters are saved in one file|
|FLUID_GPU_NATIVE_DIR|GPU inference with the fluid Native API; model parameters are saved as separate files, and the whole model is placed in one directory|

**Difference between the fluid Analysis API and the fluid Native API**

During model loading, the Analysis API applies multiple optimizations to the model's computation logic, including but not limited to zero-copy tensors and fusion of adjacent OPs. **These optimizations do not necessarily speed up every model, and can sometimes even backfire; rely on measured results.**
- reloadable_meta: the actual content is currently meaningless; the file's mtime is used to decide whether the reload time threshold has been exceeded
- reloadable_type: the reload condition to check: timestamp_ne/timestamp_gt/md5sum/revision/none

|reloadable_type|Meaning|
|---------------|----|
|timestamp_ne|the mtime of the file specified by reloadable_meta has changed|
|timestamp_gt|the mtime of the file specified by reloadable_meta is greater than or equal to the mtime recorded at the last check|
|md5sum|currently unused; if configured, the model is never reloaded|
|revision|currently unused; if configured, the model is not reloaded|

- model_data_path: the model file path
- runtime_thread_num: if greater than 0, the bsf multi-threaded scheduling framework is enabled, starting multi-threaded inference inside each prediction bthread worker. Note that when in-worker multi-threaded inference is enabled, the OPs in the workflow must use the Serving framework's BatchTensor class as the inference input and output (predictor/framework/infer_data.h, `class BatchTensor`).
- batch_infer_size: the batch size of each prediction thread when bsf multi-threaded inference is enabled
- enable_batch_align:
- sparse_param_service_type: enum, optional; the type of the large-scale sparse parameter service

|sparse_param_service_type|Meaning|
|-------------------------|--|
|NONE|do not use the large-scale sparse parameter service|
|LOCAL|local single-machine large-scale sparse parameter service, with rocksdb as the engine|
|REMOTE|distributed large-scale sparse parameter service, with Cube as the engine|

- sparse_param_service_table_name: optional; the name of the table that holds this model's parameters in the large-scale sparse parameter service.
- enable_memory_optimization: bool, optional; whether to enable memory optimization. Only meaningful when using the fluid Analysis inference API. Note that during GPU inference, graphics memory optimization is performed.
- static_optimization: bool; whether to perform static optimization. Only meaningful when memory optimization is enabled.
- force_update_static_cache: bool; whether to force an update of the static optimization cache. Only meaningful when memory optimization is enabled.
## 5. Command-line configuration parameters

The following is the list of gflags options supported on the serving side, with their default values.

| name | Default | Meaning |
|------|--------|------|
|workflow_path|./conf|directory of the workflow configuration|
|workflow_file|workflow.prototxt|file name of the workflow configuration|
|inferservice_path|./conf|directory of the service configuration|
|inferservice_file|service.prototxt|file name of the service configuration|
|resource_path|./conf|directory of the resource manager|
|resource_file|resource.prototxt|file name of the resource manager|
|reload_interval_s|10|reload thread interval (s)|
|enable_model_toolkit|true|model management|
|enable_protocol_list|baidu_std|list of brpc communication protocols|
|log_dir|./log|log dir|
|num_threads||number of system threads used by the brpc server; defaults to the number of CPU cores|
|port|8010|listening port on which the Serving process receives requests|
|gpuid|0|GPU device id used by the Serving process for GPU inference; only one GPU card can be bound|
|bthread_concurrency|9|concurrency of the underlying BRPC bthreads; can be used to limit the number of concurrent workers when using a GPU inference engine|
|bthread_min_concurrency|4|min concurrency of the underlying BRPC bthreads; can be used together with bthread_concurrency to limit the number of concurrent workers when using a GPU inference engine|
Defaults can be overridden in serving/conf/gflags.conf, for example
```
--log_dir=./serving_log/
```
redirects the log directory to ./serving_log.

### 5.1 gflags.conf

Command-line configuration parameters can also be written to a configuration file, whose default path is `conf/gflags.conf`. If `conf/gflags.conf` exists, the serving side tries to parse the gflags commands in it, for example
```shell
--enable_model_toolkit
--port=8011
```
The following command specifies an alternative command-line parameter configuration file:
```shell
bin/serving --g=true --flagfile=conf/gflags.conf.new
```
# How to save a model for Paddle Serving?
(简体中文|[English](./Save_EN.md))

## Export from saved model files

If you have already saved the model to be used for inference with Paddle's `save_inference_model` API, you can convert it with Paddle Serving's built-in `paddle_serving_client.convert` module.
......
# How to save a servable model of Paddle Serving?
([简体中文](./Save_CN.md)|English)

## Export from saved model files
......
...@@ -63,7 +63,7 @@ ee59a3dd4806 registry.baidubce.com/serving_dev/serving-runtime:cpu-py36

### Step 1: Start the Serving service

We again take the [UCI house price prediction](../examples/C++/fit_a_line/) service as an example. The image building process is omitted here; for details, see [Deploying Paddle Serving on a Kubernetes cluster](./Run_On_Kubernetes_CN.md).

Here we directly run:
```
......
# Serving Configuration
(简体中文|[English](./Serving_Configure_EN.md))

## Overview

This document introduces the configuration of C++ Serving and Python Pipeline:

- [Model configuration file](#model-configuration-file): generated automatically when converting a model; describes the model's input and output information
- [C++ Serving](#c-serving): for high-performance scenarios; covers quick start and custom configuration
- [Python Pipeline](#python-pipeline): for multi-model combination scenarios

## Model configuration file

Before introducing the server configuration, a word about the model configuration file. When a model is converted into a Paddle Serving model, the corresponding serving_client_conf.prototxt and serving_server_conf.prototxt files are generated. Their contents are identical: the parameter information of the model's inputs and outputs, which makes it easy for users to assemble request parameters. The file is used by both the server and the client and does not need to be modified by users. For the conversion method, see [How to save a model for Paddle Serving](./Save_CN.md). For the protobuf format, see `core/configure/proto/general_model_config.proto`.

Example:
```
feed_var {
name: "x"
alias_name: "x"
is_lod_tensor: false
feed_type: 1
shape: 13
}
fetch_var {
name: "concat_1.tmp_0"
alias_name: "concat_1.tmp_0"
is_lod_tensor: false
fetch_type: 1
shape: 3
shape: 640
shape: 640
}
```
Fields:
- feed_var: model input
- fetch_var: model output
- name: name
- alias_name: alias, corresponding to the name
- is_lod_tensor: whether it is a LoD tensor; see [LoD field description](./LOD_CN.md)
- feed_type: data type

|feed_type|Type|
|---------|----|
|0|INT64|
|1|FLOAT32|
|2|INT32|
|3|FP64|
|4|INT16|
|5|FP16|
|6|BF16|
|7|UINT8|
|8|INT8|

- shape: tensor shape
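To illustrate how this file is used on the client side, the sketch below builds a feed matching the sample configuration above (feed_type 1 means FLOAT32, and the declared shape is 13 with a batch dimension added in front). The client config path and server address are assumptions made for the sketch.

```python
import numpy as np
from paddle_serving_client import Client

client = Client()
client.load_client_config("serving_client_conf.prototxt")  # assumed path
client.connect(["127.0.0.1:9393"])                          # assumed address

# feed_type 1 -> FLOAT32; declared shape is 13, with the batch dimension in front.
x = np.random.rand(1, 13).astype("float32")
fetch_map = client.predict(feed={"x": x}, fetch=["concat_1.tmp_0"])
```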
## C++ Serving
### 1. Quick start

A service can be started quickly by configuring the model and the port number. The startup command is:
```BASH
python3 -m paddle_serving_server.serve --model serving_model --port 9393
```
This command automatically generates the configuration files and starts C++ Serving with them. For example, the command above generates a workdir_9393 directory with the following structure:
```
workdir_9393
├── general_infer_0
│   ├── fluid_time_file
│   ├── general_model.prototxt
│   └── model_toolkit.prototxt
├── infer_service.prototxt
├── resource.prototxt
└── workflow.prototxt
```
More startup flags are listed in the table below:
| Argument | Type | Default | Description |
| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
| `thread` | int | `2` | Number of brpc service threads |
| `op_num` | int[]| `0` | Thread number for each model in asynchronous mode |
| `op_max_batch` | int[]| `32` | Batch number for each model in asynchronous mode |
| `gpu_ids` | str[]| `"-1"` | Gpu card id for each model |
| `port` | int | `9292` | Exposed port of current service to users |
| `model` | str[]| `""` | Path of paddle model directory to be served |
| `mem_optim_off` | - | - | Disable memory / graphics memory optimization |
| `ir_optim` | bool | False | Enable analysis and optimization of calculation graph |
| `use_mkl` (Only for cpu version) | - | - | Run inference with MKL |
| `use_trt` (Only for trt version) | - | - | Run inference with TensorRT. Needs to be enabled together with ir_optim. |
| `use_lite` (Only for Intel x86 CPU or ARM CPU) | - | - | Run PaddleLite inference. Needs to be enabled together with ir_optim. |
| `use_xpu` | - | - | Run PaddleLite inference with Baidu Kunlun XPU. Needs to be enabled together with ir_optim. |
| `precision` | str | FP32 | Precision Mode, support FP32, FP16, INT8 |
| `use_calib` | bool | False | Use TRT int8 calibration |
| `gpu_multi_stream` | bool | False | Enable GPU multi-stream to get larger QPS |
#### Deploying one model on multiple GPU cards:
```BASH
python3 -m paddle_serving_server.serve --model serving_model --thread 10 --port 9292 --gpu_ids 0,1,2
```
#### Deploying one service containing two models:
```BASH
python3 -m paddle_serving_server.serve --model serving_model_1 serving_model_2 --thread 10 --port 9292
```
### 2. Custom configuration

In most cases the automatically generated configuration covers common scenarios. For special scenarios, users can also define the configuration files themselves, namely service.prototxt, workflow.prototxt, resource.prototxt, model_toolkit.prototxt and proj.conf. The startup command is:
```BASH
/bin/serving --flagfile=proj.conf
```
#### 2.1 proj.conf
proj.conf passes in the service parameters and specifies the paths of the other related configuration files. If a parameter is passed more than once, the last value takes effect.
```
# for paddle inference
--precision=fp32
--use_calib=False
--reload_interval_s=10
# for brpc
--max_concurrency=0
--num_threads=10
--bthread_concurrency=10
--max_body_size=536870912
# default path
--inferservice_path=conf
--inferservice_file=infer_service.prototxt
--resource_path=conf
--resource_file=resource.prototxt
--workflow_path=conf
--workflow_file=workflow.prototxt
```
The description and default value of each parameter are listed in the table below:
| name | Default | Description |
|------|--------|------|
|precision|"fp32"|Precision Mode, support FP32, FP16, INT8|
|use_calib|False|Only for deployment with TensorRT|
|reload_interval_s|10|Reload interval|
|max_concurrency|0|Limit of request processing in parallel, 0: unlimited|
|num_threads|10|Number of brpc service thread|
|bthread_concurrency|10|Number of bthread|
|max_body_size|536870912|Max size of brpc message|
|inferservice_path|"conf"|Path of inferservice conf|
|inferservice_file|"infer_service.prototxt"|Filename of inferservice conf|
|resource_path|"conf"|Path of resource conf|
|resource_file|"resource.prototxt"|Filename of resource conf|
|workflow_path|"conf"|Path of workflow conf|
|workflow_file|"workflow.prototxt"|Filename of workflow conf|
#### 2.2 service.prototxt
service.prototxt configures the list of services mounted by the Paddle Serving instance. The load path is specified with `--inferservice_path` and `--inferservice_file`. For the protobuf format, see `InferServiceConf` in `core/configure/server_configure.protobuf`. Example:
```
port: 8010
services {
name: "GeneralModelService"
workflows: "workflow1"
}
```
Fields:
- port: the port number the Serving instance listens on.
- services: use the default configuration and do not modify it. name specifies the service name; the concrete definition of workflow1 is in workflow.prototxt.
#### 2.3 workflow.prototxt
workflow.prototxt describes the concrete workflow. The load path is specified with `--workflow_path` and `--workflow_file`. For the protobuf format, see the `Workflow` type in `configure/server_configure.protobuf`.

In the following example, the workflow consists of three OPs: GeneralReaderOp reads the input data, GeneralInferOp depends on GeneralReaderOp and runs inference, and GeneralResponseOp returns the prediction result:
```
workflows {
name: "workflow1"
workflow_type: "Sequence"
nodes {
name: "general_reader_0"
type: "GeneralReaderOp"
}
nodes {
name: "general_infer_0"
type: "GeneralInferOp"
dependencies {
name: "general_reader_0"
mode: "RO"
}
}
nodes {
name: "general_response_0"
type: "GeneralResponseOp"
dependencies {
name: "general_infer_0"
mode: "RO"
}
}
}
```
Fields:
- name: the workflow name, used to index from service.prototxt to the concrete workflow
- workflow_type: only "Sequence" is supported
- nodes: all nodes chained into the workflow; multiple nodes can be configured, linked together through their dependencies
- node.name: corresponds one-to-one with node.type; see `python/paddle_serving_server/dag.py`
- node.type: the class name of the OP executed by this node, corresponding to the name of a concrete OP class under serving/op/
- node.dependencies: the list of upstream nodes this node depends on
- node.dependencies.name: must match the name of a node inside the workflow
- node.dependencies.mode: RO - Read Only, RW - Read Write
#### 2.4 resource.prototxt
resource.prototxt specifies the model configuration files. The load path is specified with `--resource_path` and `--resource_file`. For the protobuf format, see `ResourceConf` in `core/configure/proto/server_configure.proto`. Example:
```
model_toolkit_path: "conf"
model_toolkit_file: "general_infer_0/model_toolkit.prototxt"
general_model_path: "conf"
general_model_file: "general_infer_0/general_model.prototxt"
```
Fields:
- model_toolkit_path: the directory containing model_toolkit.prototxt
- model_toolkit_file: the file name of model_toolkit.prototxt
- general_model_path: the directory containing general_model.prototxt
- general_model_file: the file name of general_model.prototxt
#### 2.5 model_toolkit.prototxt
model_toolkit.prototxt configures the model information and the inference engine. For the protobuf format, see ModelToolkitConf in `core/configure/proto/server_configure.proto`. The disk path of model_toolkit.prototxt cannot be overridden by command-line arguments. Example:
```
engines {
name: "general_infer_0"
type: "PADDLE_INFER"
reloadable_meta: "uci_housing_model/fluid_time_file"
reloadable_type: "timestamp_ne"
model_dir: "uci_housing_model"
gpu_ids: -1
enable_memory_optimization: true
enable_ir_optimization: false
use_trt: false
use_lite: false
use_xpu: false
use_gpu: false
combined_model: false
gpu_multi_stream: false
runtime_thread_num: 0
batch_infer_size: 32
enable_overrun: false
allow_split_request: true
}
```
Fields:
- name: the engine name, corresponding to node.name in workflow.prototxt and to the name of its directory
- type: the type of the inference engine; currently only "PADDLE_INFER" is supported
- reloadable_meta: the actual content is currently meaningless; the file's mtime is used to decide whether the reload time threshold has been exceeded
- reloadable_type: the reload condition to check: timestamp_ne/timestamp_gt/md5sum/revision/none

|reloadable_type|Meaning|
|---------------|----|
|timestamp_ne|the mtime of the file specified by reloadable_meta has changed|
|timestamp_gt|the mtime of the file specified by reloadable_meta is greater than or equal to the mtime recorded at the last check|
|md5sum|currently unused; if configured, the model is never reloaded|
|revision|currently unused; if configured, the model is not reloaded|

- model_dir: the model file path
- gpu_ids: GPU device ids used by the engine at runtime; multiple ids can be specified, e.g.:
```
# use GPU cards 0, 1 and 2
gpu_ids: 0
gpu_ids: 1
gpu_ids: 2
```
- enable_memory_optimization: whether to enable memory optimization
- enable_ir_optimization: whether to enable IR optimization
- use_trt: whether to enable TensorRT; requires use_gpu to be enabled as well
- use_lite: whether to enable PaddleLite
- use_xpu: whether to use Kunlun XPU
- use_gpu: whether to use GPU
- combined_model: whether to use a combined model file
- gpu_multi_stream: whether to enable GPU multi-stream mode
- runtime_thread_num: if greater than 0, Async mode is enabled and the corresponding number of predictor instances is created
- batch_infer_size: the maximum batch size in Async mode
- enable_overrun: in Async mode, always put the whole task into the task queue
- allow_split_request: in Async mode, allow splitting tasks
#### 2.6 general_model.prototxt
The content of general_model.prototxt is identical to the model configuration serving_server_conf.prototxt; it describes the parameter information of the model's inputs and outputs. Example:
```
feed_var {
name: "x"
alias_name: "x"
is_lod_tensor: false
feed_type: 1
shape: 13
}
fetch_var {
name: "fc_0.tmp_1"
alias_name: "price"
is_lod_tensor: false
fetch_type: 1
shape: 1
}
```
## Python Pipeline
Python Pipeline provides a user-friendly programming framework for composing multiple models into one service.

Its configuration file is in YAML format, named config.yaml by default. Example:
```YAML
# rpc port; rpc_port and http_port must not both be empty. When rpc_port is empty
# and http_port is not, rpc_port is automatically set to http_port+1.
rpc_port: 18090
# http port; rpc_port and http_port must not both be empty. When rpc_port is
# available and http_port is empty, no http port is generated automatically.
http_port: 9999
# worker_num, the maximum concurrency. When build_dag_each_worker=True, the
# framework creates worker_num processes, each building its own grpcServer and DAG.
# When build_dag_each_worker=False, the framework sets the max_workers of the main
# thread's grpc thread pool to worker_num.
worker_num: 20
# build_dag_each_worker: False builds one DAG inside the process; True builds an
# independent DAG inside each process.
build_dag_each_worker: false

dag:
    # OP resource type: True for the thread model, False for the process model
    is_thread_op: False
    # number of retries
    retry: 1
    # profiling: True generates Timeline performance data, with some performance
    # impact; False disables it
    use_profile: false
    tracer:
        interval_s: 10

op:
    det:
        # concurrency: thread concurrency when is_thread_op=True, otherwise process concurrency
        concurrency: 6
        # when the op has no server_endpoints, the local service configuration is
        # read from local_service_conf
        local_service_conf:
            # client type: brpc, grpc or local_predictor. local_predictor does not
            # start a Serving service; inference runs in-process.
            client_type: local_predictor
            # det model path
            model_config: ocr_det_model
            # fetch list; use the alias_name of fetch_var in client_config
            fetch_list: ["concat_1.tmp_0"]
            # compute device ids: "" or absent means CPU inference; "0" or "0,1,2"
            # means GPU inference on the given cards
            devices: ""
            # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
            device_type: 0
            #use_mkldnn
            #use_mkldnn: True
            #ir_optim
            ir_optim: True
    rec:
        # concurrency: thread concurrency when is_thread_op=True, otherwise process concurrency
        concurrency: 3
        # timeout in ms
        timeout: -1
        # number of retries when calling Serving; no retry by default
        retry: 1
        # when the op has no server_endpoints, the local service configuration is
        # read from local_service_conf
        local_service_conf:
            # client type: brpc, grpc or local_predictor. local_predictor does not
            # start a Serving service; inference runs in-process.
            client_type: local_predictor
            # rec model path
            model_config: ocr_rec_model
            # fetch list; use the alias_name of fetch_var in client_config
            fetch_list: ["ctc_greedy_decoder_0.tmp_0", "softmax_0.tmp_0"]
            # compute device ids: "" or absent means CPU inference; "0" or "0,1,2"
            # means GPU inference on the given cards
            devices: ""
            # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
            device_type: 0
            #use_mkldnn
            #use_mkldnn: True
            #ir_optim
            ir_optim: True
```
### Multiple cards on a single machine

For single-machine multi-card inference, M OP processes are bound to N GPU cards, which requires three parameters in config.yml. First choose the process model (is_thread_op: False), so that the concurrency is the number of processes; then configure devices. Binding iterates over the GPU card ids as the processes start: for example, when 7 OP processes are started with device ids 0,1,2, the 1st, 4th and 7th processes bind to card 0, the 2nd and 5th to card 1, and the 3rd and 6th to card 2.
```YAML
# OP resource type: True for the thread model, False for the process model
is_thread_op: False
# concurrency: thread concurrency when is_thread_op=True, otherwise process concurrency
concurrency: 7
devices: "0,1,2"
```
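The binding rule described above is a plain round-robin over the configured card list; this pure-Python sketch reproduces the 7-process / 3-card example.

```python
devices = "0,1,2".split(",")
concurrency = 7

# Process i (1-based) binds to devices[(i - 1) % len(devices)].
for i in range(1, concurrency + 1):
    print(f"process {i} -> GPU {devices[(i - 1) % len(devices)]}")
# process 1 -> GPU 0, process 2 -> GPU 1, process 3 -> GPU 2,
# process 4 -> GPU 0, process 5 -> GPU 1, process 6 -> GPU 2, process 7 -> GPU 0
```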
### Heterogeneous hardware

Besides CPU and GPU, Python Pipeline also supports deployment on several kinds of heterogeneous hardware, controlled by device_type and devices in config.yaml. device_type takes precedence when specified; when it is left empty, the type is inferred from devices. The device_type values are:
- CPU(Intel) : 0
- GPU : 1
- TensorRT : 2
- CPU(Arm) : 3
- XPU : 4
config.yml中硬件配置:
```YAML
#计算硬件类型: 空缺时由devices决定(CPU/GPU),0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
device_type: 0
#计算硬件ID,优先由device_type决定硬件类型。devices为""或空缺时为CPU预测;当为"0", "0,1,2"时为GPU预测,表示使用的GPU卡
devices: "" # "0,1"
```
### 低精度推理
Python Pipeline支持低精度推理,CPU、GPU和TensorRT支持的精度类型如下所示:
- CPU
- fp32(default)
- fp16
- bf16(mkldnn)
- GPU
- fp32(default)
- fp16(TRT下有效)
- int8
- Tensor RT
- fp32(default)
- fp16
- int8
```YAML
#precision, 预测精度,降低预测精度可提升预测速度
#GPU 支持: "fp32"(default), "fp16(TensorRT)", "int8";
#CPU 支持: "fp32"(default), "fp16", "bf16"(mkldnn); 不支持: "int8"
precision: "fp32"
```
# Serving Configuration
([简体中文](./Serving_Configure_CN.md)|English)
## Overview
This guide focuses on Paddle C++ Serving and Python Pipeline configuration:
- [Model Configuration](#model-configuration): Auto generated when converting model. Specify model input/output.
- [C++ Serving](#c-serving): High-performance scenarios. Specify how to start quickly and start with user-defined configuration.
- [Python Pipeline](#python-pipeline): Multiple model combined scenarios.
## Model Configuration
The model configuration is generated when converting a model for PaddleServing and is named serving_client_conf.prototxt/serving_server_conf.prototxt. It records the model's input/output info so that users can assemble request parameters easily. The model configuration file should not be modified. See the [Saving guide](./Save_EN.md) for model conversion. The model configuration file must follow the `core/configure/proto/general_model_config.proto` format.
Example:
```
feed_var {
name: "x"
alias_name: "x"
is_lod_tensor: false
feed_type: 1
shape: 13
}
fetch_var {
name: "concat_1.tmp_0"
alias_name: "concat_1.tmp_0"
is_lod_tensor: false
fetch_type: 1
shape: 3
shape: 640
shape: 640
}
```
- feed_var: model input
- fetch_var: model output
- name: node name
- alias_name: alias name
- is_lod_tensor: whether the variable is a LoD tensor, ref to [LoD Introduction](./LOD_EN.md)
- feed_type/fetch_type: data type
|feed_type|Type|
|---------|----|
|0|INT64|
|1|FLOAT32|
|2|INT32|
|3|FP64|
|4|INT16|
|5|FP16|
|6|BF16|
|7|UINT8|
|8|INT8|
- shape:tensor shape
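For illustration, here is a minimal sketch of how the generated configuration is consumed on the client side. It assumes the uci_housing example model (feed alias `x`, fetch alias `price`, as in the general_model.prototxt example later in this doc) and a server already listening on port 9393:
```python
import numpy as np
from paddle_serving_client import Client

client = Client()
# serving_client_conf.prototxt is the client-side copy of the model configuration
client.load_client_config("uci_housing_client/serving_client_conf.prototxt")
client.connect(["127.0.0.1:9393"])

# feed/fetch keys are the alias_name values; feed_type 1 (FLOAT32) maps to float32 arrays
x = np.random.rand(1, 13).astype("float32")
fetch_map = client.predict(feed={"x": x}, fetch=["price"], batch=True)
print(fetch_map)
```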
## C++ Serving
### 1. Quick start and stop
The easiest way to start C++ Serving is to provide the `--model` and `--port` flags.
Example starting c++ serving:
```BASH
python3 -m paddle_serving_server.serve --model serving_model --port 9393
```
This command will generate the server configuration files under `workdir_9393`:
```
workdir_9393
├── general_infer_0
│   ├── fluid_time_file
│   ├── general_model.prototxt
│   └── model_toolkit.prototxt
├── infer_service.prototxt
├── resource.prototxt
└── workflow.prototxt
```
More flags:
| Argument | Type | Default | Description |
| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
| `thread`                                       | int  | `2`     | Number of brpc service threads                         |
| `op_num`                                       | int[]| `0`     | Number of threads for each model in asynchronous mode  |
| `op_max_batch`                                 | int[]| `32`    | Max batch size for each model in asynchronous mode     |
| `gpu_ids` | str[]| `"-1"` | Gpu card id for each model |
| `port` | int | `9292` | Exposed port of current service to users |
| `model` | str[]| `""` | Path of paddle model directory to be served |
| `mem_optim_off` | - | - | Disable memory / graphic memory optimization |
| `ir_optim` | bool | False | Enable analysis and optimization of calculation graph |
| `use_mkl` (Only for cpu version) | - | - | Run inference with MKL |
| `use_trt` (Only for trt version)               | -    | -       | Run inference with TensorRT. Must be used with `ir_optim`. |
| `use_lite` (Only for Intel x86 CPU or ARM CPU) | -    | -       | Run PaddleLite inference. Must be used with `ir_optim`. |
| `use_xpu`                                      | -    | -       | Run PaddleLite inference with Baidu Kunlun XPU. Must be used with `ir_optim`. |
| `precision` | str | FP32 | Precision Mode, support FP32, FP16, INT8 |
| `use_calib` | bool | False | Use TRT int8 calibration |
| `gpu_multi_stream`                             | bool | False   | Enable GPU multi-stream mode to get higher QPS        |
#### Serving a model with multiple GPUs.
```BASH
python3 -m paddle_serving_server.serve --model serving_model --thread 10 --port 9292 --gpu_ids 0,1,2
```
#### Serving two models.
```BASH
python3 -m paddle_serving_server.serve --model serving_model_1 serving_model_2 --thread 10 --port 9292
```
#### Stop Serving.
```BASH
python3 -m paddle_serving_server.serve stop
```
`stop` sends SIGINT to C++ Serving. Use `kill` instead to send SIGKILL to C++ Serving.
### 2. Starting with user-defined Configuration
In most cases the flags meet the demand. However, users can also modify the configuration files directly, including service.prototxt, workflow.prototxt, resource.prototxt, model_toolkit.prototxt and proj.conf.
Example starting with self-defined config:
```BASH
/bin/serving --flagfile=proj.conf
```
#### 2.1 proj.conf
You can provide proj.conf with lots of flags:
```
# for paddle inference
--precision=fp32
--use_calib=False
--reload_interval_s=10
# for brpc
--max_concurrency=0
--num_threads=10
--bthread_concurrency=10
--max_body_size=536870912
# default path
--inferservice_path=conf
--inferservice_file=infer_service.prototxt
--resource_path=conf
--resource_file=resource.prototxt
--workflow_path=conf
--workflow_file=workflow.prototxt
```
The table below sets out the detailed description:
| name | Default | Description |
|------|--------|------|
|precision|"fp32"|Precision Mode, support FP32, FP16, INT8|
|use_calib|False|Use TRT int8 calibration; only for deployment with TensorRT|
|reload_interval_s|10|Reload interval|
|max_concurrency|0|Limit of request processing in parallel, 0: unlimited|
|num_threads|10|Number of brpc service thread|
|bthread_concurrency|10|Number of bthread|
|max_body_size|536870912|Max size of brpc message|
|inferservice_path|"conf"|Path of inferservice conf|
|inferservice_file|"infer_service.prototxt"|Filename of inferservice conf|
|resource_path|"conf"|Path of resource conf|
|resource_file|"resource.prototxt"|Filename of resource conf|
|workflow_path|"conf"|Path of workflow conf|
|workflow_file|"workflow.prototxt"|Filename of workflow conf|
#### 2.2 service.prototxt
To set the listening port, modify service.prototxt. You can set `--inferservice_path` and `--inferservice_file` to instruct the server where to look for service.prototxt. The `service.prototxt` file must follow the `core/configure/server_configure.protobuf:InferServiceConf` format.
```
port: 8010
services {
name: "GeneralModelService"
workflows: "workflow1"
}
```
- port: Listening port.
- services: No need to modify. The workflow1 is defined in workflow.prototxt.
#### 2.3 workflow.prototxt
To serve user-defined OPs, modify workflow.prototxt. You can set `--workflow_path` and `--workflow_file` to instruct the server where to look for workflow.prototxt. The `workflow.prototxt` must follow the `core/configure/server_configure.protobuf:Workflow` format.
In the example below, the model is served with 3 OPs. The GeneralReaderOp converts the input data to tensors. The GeneralInferOp, which depends on the output of GeneralReaderOp, runs prediction on the tensors. The GeneralResponseOp returns the output data. A Python sketch that builds the same DAG follows the field list below.
```
workflows {
name: "workflow1"
workflow_type: "Sequence"
nodes {
name: "general_reader_0"
type: "GeneralReaderOp"
}
nodes {
name: "general_infer_0"
type: "GeneralInferOp"
dependencies {
name: "general_reader_0"
mode: "RO"
}
}
nodes {
name: "general_response_0"
type: "GeneralResponseOp"
dependencies {
name: "general_infer_0"
mode: "RO"
}
}
}
```
- name: The name of workflow.
- workflow_type: "Sequence"
- nodes: A workflow consists of nodes.
- node.name: The name of node. Corresponding to node type. Ref to `python/paddle_serving_server/dag.py`
- node.type: The bound operator. Ref to OPS in `serving/op`.
- node.dependencies: The list of upstream dependent operators.
- node.dependencies.name: The name of dependent operators.
- node.dependencies.mode: RO-Read Only, RW-Read Write
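These files are normally generated for you by `paddle_serving_server.serve`. As a rough sketch, assuming the `OpMaker`/`OpSeqMaker` Python API of `paddle_serving_server`, the same 3-OP sequence can also be declared from Python, which emits an equivalent workflow.prototxt:
```python
import paddle_serving_server as serving

op_maker = serving.OpMaker()
read_op = op_maker.create('general_reader')
infer_op = op_maker.create('general_infer')
response_op = op_maker.create('general_response')

# ops are chained in the order they are added: reader -> infer -> response
op_seq_maker = serving.OpSeqMaker()
op_seq_maker.add_op(read_op)
op_seq_maker.add_op(infer_op)
op_seq_maker.add_op(response_op)

server = serving.Server()
server.set_op_sequence(op_seq_maker.get_op_sequence())
server.load_model_config("serving_model")
server.prepare_server(workdir="workdir_9393", port=9393, device="cpu")
server.run_server()
```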
#### 2.4 resource.prototxt
You may modify resource.prototxt to set the path of the model files. You can set `--resource_path` and `--resource_file` to instruct the server where to look for resource.prototxt. The `resource.prototxt` must follow the `core/configure/server_configure.proto:ResourceConf` format.
```
model_toolkit_path: "conf"
model_toolkit_file: "general_infer_0/model_toolkit.prototxt"
general_model_path: "conf"
general_model_file: "general_infer_0/general_model.prototxt"
```
- model_toolkit_path: The directory path of model_toolkit.prototxt.
- model_toolkit_file: The file name of model_toolkit.prototxt.
- general_model_path: The directory path of general_model.prototxt.
- general_model_file: The file name of general_model.prototxt.
#### 2.5 model_toolkit.prototxt
The model_toolkit.prototxt specifies the parameters of predictor engines. The `model_toolkit.prototxt` provided must be a `core/configure/server_configure.proto:ModelToolkitConf`.
Example using cpu engine:
```
engines {
name: "general_infer_0"
type: "PADDLE_INFER"
reloadable_meta: "uci_housing_model/fluid_time_file"
reloadable_type: "timestamp_ne"
model_dir: "uci_housing_model"
gpu_ids: -1
enable_memory_optimization: true
enable_ir_optimization: false
use_trt: false
use_lite: false
use_xpu: false
use_gpu: false
combined_model: false
gpu_multi_stream: false
runtime_thread_num: 0
batch_infer_size: 32
enable_overrun: false
allow_split_request: true
}
```
- name: The name of engine corresponding to the node name in workflow.prototxt.
- type: Only "PADDLE_INFER" is supported.
- reloadable_meta: The marker file used for reload checking.
- reloadable_type: Support timestamp_ne/timestamp_gt/md5sum/revision/none
|reloadable_type|Description|
|---------------|----|
|timestamp_ne|Reload when the mtime of the reloadable_meta file changes|
|timestamp_gt|Reload when the mtime of the reloadable_meta file is greater than the last recorded mtime|
|md5sum|Currently unused; never reloads|
|revision|Currently unused; never reloads|
- model_dir: The path of model files.
- gpu_ids: Specify the gpu ids. Support multiple device ids:
```
# GPU0,1,2
gpu_ids: 0
gpu_ids: 1
gpu_ids: 2
```
- enable_memory_optimization: Enable memory optimization.
- enable_ir_optimization: Enable ir optimization.
- use_trt: Enable Tensor RT. Need use_gpu on.
- use_lite: Enable PaddleLite.
- use_xpu: Enable KUNLUN XPU.
- use_gpu: Enable GPU.
- combined_model: Enable combined model.
- gpu_multi_stream: Enable gpu multiple stream mode.
- runtime_thread_num: Async mode is enabled when greater than 0, and the given number of predictor instances is created.
- batch_infer_size: The max batch size in async mode.
- enable_overrun: In async mode, always put the whole task into the task queue.
- allow_split_request: Allow splitting request tasks in async mode.
#### 2.6 general_model.prototxt
The content of general_model.prototxt is the same as serving_server_conf.prototxt; it describes the model's input and output parameters. Example:
```
feed_var {
name: "x"
alias_name: "x"
is_lod_tensor: false
feed_type: 1
shape: 13
}
fetch_var {
name: "fc_0.tmp_1"
alias_name: "price"
is_lod_tensor: false
fetch_type: 1
shape: 1
}
```
## Python Pipeline
### Quick start and stop
Example starting Pipeline Serving:
```BASH
python3 -m paddle_serving_server.serve --model serving_model --port 9393
```
### Stop Serving.
```BASH
python3 -m paddle_serving_server.serve stop
```
`stop` sends SIGINT to Pipeline Serving. Use `kill` instead to send SIGKILL to Pipeline Serving.
### yml Configuration
Python Pipeline provides a user-friendly programming framework for multi-model composite services.
Example of config.yaml:
```YAML
#RPC port. The RPC port and HTTP port cannot be empty at the same time. If the RPC port is empty and the HTTP port is not empty, the RPC port is automatically set to HTTP port+1.
rpc_port: 18090
#HTTP port. The RPC port and the HTTP port cannot be empty at the same time. If the RPC port is available and the HTTP port is empty, the HTTP port is not automatically generated
http_port: 9999
#worker_num, the maximum concurrency.
#When build_dag_each_worker=True, the server creates worker_num processes, each with its own gRPC server and DAG.
#When build_dag_each_worker=False, the server sets max_workers of the gRPC thread pool to worker_num.
worker_num: 20
#build_dag_each_worker: False, build one DAG inside the process; True, build an independent DAG in each worker process
build_dag_each_worker: false
dag:
#True, thread model;False,process model
is_thread_op: False
#retry times
retry: 1
# True, generate Timeline profiling data (affects performance); False, disable profiling
use_profile: false
tracer:
interval_s: 10
op:
det:
#concurrency,is_thread_op=True,thread otherwise process
concurrency: 6
#Loading local server configuration without server_endpoints.
local_service_conf:
#client type,include brpc, grpc and local_predictor.
client_type: local_predictor
#det model path
model_config: ocr_det_model
#Fetch data list
fetch_list: ["concat_1.tmp_0"]
#Device ID
devices: ""
# device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
device_type: 0
#use_mkldnn
#use_mkldnn: True
#ir_optim
ir_optim: True
rec:
#concurrency,is_thread_op=True,thread otherwise process
concurrency: 3
#time out, ms
timeout: -1
#retry times
retry: 1
#Loading local server configuration without server_endpoints.
local_service_conf:
#client type,include brpc, grpc and local_predictor.
client_type: local_predictor
#rec model path
model_config: ocr_rec_model
#Fetch data list
fetch_list: ["ctc_greedy_decoder_0.tmp_0", "softmax_0.tmp_0"]
#Device ID
devices: ""
# device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
device_type: 0
#use_mkldnn
#use_mkldnn: True
#ir_optim
ir_optim: True
```
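The YAML above is loaded by a pipeline service program. Below is a minimal hedged sketch (not the full OCR example; the op names `det` and `rec` must match the keys under `op:` in the config):
```python
from paddle_serving_server.web_service import WebService, Op

class DetOp(Op):
    pass  # the default preprocess/postprocess pass data straight through

class RecOp(Op):
    pass

class OcrService(WebService):
    def get_pipeline_response(self, read_op):
        det_op = DetOp(name="det", input_ops=[read_op])
        rec_op = RecOp(name="rec", input_ops=[det_op])
        return rec_op  # the output of the last op becomes the response

service = OcrService(name="ocr")
service.prepare_pipeline_config("config.yml")
service.run_service()
```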
### Single-machine and multi-card inference
Single-machine multi-card inference can be abstracted as M OP processes bound to N GPU cards, controlled by three parameters in config.yml. First select the process mode (is_thread_op: False), so the concurrency equals the number of processes; then set devices to the GPU card IDs. Processes are bound to cards round-robin at startup: for example, with 7 OP processes and devices: "0,1,2", processes 1, 4 and 7 are bound to card 0, processes 2 and 5 to card 1, and processes 3 and 6 to card 2.
Reference config.yaml:
```YAML
#True, thread model;False,process model
is_thread_op: False
#concurrency,is_thread_op=True,thread otherwise process
concurrency: 7
devices: "0,1,2"
```
### Heterogeneous Devices
In addition to CPU and GPU, Pipeline also supports deployment on a variety of heterogeneous hardware, controlled by device_type and devices in config.yml. device_type takes precedence when specifying the type; when it is absent, the type is inferred from devices. The device_type values are as follows:
- CPU(Intel) : 0
- GPU : 1
- TensorRT : 2
- CPU(Arm) : 3
- XPU : 4
Reference config.yaml:
```YAML
# device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
device_type: 0
devices: "" # "0,1"
```
### Low precision inference
Python Pipeline supports low-precision inference. The precision types supported by CPU, GPU and TensorRT are listed below:
- CPU
- fp32(default)
- fp16
- bf16(mkldnn)
- GPU
- fp32(default)
  - fp16 (effective with TensorRT)
- int8
- Tensor RT
- fp32(default)
- fp16
- int8
```YAML
#precision
#GPU support: "fp32"(default), "fp16(TensorRT)", "int8";
#CPU support: "fp32"(default), "fp16", "bf16"(mkldnn); not support: "int8"
precision: "fp32"
```
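The same precision switch is exposed when running a model in-process. A minimal sketch, assuming the `precision` parameter of `LocalPredictor.load_model_config` in `python/paddle_serving_app/local_predict.py` (which the pipeline's local_service_conf forwards to):
```python
from paddle_serving_app.local_predict import LocalPredictor

predictor = LocalPredictor()
# "fp16" requires a GPU; "bf16" is only meaningful on CPU with mkldnn
predictor.load_model_config("serving_server", use_gpu=True, gpu_id=0,
                            precision="fp16")
```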
# Paddle Serving设计文档
(简体中文|[English](./Serving_Design_EN.md))
## 1. 设计目标
> 跨平台运行
跨平台是不依赖于操作系统,也不依赖硬件环境。一个操作系统下开发的应用,放到另一个操作系统下依然可以运行。因此,设计上既要考虑开发语言、组件是跨平台的,同时也要考虑不同系统上编译器的解释差异。
Docker 是一个开源的应用容器引擎,让开发者可以打包他们的应用以及依赖包到一个可移植的容器中,然后发布到任何流行的Linux机器或Windows机器上。我们将Paddle Serving框架打包了多种Docker镜像,镜像列表参考《[Docker镜像](./Docker_Images_CN.md)》,根据用户的使用场景选择镜像。为方便用户使用Docker,我们提供了帮助文档《[如何在Docker中运行PaddleServing](./Run_In_Dokcer_CN.md)》。目前,Python webservice模式可在原生系统Linux和Windows双系统上部署运行。《[Windows平台使用Paddle Serving指导](./Windows_Tutorial_CN.md)》
> 支持多种开发语言SDK
Paddle Serving提供了3种开发语言SDK,包括Python、C++、Java。Golang SDK在建设中,有兴趣的开源开发者可以提交PR。
+ Python,参考python/examples下client示例 或 4.2 web服务示例
+ C++,参考《[从零开始写一个预测服务](./C++_Serving/Creat_C++Serving_CN.md)》
+ Java,参考《[Paddle Serving Client Java SDK](./Java_SDK_CN.md)》
> 支持多种硬件设备
以IMDB评论情感分析任务为例通过9步展示,Paddle Serving从模型的训练到部署预测服务的全流程《[AIStudio教程-Paddle Serving服务化部署框架](https://www.paddlepaddle.org.cn/tutorials/projectdetail/1555945)》
由于无法直接查看模型文件中feed和fetch参数信息,不方便用户拼装参数。因此,Paddle Serving开发一个工具将Paddle模型转成Serving的格式,生成包含feed和fetch参数信息的prototxt文件。下图是uci_housing示例的生成的prototxt文件,更多转换方法参考文档《[怎样保存用于Paddle Serving的模型](./Save_CN.md)》。
```
feed_var {
name: "x"
...
```
### 3.3 模型管理与热加载
Paddle Serving的C++引擎支持模型管理功能,支持多种模型和模型不同版本的管理。为了保证在模型更换期间推理服务的可用性,需要在服务不中断的情况下对模型进行热加载。Paddle Serving对该特性进行了支持,并提供了一个监控产出模型更新本地模型的工具,具体例子请参考《[Paddle Serving中的模型热加载](./C++_Serving/Hot_Loading_CN.md)》。
### 3.4 模型加解密
Paddle Serving采用对称加密算法对模型进行加密,在服务加载模型过程中在内存中解密。目前,提供基础的模型安全能力,并不保证模型绝对安全性,用户可根据我们的设计加以完善,实现更高级别的安全性。说明文档参考《[加密模型预测](./C++_Serving/Encryption_CN.md)》
### 3.5 A/B Test
在对模型进行充分的离线评估后,通常需要进行在线A/B测试,来决定是否大规模上线服务。下图为使用Paddle Serving做A/B测试的基本结构,Client端做好相应的配置后,自动将流量分发给不同的Server,从而完成A/B测试。具体例子请参考《[如何使用Paddle Serving做ABTEST](./C++_Serving/ABTEST_CN.md)》。
### 5.2 核心设计与使用用例
Pipeline Serving核心设计是图执行引擎,基本处理单元是OP和Channel,通过组合实现一套有向无环图,设计与使用文档参考《[Pipeline Serving设计与实现](./Python_Pipeline/Pipeline_Design_CN.md)》
<center>
<img src='images/pipeline_serving-image2.png' height = "300" align="middle"/>
</center>
## 6. 未来计划
### 6.1 向量检索、树结构检索
在推荐与广告场景的召回系统中,通常需要采用基于向量的快速检索或者基于树结构的快速检索,Paddle Serving会对这方面的检索引擎进行集成或扩展。
### 6.2 服务监控
集成普罗米修斯监控,一套开源的监控&报警&时间序列数据库的组合,适合k8s和docker的监控系统。
# Paddle Serving Design Doc
([简体中文](./Serving_Design_CN.md)|English)
## 1. Design Objectives
Cross-platform is not dependent on the operating system, nor on the hardware environment. Applications developed under one operating system can still run under another operating system. Therefore, the design should consider not only the development language and the cross-platform components, but also the interpretation differences of the compilers on different systems.
Docker is an open source application container engine that allows developers to package their applications and dependencies into a portable container, and then publish it to any popular Linux machine or Windows machine. We have packaged a variety of Docker images for the Paddle Serving framework; refer to the image list《[Docker Images](Docker_Images_EN.md)》and select an image according to your usage. We provide Docker usage documentation《[How to run PaddleServing in Docker](Run_In_Docker_EN.md)》. Currently, the Python webservice mode can be deployed and run on native Linux and Windows systems.《[Paddle Serving for Windows Users](Windows_Tutorial_EN.md)》
> Support client SDKs in multiple development languages
Paddle Serving provides client SDKs in 3 development languages: Python, C++ and Java. We hope that interested open source developers will help submit PRs.
+ Python, refer to the client examples under python/examples or the 4.2 web service example.
+ C++, refer to《[从零开始写一个预测服务](C++_Serving/Creat_C++Serving_CN.md)》
+ Java, refer to《[Paddle Serving Client Java SDK](Java_SDK_EN.md)》
> Support multiple hardware devices
Models trained on other deep learning platforms can be converted through the《[PaddlePaddle/X2Paddle工具](https://github.com/PaddlePaddle/X2Paddle)》. We have converted multiple mainstream CV models to Paddle models; TensorFlow, Caffe, ONNX and PyTorch model conversion is tested.《[AIStudio教程-Paddle Serving服务化部署框架](https://www.paddlepaddle.org.cn/tutorials/projectdetail/1555945)》
Because it is impossible to directly view the feed and fetch parameter information in the model file, it is not convenient for users to assemble the parameters. Therefore, Paddle Serving developed a tool to convert the Paddle model into the Serving format and generate a prototxt file containing feed and fetch parameter information. The following is the generated prototxt file of the uci_housing example. For more conversion methods, refer to the document《[How to save a servable model of Paddle Serving?](Save_EN.md)》.
```
feed_var {
name: "x"
...
```
### 3.3 Model Management and Hot Reloading
C++ Serving supports model management functions, including management of multiple models and multiple model versions. In order to ensure service availability, models need to be hot loaded without service interruption. Paddle Serving supports this feature and provides a tool that monitors output models and updates local models. Please refer to [Hot loading in Paddle Serving](C++_Serving/Hot_Loading_EN.md) for specific examples.
### 3.4 MODEL ENCRYPTION INFERENCE
Paddle Serving uses a symmetric encryption algorithm to encrypt the model and decrypts it in memory while the service loads the model. At present, this provides basic model security capabilities and does not guarantee absolute model security; users can extend our design to achieve a higher level of security. Documentation reference《[MODEL ENCRYPTION INFERENCE](C++_Serving/Encryption_EN.md)》
### 3.5 A/B Test
After sufficient offline evaluation of the model, an online A/B test is usually needed to decide whether to enable the service on a large scale. The following figure shows the basic structure of an A/B test with Paddle Serving: after the client is configured accordingly, traffic is automatically distributed to different servers to achieve the A/B test. Please refer to [ABTEST in Paddle Serving](C++_Serving/ABTest_EN.md) for specific examples.
### 5.2 Core Design And Use Cases
The core design of Pipeline Serving is a graph execution engine, and the basic processing units are OP and Channel. A set of directed acyclic graphs can be realized through combination. For design and usage, refer to《[Pipeline Serving](Python_Pipeline/Pipeline_Design_EN.md)》
<center>
<img src='images/pipeline_serving-image2.png' height = "300" align="middle"/>
</center>
## Paddle Serving uses TensorRT
(English|[简体中文](./TENSOR_RT_CN.md))
### Background
Deploying models trained on mainstream frameworks through the TensorRT tool launched by Nvidia can greatly increase the speed of model inference, often at least twice as fast as the original framework, while also taking up less device memory. Therefore, mastering the deployment of deep learning models with TensorRT is very useful for all users who need to deploy models. Paddle Serving provides comprehensive TensorRT ecological support.
### Environment
The Cuda 10.1, Cuda 10.2 and Cuda 11 versions of Serving support TensorRT.
#### Install Paddle
In [Development using Docker environment](./RUN_IN_DOCKER.md) and [Docker image list](./DOCKER_IMAGES.md), we provide the TensorRT development images. After starting a container from the image, you need to install the Paddle whl package that supports TensorRT; refer to the documentation on the home page:
```
# GPU Cuda10.2 environment please execute
pip install paddlepaddle-gpu==2.0.0
```
**Note**: If your Cuda version is not 10.2, please do not execute the above commands directly; refer to the [Paddle official documentation - multi-version whl package list](https://www.paddlepaddle.org.cn/documentation/docs/en/install/Tables_en.html#multi-version-whl-package-list-release),
select the URL link for the corresponding GPU environment and install it. For example, Python 2.7 users with Cuda 10.1 should select the url corresponding to `cp27-cp27mu` and
`cuda10.1-cudnn7.6-trt6.0.1.5`, copy it and execute:
```
pip install https://paddle-wheel.bj.bcebos.com/with-trt/2.0.0-gpu-cuda10.1-cudnn7-mkl/paddlepaddle_gpu-2.0.0.post101-cp27-cp27mu-linux_x86_64.whl
```
Since the default `paddlepaddle-gpu==2.0.0` is built for Cuda 10.2 without TensorRT, if you need to use TensorRT with `paddlepaddle-gpu`, you need to find `cuda10.2-cudnn8.0-trt7.1.3` in the above multi-version whl package list and download the whl for the corresponding Python version.
#### Install Paddle Serving
```
# Cuda 10.2
pip install paddle-serving-server-gpu==${VERSION}.post102
# Cuda 10.1
pip install paddle-serving-server-gpu==${VERSION}.post101
# Cuda 11
pip install paddle-serving-server-gpu==${VERSION}.post11
```
### Use TensorRT
#### RPC mode
In the [Serving model examples](../python/examples), we provide models that can be accelerated using TensorRT, such as the [Faster_RCNN model](../python/examples/detection/faster_rcnn_r50_fpn_1x_coco) under detection.
We just need to run:
```
wget --no-check-certificate https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/faster_rcnn_r50_fpn_1x_coco.tar
tar xf faster_rcnn_r50_fpn_1x_coco.tar
python -m paddle_serving_server.serve --model serving_server --port 9494 --gpu_ids 0 --use_trt
```
and the TensorRT version of the faster_rcnn model server is started.
#### Local Predictor mode
In [local_predictor](../python/paddle_serving_app/local_predict.py#L52), users can explicitly specify `use_trt=True` and pass it to `load_model_config`.
Everything else is the same as the usual Local Predictor usage; pay attention to the model's compatibility with TensorRT.
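A minimal sketch of this mode, assuming the `serving_server` directory from the RPC example above:
```python
from paddle_serving_app.local_predict import LocalPredictor

predictor = LocalPredictor()
# use_trt=True selects the TensorRT backend; requires a TensorRT-enabled Paddle build
predictor.load_model_config("serving_server", use_gpu=True, gpu_id=0, use_trt=True)
# afterwards predictor.predict(feed={...}, fetch=[...]) is used as usual
```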
#### Pipeline Mode
In [Pipeline mode](./PIPELINE_SERVING.md), our [imagenet example](../python/examples/pipeline/imagenet/config.yml#L23) gives the way to set TensorRT.
## Paddle Serving 使用 TensorRT
([English](./TENSOR_RT.md)|简体中文)
### 背景
通过Nvidia推出的TensorRT工具来部署主流框架上训练的模型能够极大的提高模型推断的速度,往往相比于原本的框架能够有至少1倍以上的速度提升,同时占用的设备内存也会更加的少。因此对所有需要部署模型的用户来说,掌握用TensorRT来部署深度学习模型的方法是非常有用的。Paddle Serving提供了全面的TensorRT生态支持。
### 环境
Serving 的Cuda10.1 Cuda10.2和Cuda11版本支持TensorRT。
#### 安装Paddle
[使用Docker环境开发](./RUN_IN_DOCKER_CN.md)[Docker镜像列表](./DOCKER_IMAGES_CN.md)当中,我们给出了TensorRT的开发镜像。使用镜像启动之后,需要安装支持TensorRT的Paddle whl包,参考首页的文档
```
# GPU Cuda10.2环境请执行
pip install paddlepaddle-gpu==2.0.0
```
**注意**: 如果您的Cuda版本不是10.2,请勿直接执行上述命令,需要参考[Paddle官方文档-多版本whl包列表
](https://www.paddlepaddle.org.cn/documentation/docs/zh/install/Tables.html#whl-release)
选择相应的GPU环境的url链接并进行安装,例如Cuda 10.1的Python2.7用户,请选择表格当中的`cp27-cp27mu`
`cuda10.1-cudnn7.6-trt6.0.1.5`对应的url,复制下来并执行
```
pip install https://paddle-wheel.bj.bcebos.com/with-trt/2.0.0-gpu-cuda10.1-cudnn7-mkl/paddlepaddle_gpu-2.0.0.post101-cp27-cp27mu-linux_x86_64.whl
```
由于默认的`paddlepaddle-gpu==2.0.0`是Cuda 10.2,并没有联编TensorRT,因此如果需要在`paddlepaddle-gpu`上使用TensorRT,需要在上述多版本whl包列表当中,找到`cuda10.2-cudnn8.0-trt7.1.3`,下载对应的Python版本。
#### 安装Paddle Serving
```
# Cuda 10.2
pip install paddle-serving-server-gpu==${VERSION}.post102
# Cuda 10.1
pip install paddle-serving-server-gpu==${VERSION}.post101
# Cuda 11
pip install paddle-serving-server-gpu==${VERSION}.post11
```
### 使用TensorRT
#### RPC模式
[Serving模型示例](../python/examples)当中,我们有给出可以使用TensorRT加速的模型,例如detection下的[Faster_RCNN模型](../python/examples/detection/faster_rcnn_r50_fpn_1x_coco)
我们只需
```
wget --no-check-certificate https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/faster_rcnn_r50_fpn_1x_coco.tar
tar xf faster_rcnn_r50_fpn_1x_coco.tar
python -m paddle_serving_server.serve --model serving_server --port 9494 --gpu_ids 0 --use_trt
```
TensorRT版本的faster_rcnn模型服务端就启动了
#### Local Predictor模式
[local_predictor](../python/paddle_serving_app/local_predict.py#L52)当中,用户可以显式制定`use_trt=True`传入到`load_model_config`当中。
其他方式和其他Local Predictor使用方法没有区别,需要注意模型对TensorRT的兼容性。
#### Pipeline模式
[Pipeline模式](./PIPELINE_SERVING_CN.md)当中,我们的[imagenet例子](../python/examples/pipeline/imagenet/config.yml#L23)给出了设置TensorRT的方式。
## Windows平台使用Paddle Serving指导
([English](./Windows_Tutorial_EN.md)|简体中文)
### 综述
**运行OCR示例**
```
cd Serving/examples/C++/PaddleOCR/ocr/
python -m paddle_serving_app.package --get_model ocr_rec
tar -xzvf ocr_rec.tar.gz
python -m paddle_serving_app.package --get_model ocr_det
...
your_service = YourService(name="XXX")
your_service.load_model_config("your_model_path")
your_service.prepare_server(workdir="workdir", port=9292)
# 如果是GPU用户,可以参照Serving/examples/Pipeline/PaddleOCR/ocr下的python示例
your_service.run_debugger_service()
# Windows平台不可以使用 run_rpc_service()接口
your_service.run_web_service()
...
print(r.json())
```
用户只需要按照如上指示,在对应函数中实现相关内容即可。更多信息请参见[如何开发一个新的Web Service?](./C++_Serving/Http_Service_CN.md)
开发完成后执行
## Paddle Serving for Windows Users
(English|[简体中文](./Windows_Tutorial_CN.md))
### Summary
**Run OCR example**:
```
cd Serving/examples/C++/PaddleOCR/ocr/
python -m paddle_serving_app.package --get_model ocr_rec
tar -xzvf ocr_rec.tar.gz
python -m paddle_serving_app.package --get_model ocr_det
...
your_service = YourService(name="XXX")
your_service.load_model_config("your_model_path")
your_service.prepare_server(workdir="workdir", port=9292)
# If you are a GPU user, you can refer to the python example under Serving/examples/Pipeline/PaddleOCR/ocr
your_service.run_debugger_service()
# Windows platform cannot use run_rpc_service() interface
your_service.run_web_service()
...
print(r.json())
```
The user only needs to follow the above instructions and implement the relevant content in the corresponding function. For more information, please refer to [How to develop a new Web Service?](./C++_Serving/Http_Service_CN.md)
Execute after development
# C++ Serving Design
([简体中文](./C++DESIGN_CN.md)|English)
## 1. Background
PaddlePaddle is Baidu's open source machine learning framework, which supports a wide range of customized development of deep learning models; Paddle Serving is the online prediction framework of Paddle, which seamlessly connects with Paddle model training and provides cloud services for machine learning prediction. This article describes the Paddle Serving design from the bottom up, from the model, service, and access levels.
1. The model is the core of Paddle Serving prediction, including the management of model data and inference calculations;
2. The prediction framework encapsulates the model for inference calculations and provides an external RPC interface to connect different upstreams;
3. The prediction service SDK provides a set of access frameworks.
The result is a complete serving solution.
## 2. Terms explanation
- **baidu-rpc**: Baidu's official open source RPC framework, supports multiple common communication protocols, and provides a custom interface experience based on protobuf
- **Variant**: Paddle Serving architecture is an abstraction of a minimal prediction cluster, which is characterized by all internal instances (replicas) being completely homogeneous and logically corresponding to a fixed version of a model
- **Endpoint**: Multiple Variants form an Endpoint. Logically, Endpoint represents a model, and Variants within the Endpoint represent different versions.
- **OP**: In PaddlePaddle, an OP encapsulates a numerical calculation operator; in Paddle Serving, an OP represents a basic business operation operator whose core interface is inference. An OP configures the upstream OPs it depends on, connecting multiple OPs into a workflow
- **Channel**: An abstraction of all request-level intermediate data of the OP; data exchange between OPs through Channels
- **Bus**: manages all Channels in a thread, and schedules the access relationships between the OP set and the Channel set according to the DAG dependency graph
- **Stage**: in the topology described by the Workflow DAG, a collection of OPs that belong to the same step and can be executed in parallel
- **Node**: An OP operator instance composed of an OP operator class combined with parameter configuration, which is also an execution unit in Workflow
- **Workflow**: executes the inference interface of each OP in order according to the topology described by DAG
- **DAG/Workflow**: consists of several interdependent Nodes. Each Node can obtain the Request object through a specific interface. The node Op obtains the output object of its pre-op through the dependency relationship. The output of the last Node is the Response object by default.
- **Service**: encapsulates a PV request; several Workflows can be configured, which share the current PV's Request object and execute in parallel/serial, with the final Response written to the corresponding output slot. A Paddle Serving process can configure multiple Service interfaces, and the upstream selects the Service interface to access based on the ServiceName.
## 3. Python Interface Design
### 3.1 Core Targets:
A set of Paddle Serving dynamic libraries that supports remote inference services for general models saved by Paddle, and exposes the various underlying capabilities of Paddle Serving through the Python interface.
### 3.2 General Model:
Models that can be predicted using the Paddle Inference Library, models saved during training, including Feed Variable and Fetch Variable
### 3.3 Overall design:
- The user starts the Client and Server through the Python Client. The Python API has a function to check whether the interconnection and the models to be accessed match.
- The Python API calls the pybind bindings of the client and server functions implemented by Paddle Serving; the information exchanged between client and server is transmitted via RPC.
- The Client Python API currently has two simple functions, load_inference_conf and predict, which are used to perform loading of the model to be predicted and prediction, respectively.
- The Server Python API is mainly responsible for loading the inference model and generating various configurations required by Paddle Serving, including engines, workflow, resources, etc.
### 3.4 Server Interface
![Server Interface](images/server_interface.png)
### 3.5 Client Interface
<img src='images/client_inferface.png' width = "600" height = "200">
### 3.6 Client io used during Training
Paddle Serving provides a save-model interface that can be used during the training process; it is basically the same as the Paddle save-inference-model interface. Through feed_var_dict and fetch_var_dict
you can alias the input and output variables. The configuration that Serving needs to read at startup is saved in the client and server storage directories.
``` python
def save_model(server_model_folder,
client_config_folder,
feed_var_dict,
fetch_var_dict,
main_program=None)
```
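A hedged usage sketch, assuming a paddlepaddle 2.x static-graph (fluid) program and that this function is exposed as `paddle_serving_client.io.save_model`:
```python
import paddle
import paddle.fluid as fluid
import paddle_serving_client.io as serving_io

paddle.enable_static()  # the save-model interface works with static-graph programs

# build a trivial linear model just to obtain feed/fetch variables
x = fluid.data(name="x", shape=[None, 13], dtype="float32")
y = fluid.layers.fc(input=x, size=1)

exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())

# the dict keys ("x", "price") become alias_name in the generated prototxt files
serving_io.save_model("serving_server", "serving_client",
                      {"x": x}, {"price": y},
                      fluid.default_main_program())
```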
## 4. Paddle Serving Underlying Framework
![Paddle Serving Overall Architecture](images/framework.png)
**Model Management Framework**: Connects model files of multiple machine learning platforms and provides a unified inference interface
**Business Scheduling Framework**: Abstracts the calculation logic of various different inference models, provides a general DAG scheduling framework, and connects different operators through DAG diagrams to complete a prediction service together. This abstract model allows users to conveniently implement their own calculation logic, and at the same time facilitates operator sharing. (Users build their own forecasting services. A large part of their work is to build DAGs and provide operators.)
**Predict Service**: Encapsulation of the externally provided prediction service interface. Define communication fields with the client through protobuf.
### 4.1 Model Management Framework
The model management framework is responsible for managing the models trained by the machine learning framework. It can be abstracted into three levels: model loading, model data, and model inference.
#### Model Loading
Load model from disk to memory, support multi-version, hot-load, incremental update, etc.
#### Model data
The model's data structure in memory, integrating the fluid inference lib
#### Inferencer
Provides a unified inference interface for upper layers:
```C++
class FluidFamilyCore {
virtual bool Run(const void* in_data, void* out_data);
virtual int create(const std::string& data_path);
virtual int clone(void* origin_core);
};
```
### 4.2 Business Scheduling Framework
#### 4.2.1 Inference Service
With reference to the TensorFlow framework's abstraction of model computation, the business logic is abstracted into a DAG, driven by configuration to generate a workflow while skipping C++ code compilation. Each specific step of the service corresponds to a specific OP, and an OP can configure the upstream OPs it depends on. Unified message passing between OPs is achieved by the thread-level Bus and Channel mechanisms. For example, the process of a simple prediction service can be abstracted into 3 steps, reading request data -> calling the prediction interface -> writing back the prediction result, implemented as 3 OPs: ReaderOp -> ClassifyOp -> WriteOp
![Infer Service](images/predict-service.png)
Regarding the dependencies between OPs, and the establishment of workflows through OPs, you can refer to [从零开始写一个预测服务](CREATING.md) (simplified Chinese Version)
Server instance perspective
![Server instance perspective](images/server-side.png)
#### 4.2.2 Paddle Serving Multi-Service Mechanism
![Paddle Serving multi-service](images/multi-service.png)
Paddle Serving instances can load multiple models at the same time, and each model uses a Service (and its configured workflow) to undertake services. You can refer to [service configuration file in Demo example](../tools/cpp_examples/demo-serving/conf/service.prototxt) to learn how to configure multiple services for the serving instance
#### 4.2.3 Hierarchical relationship of business scheduling
From the client's perspective, a Paddle Serving service can be divided into three levels: Service, Endpoint, and Variant from top to bottom.
![Call hierarchy relationship](images/multi-variants.png)
One Service corresponds to one inference model, and there is one endpoint under the model. Different versions of the model are implemented through multiple variant concepts under endpoint:
The same model prediction service can configure multiple variants, and each variant has its own downstream IP list. The client code can configure relative weights for each variant to achieve the relationship of adjusting the traffic ratio (refer to the description of variant_weight_list in [Client Configuration](CLIENT_CONFIGURE.md) section 3.2).
![Client-side proxy function](images/client-side-proxy.png)
## 5. User Interface
Under the premise of meeting certain interface specifications, the service framework does not place any restrictions on user data fields, so as to meet the different business interfaces of various prediction services. Baidu-rpc inherits the interface of the Protobuf service, and the user describes the Request and Response business interfaces according to the Protobuf syntax specification. Paddle Serving is built on the Baidu-rpc framework and supports this feature by default.
No matter how the communication protocol changes, the framework only needs to ensure that the communication protocol between client and server and the format of the business data are synchronized to ensure normal communication. This information can be broken down as follows:
- Protocol: header information agreed in advance between Server and Client to ensure mutual recognition of the data format. Paddle Serving uses Protobuf as the basic communication format.
- Data: used to describe the interfaces of Request and Response, such as the sample data to be predicted and the score returned by the prediction. Including:
  - Data fields: field definitions included in the two data structures Request and Response.
  - Description interface: similar to the protocol interface; Protobuf is supported by default.
### 5.1 Data Compression Method
Baidu-rpc has built-in data compression methods such as snappy, gzip, zlib, which can be configured in the configuration file (refer to [Client Configuration](CLIENT_CONFIGURE.md) Section 3.1 for an introduction to compress_type)
### 5.2 C ++ SDK API Interface
```C++
class PredictorApi {
public:
int create(const char* path, const char* file);
int thrd_initialize();
int thrd_clear();
int thrd_finalize();
void destroy();
Predictor* fetch_predictor(std::string ep_name);
int free_predictor(Predictor* predictor);
};
class Predictor {
public:
// synchronize interface
virtual int inference(google::protobuf::Message* req,
google::protobuf::Message* res) = 0;
// asynchronize interface
virtual int inference(google::protobuf::Message* req,
google::protobuf::Message* res,
DoneType done,
brpc::CallId* cid = NULL) = 0;
// synchronize interface
virtual int debug(google::protobuf::Message* req,
google::protobuf::Message* res,
butil::IOBufBuilder* debug_os) = 0;
};
```
### 5.3 Interfaces related to Op
```C++
class Op {
// ------Getters for Channel/Data/Message of dependent OP-----
// Get the Channel object of dependent OP
Channel* mutable_depend_channel(const std::string& op);
// Get the Channel object of dependent OP
const Channel* get_depend_channel(const std::string& op) const;
template <typename T>
T* mutable_depend_argument(const std::string& op);
template <typename T>
const T* get_depend_argument(const std::string& op) const;
// -----Getters for Channel/Data/Message of current OP----
// Get pointer to the protobuf message of current OP
google::protobuf::Message* mutable_message();
// Get pointer to the protobuf message of current OP
const google::protobuf::Message* get_message() const;
// Get the template class data object of current OP
template <typename T>
T* mutable_data();
// Get the template class data object of current OP
template <typename T>
const T* get_data() const;
// ---------------- Other base class members ----------------
int init(Bus* bus,
Dag* dag,
uint32_t id,
const std::string& name,
const std::string& type,
void* conf);
int deinit();
int process(bool debug);
// Get the input object
const google::protobuf::Message* get_request_message();
const std::string& type() const;
uint32_t id() const;
// ------------------ OP Interface -------------------
// Get the derived Channel object of current OP
virtual Channel* mutable_channel() = 0;
// Get the derived Channel object of current OP
virtual const Channel* get_channel() const = 0;
// Release the derived Channel object of current OP
virtual int release_channel() = 0;
// Inference interface
virtual int inference() = 0;
// ------------------ Conf Interface -------------------
virtual void* create_config(const configure::DAGNode& conf) { return NULL; }
virtual void delete_config(void* conf) {}
virtual void set_config(void* conf) { return; }
// ------------------ Metric Interface -------------------
virtual void regist_metric() { return; }
};
```
### 5.4 Interfaces related to framework
Service
```C++
class InferService {
public:
static const char* tag() { return "service"; }
int init(const configure::InferService& conf);
int deinit() { return 0; }
int reload();
const std::string& name() const;
const std::string& full_name() const { return _infer_service_format; }
// Execute each workflow serially
virtual int inference(const google::protobuf::Message* request,
google::protobuf::Message* response,
butil::IOBufBuilder* debug_os = NULL);
int debug(const google::protobuf::Message* request,
google::protobuf::Message* response,
butil::IOBufBuilder* debug_os);
};
class ParallelInferService : public InferService {
public:
// Execute workflows in parallel
int inference(const google::protobuf::Message* request,
google::protobuf::Message* response,
butil::IOBufBuilder* debug_os) {
return 0;
}
};
```
ServerManager
```C++
class ServerManager {
public:
typedef google::protobuf::Service Service;
ServerManager();
static ServerManager& instance() {
static ServerManager server;
return server;
}
static bool reload_starting() { return _s_reload_starting; }
static void stop_reloader() { _s_reload_starting = false; }
int add_service_by_format(const std::string& format);
int start_and_wait();
};
```
DAG
```C++
class Dag {
public:
EdgeMode parse_mode(std::string& mode); // NOLINT
int init(const char* path, const char* file, const std::string& name);
int init(const configure::Workflow& conf, const std::string& name);
int deinit();
uint32_t nodes_size();
const DagNode* node_by_id(uint32_t id);
const DagNode* node_by_id(uint32_t id) const;
const DagNode* node_by_name(std::string& name); // NOLINT
const DagNode* node_by_name(const std::string& name) const;
uint32_t stage_size();
const DagStage* stage_by_index(uint32_t index);
const std::string& name() const { return _dag_name; }
const std::string& full_name() const { return _dag_name; }
void regist_metric(const std::string& service_name);
};
```
Workflow
```C++
class Workflow {
public:
Workflow() {}
static const char* tag() { return "workflow"; }
// Each workflow object corresponds to an independent
// configure file, so you can share the object between
// different apps.
int init(const configure::Workflow& conf);
DagView* fetch_dag_view(const std::string& service_name);
int deinit() { return 0; }
void return_dag_view(DagView* view);
int reload();
const std::string& name() { return _name; }
const std::string& full_name() { return _name; }
};
```
# C++ Serving设计方案
(简体中文|[English](./C++DESIGN.md))
注意:本页内容已经过期,请查看:[设计文档](DESIGN_DOC_CN.md)
## 1. 项目背景
PaddlePaddle是百度开源的机器学习框架,广泛支持各种深度学习模型的定制化开发; Paddle Serving是Paddle的在线预测部分,与Paddle模型训练环节无缝衔接,提供机器学习预测云服务。本文将从模型、服务、接入等层面,自底向上描述Paddle Serving设计方案。
1. 模型是Paddle Serving预测的核心,包括模型数据和推理计算的管理;
2. 预测框架封装模型推理计算,对外提供RPC接口,对接不同上游;
3. 预测服务SDK提供一套接入框架
最终形成一套完整的serving解决方案。
## 2. 名词解释
- **baidu-rpc**: 百度官方开源RPC框架,支持多种常见通信协议,提供基于protobuf的自定义接口体验
- **Variant**: Paddle Serving架构对一个最小预测集群的抽象,其特点是内部所有实例(副本)完全同质,逻辑上对应一个model的一个固定版本
- **Endpoint**: 多个Variant组成一个Endpoint,逻辑上看,Endpoint代表一个model,Endpoint内部的Variant代表不同的版本
- **OP**: PaddlePaddle用来封装一种数值计算的算子,Paddle Serving用来表示一种基础的业务操作算子,核心接口是inference。OP通过配置其依赖的上游OP,将多个OP串联成一个workflow
- **Channel**: 一个OP所有请求级中间数据的抽象;OP之间通过Channel进行数据交互
- **Bus**: 对一个线程中所有channel的管理,以及根据DAG之间的DAG依赖图对OP和Channel两个集合间的访问关系进行调度
- **Stage**: Workflow按照DAG描述的拓扑图中,属于同一个环节且可并行执行的OP集合
- **Node**: 由某个OP算子类结合参数配置组成的OP算子实例,也是Workflow中的一个执行单元
- **Workflow**: 按照DAG描述的拓扑,有序执行每个OP的inference接口
- **DAG/Workflow**: 由若干个相互依赖的Node组成,每个Node均可通过特定接口获得Request对象,节点OP通过依赖关系获得其前置OP的输出对象,最后一个Node的输出默认就是Response对象
- **Service**: 对一次PV的请求封装,可配置若干条Workflow,彼此之间复用当前PV的Request对象,然后各自并行/串行执行,最后将Response写入对应的输出slot中;一个Paddle-serving进程可配置多套Service接口,上游根据ServiceName决定当前访问的Service接口。
## 3. Python Interface设计
### 3.1 核心目标:
完成一整套Paddle Serving的动态库,支持Paddle保存的通用模型的远程预估服务,通过Python Interface调用PaddleServing底层的各种功能。
### 3.2 通用模型:
能够使用Paddle Inference Library进行预测的模型,在训练过程中保存的模型,包含Feed Variable和Fetch Variable
### 3.3 整体设计:
- 用户通过Python Client启动Client和Server,Python API有检查互联和待访问模型是否匹配的功能
- Python API背后调用的是Paddle Serving实现的client和server对应功能的pybind,互传的信息通过RPC实现
- Client Python API当前有两个简单的功能,load_inference_conf和predict,分别用来执行加载待预测的模型和预测
- Server Python API主要负责加载预估模型,以及生成Paddle Serving需要的各种配置,包括engines,workflow,resource等
### 3.4 Server Inferface
![Server Interface](images/server_interface.png)
### 3.5 Client Interface
<img src='images/client_inferface.png' width = "600" height = "200">
### 3.6 训练过程中使用的Client io
PaddleServing设计可以在训练过程中使用的保存模型接口,与Paddle保存inference model的接口基本一致,feed_var_dict与fetch_var_dict
可以为输入和输出变量起别名,serving启动需要读取的配置会保存在client端和server端的保存目录中。
``` python
def save_model(server_model_folder,
client_config_folder,
feed_var_dict,
fetch_var_dict,
main_program=None)
```
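For illustration, a minimal sketch of calling `save_model` at the end of a training script might look as follows; the variable names (`x`, `prediction`) and the two save directories are assumptions for the example:
```python
# Hypothetical usage inside a training script: alias the input variable as
# "x" and the output variable as "price", then save the server-side model
# and the client-side configuration into two separate directories.
save_model("uci_housing_model",    # read by the server at startup
           "uci_housing_client",   # read by the client at startup
           {"x": x},               # feed_var_dict: alias -> input variable
           {"price": prediction},  # fetch_var_dict: alias -> output variable
           fluid.default_main_program())
```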
## 4. Underlying Framework of Paddle Serving
![Paddle Serving overall architecture](images/framework.png)
**Model management framework**: connects the model files of multiple machine learning platforms and provides a unified inference interface upwards
**Business scheduling framework**: abstracts the computation logic of the various inference models and provides a general DAG scheduling framework; different operators are chained through the DAG graph to jointly complete one inference request. This abstraction lets users implement their own computation logic conveniently and makes operators easy to share. (When building their own inference service, a large part of the user's work is building the DAG and providing the operator implementations.)
**PredictService**: the encapsulation of the externally exposed inference service interface. The fields exchanged with the client are defined via protobuf.
### 4.1 Model Management Framework
The model management framework is responsible for managing the models trained by machine learning frameworks. It can be abstracted into 3 levels: model loading, model data and model inference.
#### Model loading
Loads the model from disk into memory, with support for multiple versions, hot reloading, incremental updates, etc.
#### Model data
The in-memory data structure of the model, integrating the fluid inference lib
#### inferencer
Provides a unified inference interface upwards for the inference service
```C++
class FluidFamilyCore {
virtual bool Run(const void* in_data, void* out_data);
virtual int create(const std::string& data_path);
virtual int clone(void* origin_core);
};
```
### 4.2 Business Scheduling Framework
#### 4.2.1 Inference Service
Following the abstraction the TF framework applies to model computation, the business logic is abstracted into a DAG that is driven by configuration to generate a workflow, with no C++ compilation step. Each concrete step of the business corresponds to one OP, and every OP can configure the upstream OPs it depends on. Message passing between OPs is handled uniformly by the thread-level Bus and Channel mechanism. For example, a simple inference service can be abstracted into 3 steps, read request data -> call the inference interface -> write back the inference result, implemented by 3 OPs: ReaderOp -> ClassifyOp -> WriteOp
![Inference service](images/predict-service.png)
For the dependencies between OPs and how a workflow is assembled from OPs, see the relevant chapters of [Creating an inference service from scratch](CREATING.md); a configuration sketch follows below.
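As a rough illustration of how such a DAG is written down, a workflow configuration for the three-step service above might look like the sketch below; the field names follow the Workflow/DAGNode configure messages referenced elsewhere in this document, but the exact schema should be taken from the repo's configure protos:
```protobuf
workflows {
  name: "workflow1"
  workflow_type: "Sequence"
  nodes {
    name: "reader_op"
    type: "ReaderOp"
  }
  nodes {
    name: "classify_op"
    type: "ClassifyOp"
    dependencies {
      name: "reader_op"
      mode: "RO"    # read-only access to the upstream channel
    }
  }
  nodes {
    name: "write_op"
    type: "WriteOp"
    dependencies {
      name: "classify_op"
      mode: "RO"
    }
  }
}
```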
Server-side instance perspective
![Server-side instance perspective](images/server-side.png)
#### 4.2.2 Multi-Service Mechanism of Paddle Serving
![Multi-service mechanism of Paddle Serving](images/multi-service.png)
A Paddle Serving instance can load multiple models at the same time, each model served by one Service (with its configured workflow). See the [service configuration file of the demo](../tools/cpp_examples/demo-serving/conf/service.prototxt) to learn how to configure multiple services for one serving instance
#### 4.2.3 Hierarchy of Business Scheduling
From the client's point of view, a Paddle Serving service is divided, top-down, into 3 levels: Service, Endpoint and Variant
![Calling hierarchy](images/multi-variants.png)
One Service corresponds to one inference model, and the model has 1 endpoint under it. Different versions of the model are realized through multiple variants under the endpoint:
The same model inference service can configure multiple variants, each variant with its own downstream IP list. The client code can assign relative weights to the variants to tune the traffic ratio between them (see the description of variant_weight_list in section 3.2 of [Client Configuration](CLIENT_CONFIGURE.md)).
![Client-side proxy](images/client-side-proxy.png)
## 5. User Interface
As long as certain interface conventions are met, the serving framework places no constraints on the user's data fields, so the different business interfaces of all kinds of inference services can be accommodated. Baidu-rpc inherits the Protobuf service interface, and users describe their Request and Response business interfaces following the Protobuf syntax specification. Paddle Serving is built on the Baidu-rpc framework and supports this feature by default.
No matter how the communication protocol evolves, the framework only needs to keep the communication protocol and the business data format in sync between Client and Server to guarantee normal communication. This information can be broken down further:
- Protocol: header information agreed in advance between Server and Client to guarantee mutual recognition of the data format. Paddle Serving uses Protobuf as the basic communication format
- Data: the interface describing the Request and the Response, e.g. the sample data to be predicted and the returned scores. It includes (see the sketch below):
  - Data fields: the fields contained in the two data structures, the request package Request and the response package Response
  - Description interface: similar to the protocol interface; Protobuf is supported by default
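As a concrete illustration, a minimal pair of business messages for a classification service could be declared as in the sketch below; the message and field names are illustrative, not the repo's actual proto definitions:
```protobuf
syntax = "proto2";

// One sample to be predicted.
message DenseInstance {
  repeated float features = 1;
}

// Request/Response pair the client and server agree on.
message Request {
  repeated DenseInstance instances = 1;
}

message Response {
  repeated float scores = 1;  // one score per input instance
}
```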
### 5.1 Data Compression
Baidu-rpc has built-in data compression methods such as snappy, gzip and zlib, which can be selected in the configuration file (see the introduction of compress_type in section 3.1 of [Client Configuration](CLIENT_CONFIGURE.md))
### 5.2 C++ SDK API
```C++
class PredictorApi {
public:
int create(const char* path, const char* file);
int thrd_initialize();
int thrd_clear();
int thrd_finalize();
void destroy();
Predictor* fetch_predictor(std::string ep_name);
int free_predictor(Predictor* predictor);
};
class Predictor {
public:
  // synchronous interface
virtual int inference(google::protobuf::Message* req,
google::protobuf::Message* res) = 0;
  // asynchronous interface
virtual int inference(google::protobuf::Message* req,
google::protobuf::Message* res,
DoneType done,
brpc::CallId* cid = NULL) = 0;
  // synchronous debug interface
virtual int debug(google::protobuf::Message* req,
google::protobuf::Message* res,
butil::IOBufBuilder* debug_os) = 0;
};
```
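A minimal sketch of how these SDK classes are typically driven from one client thread; the endpoint name `ximage` and the config file names mirror the examples later in this document, and error handling is omitted:
```C++
// Hypothetical client-side usage of the C++ SDK.
PredictorApi api;
api.create("./conf", "predictors.prototxt");  // load the SDK config once per process
api.thrd_initialize();                        // once per worker thread

Request req;   // business messages defined by the user's proto
Response res;
// ... fill req ...
Predictor* predictor = api.fetch_predictor("ximage");  // endpoint name from config
predictor->inference(&req, &res);                      // synchronous call
api.free_predictor(predictor);

api.thrd_clear();     // between requests
api.thrd_finalize();  // when the thread exits
api.destroy();
```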
### 5.3 OP-Related Interfaces
```C++
class Op {
// ------Getters for Channel/Data/Message of dependent OP-----
// Get the Channel object of dependent OP
Channel* mutable_depend_channel(const std::string& op);
// Get the Channel object of dependent OP
const Channel* get_depend_channel(const std::string& op) const;
template <typename T>
T* mutable_depend_argument(const std::string& op);
template <typename T>
const T* get_depend_argument(const std::string& op) const;
// -----Getters for Channel/Data/Message of current OP----
  // Get pointer to the protobuf message of current OP
google::protobuf::Message* mutable_message();
// Get pointer to the protobuf message of current OP
const google::protobuf::Message* get_message() const;
// Get the template class data object of current OP
template <typename T>
T* mutable_data();
// Get the template class data object of current OP
template <typename T>
const T* get_data() const;
// ---------------- Other base class members ----------------
int init(Bus* bus,
Dag* dag,
uint32_t id,
const std::string& name,
const std::string& type,
void* conf);
int deinit();
int process(bool debug);
// Get the input object
const google::protobuf::Message* get_request_message();
const std::string& type() const;
uint32_t id() const;
// ------------------ OP Interface -------------------
// Get the derived Channel object of current OP
virtual Channel* mutable_channel() = 0;
// Get the derived Channel object of current OP
virtual const Channel* get_channel() const = 0;
// Release the derived Channel object of current OP
virtual int release_channel() = 0;
// Inference interface
virtual int inference() = 0;
// ------------------ Conf Interface -------------------
virtual void* create_config(const configure::DAGNode& conf) { return NULL; }
virtual void delete_config(void* conf) {}
virtual void set_config(void* conf) { return; }
// ------------------ Metric Interface -------------------
virtual void regist_metric() { return; }
};
```
### 5.4 Framework-Related Interfaces
Service
```C++
class InferService {
public:
static const char* tag() { return "service"; }
int init(const configure::InferService& conf);
int deinit() { return 0; }
int reload();
const std::string& name() const;
const std::string& full_name() const { return _infer_service_format; }
// Execute each workflow serially
virtual int inference(const google::protobuf::Message* request,
google::protobuf::Message* response,
butil::IOBufBuilder* debug_os = NULL);
int debug(const google::protobuf::Message* request,
google::protobuf::Message* response,
butil::IOBufBuilder* debug_os);
};
class ParallelInferService : public InferService {
public:
// Execute workflows in parallel
int inference(const google::protobuf::Message* request,
google::protobuf::Message* response,
butil::IOBufBuilder* debug_os) {
return 0;
}
};
```
ServerManager
```C++
class ServerManager {
public:
typedef google::protobuf::Service Service;
ServerManager();
static ServerManager& instance() {
static ServerManager server;
return server;
}
static bool reload_starting() { return _s_reload_starting; }
static void stop_reloader() { _s_reload_starting = false; }
int add_service_by_format(const std::string& format);
int start_and_wait();
};
```
DAG
```C++
class Dag {
public:
EdgeMode parse_mode(std::string& mode); // NOLINT
int init(const char* path, const char* file, const std::string& name);
int init(const configure::Workflow& conf, const std::string& name);
int deinit();
uint32_t nodes_size();
const DagNode* node_by_id(uint32_t id);
const DagNode* node_by_id(uint32_t id) const;
const DagNode* node_by_name(std::string& name); // NOLINT
const DagNode* node_by_name(const std::string& name) const;
uint32_t stage_size();
const DagStage* stage_by_index(uint32_t index);
const std::string& name() const { return _dag_name; }
const std::string& full_name() const { return _dag_name; }
void regist_metric(const std::string& service_name);
};
```
Workflow
```C++
class Workflow {
public:
Workflow() {}
static const char* tag() { return "workflow"; }
// Each workflow object corresponds to an independent
// configure file, so you can share the object between
// different apps.
int init(const configure::Workflow& conf);
DagView* fetch_dag_view(const std::string& service_name);
int deinit() { return 0; }
void return_dag_view(DagView* view);
int reload();
const std::string& name() { return _name; }
const std::string& full_name() { return _name; }
};
```
# How to develop a new Web service?
([简体中文](NEW_WEB_SERVICE_CN.md)|English)
This document will take Uci service as an example to introduce how to develop a new Web Service. You can check out the complete code [here](../python/examples/pipeline/simple_web_service/web_service.py).
## Op base class
In some services a single model may not meet the business needs, and multiple models have to be chained in series or run in parallel to complete the whole service. We call a single model operation an Op and provide a simple set of interfaces to implement the complex logic of chaining Ops or running them in parallel.
Data between Ops is passed as dictionaries, Ops can be started as threads or processes, and the number of concurrent instances of each Op can be configured, among other options.
Typically, you need to inherit the Op base class and override its `init_op`, `preprocess` and `postprocess` methods, which are implemented by default as follows:
```python
class Op(object):
def init_op(self):
pass
def preprocess(self, input_dicts):
# multiple previous Op
if len(input_dicts) != 1:
_LOGGER.critical(
"Failed to run preprocess: this Op has multiple previous "
"inputs. Please override this func.")
os._exit(-1)
(_, input_dict), = input_dicts.items()
return input_dict
def postprocess(self, input_dicts, fetch_dict):
return fetch_dict
```
### init_op
This method is used to load user-defined resources such as dictionaries. A separator is loaded in the [UciOp](../python/examples/pipeline/simple_web_service/web_service.py).
**Note**: if the Op is launched in thread mode with multiple concurrent instances, the different threads of the same Op execute `init_op` only once and share the resources it loads.
### preprocess
This method preprocesses the data before model prediction. It takes one parameter, `input_dicts`, a dictionary whose keys are the `name`s of the preceding Ops and whose values are the data passed from those Ops (each value is itself a dictionary).
The `preprocess` method needs to turn the data into an ndarray dictionary (keys are feed variable names, values the corresponding ndarrays). The Op takes the return value as the input of the model prediction and passes the output to the `postprocess` method.
**Note**: if Op does not have a model configuration file, the return value of `preprocess` will be directly passed to `postprocess`.
### postprocess
This method is used for data post-processing after model prediction. It has two parameters, `input_dicts` and `fetch_dict`.
The `input_dicts` parameter is the same as in the `preprocess` method, and `fetch_dict` is the output of the model prediction (keys are fetch variable names, values the corresponding ndarrays). The Op takes the return value of `postprocess` as the input of the `preprocess` of subsequent Ops.
**Note**: if Op does not have a model configuration file, `fetch_dict` will be the return value of `preprocess`.
Here is the op of the UCI example:
```python
import numpy as np  # used by preprocess below


class UciOp(Op):
    def init_op(self):
        self.separator = ","

    def preprocess(self, input_dicts):
        (_, input_dict), = input_dicts.items()
        x_value = input_dict["x"]
        if isinstance(x_value, str):  # a `unicode` check is only needed on Python 2
            input_dict["x"] = np.array(
                [float(x.strip()) for x in x_value.split(self.separator)])
        return input_dict

    def postprocess(self, input_dicts, fetch_dict):
        fetch_dict["price"] = str(fetch_dict["price"][0][0])
        return fetch_dict
```
## WebService base class
Paddle Serving implements the [WebService](https://github.com/PaddlePaddle/Serving/blob/develop/python/paddle_serving_server/web_service.py#L23) base class. You need to override its `get_pipeline_response` method to define the topological relationship between Ops and return the Op that produces the Response. The default implementation is as follows:
```python
class WebService(object):
def get_pipeline_response(self, read_op):
return None
```
Here `read_op` serves as the entry point of the topology of the whole service (that is, the first user-defined Op takes `read_op` as its predecessor).
For a single-Op service (single model), take the Uci service as an example (the whole service contains only the Uci house-price prediction model):
```python
class UciService(WebService):
def get_pipeline_response(self, read_op):
uci_op = UciOp(name="uci", input_ops=[read_op])
return uci_op
```
For a multi-Op service (multiple models), take the Ocr service as an example (the whole service is completed by the Det model and the Rec model in series):
```python
class OcrService(WebService):
def get_pipeline_response(self, read_op):
det_op = DetOp(name="det", input_ops=[read_op])
rec_op = RecOp(name="rec", input_ops=[det_op])
return rec_op
```
A WebService object needs to load a yaml configuration file through the `prepare_pipeline_config` method to configure each Op and the service as a whole. The simplest configuration file looks as follows (Uci example):
```yaml
http_port: 18080
op:
uci:
local_service_conf:
model_config: uci_housing_model # path
```
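With the Op, the service class and the configuration file in place, bringing the service up takes only a few lines. The following sketch mirrors the Uci example; the config file name `config.yml` is an assumption:
```python
# Hypothetical launch script for the Uci example above.
uci_service = UciService(name="uci")
uci_service.prepare_pipeline_config("config.yml")  # the yaml shown above
uci_service.run_service()                          # blocks and serves requests
```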
All supported yaml fields are listed below:
```yaml
rpc_port: 18080 # gRPC port
build_dag_each_worker: false # whether to use the process-version servicer. The default is false
worker_num: 1 # gRPC thread pool size (the number of processes for the process-version servicer). The default is 1
http_port: 0 # HTTP service port. The HTTP service is not started when the value is less than or equal to 0. The default is 0
dag:
    is_thread_op: true # whether to use the thread version of Op. The default is true
    client_type: brpc # use the brpc or grpc client. The default is brpc
    retry: 1 # number of times the DAG executor retries after a failure. The default is 1, i.e. no retry
    use_profile: false # whether to print profiling logs on the server side. The default is false
    tracer:
        interval_s: -1 # monitoring interval of the Tracer, in seconds. Monitoring is disabled when the value is less than 1. The default is -1
op:
    <op_name>: # op name, matching the one defined in the program
        concurrency: 1 # op concurrency, the default is 1
        timeout: -1 # prediction timeout in milliseconds. The default is -1, i.e. no timeout
        retry: 1 # number of retries after a timeout. The default is 1, i.e. no retry
        batch_size: 1 # batch_size for auto-batching; if set, the Op merges the data of multiple requests into one batch
        auto_batching_timeout: -1 # auto-batching timeout in milliseconds. The default is -1, i.e. no timeout
        local_service_conf:
            model_config: # path of the model files. No default value (None); if not configured, no model file is loaded
            workdir: "" # working directory of the model
            thread_num: 2 # number of threads the model is started with
            devices: "" # device the model runs on; GPU card numbers can be specified (e.g. "0,1,2"); CPU by default
            mem_optim: true # memory optimization option, the default is true
            ir_optim: false # IR optimization option, the default is false
```
All fields of Op can be defined when Op is created in the program (which will override yaml fields).
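For instance, a concurrency override at construction time might look like the sketch below; treat the exact constructor signature as an assumption and check the Op base class for the full parameter list:
```python
# Hypothetical: a field set in code takes precedence over the yaml value.
uci_op = UciOp(name="uci", input_ops=[read_op], concurrency=4)
```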
# Building an Inference Service Cluster
As [Client Configuration](../CLIENT_CONFIGURE.md) has shown, a multi-replica, multi-Variant inference cluster can be built by properly configuring predictors.prototxt, the client SDK's configuration file. Taking an image classification task as an example, the following simulates on a single machine both a single-Variant multi-replica cluster and a multi-Variant cluster
## 1. Single-Variant Multi-Replica Cluster
### 1.1 Create a serving replica on the local machine
First copy the serving directory
```shell
$ cd /path/to/paddle-serving/build/output/demo
$ cp -r serving/ serving_new/
$ cd serving_new/
```
In the serving_new directory, add the following line to conf/gflags.conf, changing the listening port to 8011 so that this replica listens on a different port
```shell
--port=8011
```
Then start the new replica
```shell
$ bin/serving&
```
### 1.2 Modify the client configuration to add the new replica address to the IP list:
```shell
$ cd /path/to/paddle-serving/build/output/demo/client/image_classification
```
Modify the ImageClassifyService section of conf/predictors.prototxt as shown below
```protobuf
predictors {
name: "ximage"
service_name: "baidu.paddle_serving.predictor.image_classification.ImageClassifyService"
endpoint_router: "WeightedRandomRender"
weighted_random_render_conf {
variant_weight_list: "50"
}
variants {
tag: "var1"
naming_conf {
      cluster: "list://127.0.0.1:8010, 127.0.0.1:8011" # add the new replica address here
}
}
}
```
Restart the client
```shell
$ bin/ximage&
```
Check that both serving replica directories have received requests:
```shell
$ cd /path/to/paddle-serving/build/output/demo/serving
$ tail -f log/serving.INFO
$ cd /path/to/paddle-serving/build/output/demo/serving_new
$ tail -f log/serving.INFO
```
## 2. Multiple Variants
### 2.1 Create a new serving replica on the local machine
Same as section 1.1, omitted here
### 2.2 Modify the client configuration to add a Variant
```shell
$ cd /path/to/paddle-serving/build/output/demo/client/image_classification
```
Modify the ImageClassifyService section of conf/predictors.prototxt as shown below
```protobuf
predictors {
name: "ximage"
service_name: "baidu.paddle_serving.predictor.image_classification.ImageClassifyService"
endpoint_router: "WeightedRandomRender"
weighted_random_render_conf {
    variant_weight_list: "50 | 50" # 2 variants in total, representing 2 versions of the model; the weights set the traffic ratio used for scheduling
}
variants {
tag: "var1"
naming_conf {
cluster: "list://127.0.0.1:8010"
}
}
  variants { # add a new variant
tag: "var2"
naming_conf {
cluster: "list://127.0.0.1:8011"
}
}
}
```
Restart the client
```shell
$ bin/ximage&
```
Check that both serving replica directories have received requests:
```shell
$ cd /path/to/paddle-serving/build/output/demo/serving
$ tail -f log/serving.INFO
$ cd /path/to/paddle-serving/build/output/demo/serving_new
$ tail -f log/serving.INFO
```
Check that the client has received responses from both Variant1 and Variant2
```shell
$ cd /path/to/paddle-serving/build/output/demo/client/image_classification
$ tail -f log/ximage.INFO
```
The following is normal output
```
I0307 17:54:22.862087 24719 ximage.cpp:172] Debug string:
I0307 17:54:22.862650 24719 ximage.cpp:110] sample-0's classify result: n02112018,博美犬, prop: 0.522815
I0307 17:54:22.862666 24719 ximage.cpp:114] Succ call predictor[ximage], the tag is: var1, elapse_ms: 333
I0307 17:54:23.194780 24719 ximage.cpp:172] Debug string:
I0307 17:54:23.195322 24719 ximage.cpp:110] sample-0's classify result: n02112018,博美犬, prop: 0.522815
I0307 17:54:23.195334 24719 ximage.cpp:114] Succ call predictor[ximage], the tag is: var2, elapse_ms: 332
```
# CTR Estimation Model
## 1. Background
In business scenarios such as search, recommendation and online advertising, the embedding parameters are often huge, reaching hundreds of GB or even TB scale; training a model of this size requires multi-machine distributed training, with the parameters sharded for updating and saving; on the other hand, a trained model of this size is also hard to load on a single machine for online serving. Paddle Serving provides a large-scale sparse parameter read/write service: users can conveniently host ultra-large-scale sparse parameters in the parameter service in kv form, and online inference only reads the required subset of parameters back from the parameter service before executing the rest of the inference flow.
We take the CTR estimation model as an example to demonstrate how the large-scale sparse parameter service is used in Paddle Serving. For model details please refer to the [original model](https://github.com/PaddlePaddle/models/tree/v1.5/PaddleRec/ctr)
According to the [description of the dataset](https://www.kaggle.com/c/criteo-display-ad-challenge/data), the raw input of the model consists of 13 integer features and 26 categorical features. In our model, the 13 integer features are fed as one dense feature into a single data layer, while each of the 26 categorical features is fed into its own data layer. In addition, the label is fed as one more feature so that the AUC metric can be computed.
With the default training parameters, the embedding dim of this model is 1M and the size is 10, i.e. the parameter matrix is a 1000000 x 10 float matrix, which actually occupies 1000000 x 10 x sizeof(float) = 39MB of memory in total; **in real scenarios the embedding parameters are much larger, so this demo is for demonstration only**
## 2. Model Pruning
At the time of writing ([v1.5](https://github.com/PaddlePaddle/models/tree/v1.5)), the training script uses PaddlePaddle py_reader to speed up sample reading, so the program contains py_reader-related OPs; in addition, only the model parameters were saved during training, not the program, and the saved parameters cannot be loaded directly by the inference library. Also, the tensors the original network finally outputs are auc and batch_auc, whereas when the model is used for inference only each sample's predict is needed, so the model's output tensor has to be changed to predict. Furthermore, to demonstrate the sparse parameter service, we deliberately remove the lookup_table OPs of the embedding layers from the inference program, take the output variables of the embedding layers as network inputs, and add the corresponding feed OPs, so that at inference time, after fetching the embedding vectors from the sparse parameter service, the data can be fed directly into the output variables of the embeddings.
Based on the above considerations, the original program needs to be pruned. The overall process is:
1) Remove the py_reader-related code and use fluid's built-in reader and DataFeed instead
2) Modify the original network configuration so that the predict variable becomes a fetch target
3) Modify the original network configuration so that the outputs of the embedding layers of the 26 sparse parameters become feed targets, to be used together with the sparse parameter service later
4) Train the modified network locally for 1 batch, then call `fluid.io.save_inference_model()` to obtain the pruned model program
5) Post-process the pruned program with python once more to remove the lookup_table OPs of the embedding layers. This is needed because the current Paddle Fluid does not prune cleanly in the step-4 `save_inference_model()` call and still keeps the lookup_table OPs of the embeddings; if these OPs were not removed, each embedding's output variable would have 2 input OPs, the feed OP we add and the lookup_table; since the lookup_table has no input, its output would overwrite the feed OP's output and corrupt the data. The network also still contains the SparseFeatFactors variable (the variable behind the globally shared embedding matrix); it has to be removed as well, otherwise the network would still try to read the embedding parameters from disk when loaded, which would defeat the purpose of this demo.
6) Save the program obtained in step 4 together with the model parameters saved by distributed training (excluding the embedding) to form the complete inference model
After steps 1) - 5), the pruned network configuration looks as follows:
![Pruned CTR prediction network](../images/pruned-ctr-network.png)
The whole pruning process is explained in detail below:
### 2.1 Remove py_reader from the network configuration
When the inference program calls the ctr_dnn_model() function, pass the extra argument `use_py_reader=False`. This removes the py_reader-related code inside the ctr_dnn_model definition
Before:
```python
def train():
args = parse_args()
if not os.path.isdir(args.model_output_dir):
os.mkdir(args.model_output_dir)
loss, auc_var, batch_auc_var, py_reader, _ = ctr_dnn_model(args.embedding_size, args.sparse_feature_dim)
...
```
After:
```python
def train():
args = parse_args()
if not os.path.isdir(args.model_output_dir):
os.mkdir(args.model_output_dir)
loss, auc_var, batch_auc_var, py_reader, _ = ctr_dnn_model(args.embedding_size, args.sparse_feature_dim, use_py_reader=False)
...
```
### 2.2 Modify feed targets and fetch targets in the network configuration
As stated at the beginning of section 2, to make the program suitable for demonstrating sparse parameters, we prune it and change the feed variable list and the fetch variables of `ctr_dnn_model`:
1) The inputs of the 26 sparse features in the inference program are changed to the output variables of each feature's embedding layer
2) The fetch targets return predict instead of auc_var and batch_auc_var
As of this writing, the original `ctr_dnn_model` (in network_conf.py) is defined as follows:
```python
def ctr_dnn_model(embedding_size, sparse_feature_dim, use_py_reader=True):
def embedding_layer(input):
emb = fluid.layers.embedding(
input=input,
is_sparse=True,
# you need to patch https://github.com/PaddlePaddle/Paddle/pull/14190
# if you want to set is_distributed to True
is_distributed=False,
size=[sparse_feature_dim, embedding_size],
param_attr=fluid.ParamAttr(name="SparseFeatFactors",
initializer=fluid.initializer.Uniform()))
        return fluid.layers.sequence_pool(input=emb, pool_type='average') # change 1
dense_input = fluid.layers.data(
name="dense_input", shape=[dense_feature_dim], dtype='float32')
sparse_input_ids = [
fluid.layers.data(name="C" + str(i), shape=[1], lod_level=1, dtype='int64')
for i in range(1, 27)]
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
words = [dense_input] + sparse_input_ids + [label]
py_reader = None
if use_py_reader:
py_reader = fluid.layers.create_py_reader_by_data(capacity=64,
feed_list=words,
name='py_reader',
use_double_buffer=True)
words = fluid.layers.read_file(py_reader)
    sparse_embed_seq = list(map(embedding_layer, words[1:-1])) # change 2
concated = fluid.layers.concat(sparse_embed_seq + words[0:1], axis=1)
fc1 = fluid.layers.fc(input=concated, size=400, act='relu',
param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
scale=1 / math.sqrt(concated.shape[1]))))
fc2 = fluid.layers.fc(input=fc1, size=400, act='relu',
param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
scale=1 / math.sqrt(fc1.shape[1]))))
fc3 = fluid.layers.fc(input=fc2, size=400, act='relu',
param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
scale=1 / math.sqrt(fc2.shape[1]))))
predict = fluid.layers.fc(input=fc3, size=2, act='softmax',
param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
scale=1 / math.sqrt(fc3.shape[1]))))
cost = fluid.layers.cross_entropy(input=predict, label=words[-1])
avg_cost = fluid.layers.reduce_sum(cost)
accuracy = fluid.layers.accuracy(input=predict, label=words[-1])
auc_var, batch_auc_var, auc_states = \
fluid.layers.auc(input=predict, label=words[-1], num_thresholds=2 ** 12, slide_steps=20)
    return avg_cost, auc_var, batch_auc_var, py_reader, words # change 3
```
After:
```python
def ctr_dnn_model(embedding_size, sparse_feature_dim, use_py_reader=True):
def embedding_layer(input):
emb = fluid.layers.embedding(
input=input,
is_sparse=True,
# you need to patch https://github.com/PaddlePaddle/Paddle/pull/14190
# if you want to set is_distributed to True
is_distributed=False,
size=[sparse_feature_dim, embedding_size],
param_attr=fluid.ParamAttr(name="SparseFeatFactors",
initializer=fluid.initializer.Uniform()))
seq = fluid.layers.sequence_pool(input=emb, pool_type='average')
        return emb, seq # corresponds to change 1 above
dense_input = fluid.layers.data(
name="dense_input", shape=[dense_feature_dim], dtype='float32')
sparse_input_ids = [
fluid.layers.data(name="C" + str(i), shape=[1], lod_level=1, dtype='int64')
for i in range(1, 27)]
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
words = [dense_input] + sparse_input_ids + [label]
sparse_embed_and_seq = list(map(embedding_layer, words[1:-1]))
    emb_list = [x[0] for x in sparse_embed_and_seq] # corresponds to change 2 above
sparse_embed_seq = [x[1] for x in sparse_embed_and_seq]
concated = fluid.layers.concat(sparse_embed_seq + words[0:1], axis=1)
    train_feed_vars = words # corresponds to change 2 above
inference_feed_vars = emb_list + words[0:1]
fc1 = fluid.layers.fc(input=concated, size=400, act='relu',
param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
scale=1 / math.sqrt(concated.shape[1]))))
fc2 = fluid.layers.fc(input=fc1, size=400, act='relu',
param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
scale=1 / math.sqrt(fc1.shape[1]))))
fc3 = fluid.layers.fc(input=fc2, size=400, act='relu',
param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
scale=1 / math.sqrt(fc2.shape[1]))))
predict = fluid.layers.fc(input=fc3, size=2, act='softmax',
param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
scale=1 / math.sqrt(fc3.shape[1]))))
cost = fluid.layers.cross_entropy(input=predict, label=words[-1])
avg_cost = fluid.layers.reduce_sum(cost)
accuracy = fluid.layers.accuracy(input=predict, label=words[-1])
auc_var, batch_auc_var, auc_states = \
fluid.layers.auc(input=predict, label=words[-1], num_thresholds=2 ** 12, slide_steps=20)
fetch_vars = [predict]
    # corresponds to change 3 above
return avg_cost, auc_var, batch_auc_var, train_feed_vars, inference_feed_vars, fetch_vars
```
Notes:
1) At change 1, we return the output variable of the embedding layer
2) At change 2, we save the output variables of the embedding layers into `emb_list`, which in turn goes into `inference_feed_vars`, used later to specify the feed variable list when calling `save_inference_model()`.
3) At change 3, we return `words` as the feed variable list for training (`train_feed_vars`), the output variables of the embedding layers as the feed variable list for inference (`inference_feed_vars`), and `predict` as the fetch target (`fetch_vars`). `inference_feed_vars` and `fetch_vars` are used to specify the feed variable list and the fetch target list when calling `fluid.io.save_inference_model()`
### 2.3 Save the pruned program with fluid.io.save_inference_model()
`fluid.io.save_inference_model()` not only saves the model parameters but also prunes the program according to its feed variable list and fetch target list arguments, producing a program suitable for inference. Roughly speaking, starting from the fetch target list, it walks the forward network configuration backwards, finds the OPs the targets depend on, adds each OP's inputs to the target variable list, and repeats this recursively until all dependent OPs and variables are found.
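In isolation the call has the following shape (a sketch; the argument names follow the Fluid 1.x API):
```python
# Prunes the program down to what is needed to compute fetch_vars from
# the feed variables, then saves program and parameters under model_dir.
fluid.io.save_inference_model(
    dirname=model_dir,
    feeded_var_names=[var.name for var in inference_feed_vars],
    target_vars=fetch_vars,
    executor=exe,
    main_program=fluid.default_main_program())
```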
In section 2.2 we already obtained the required `inference_feed_vars` and `fetch_vars`; all that remains is to call `fluid.io.save_inference_model()` instead each time the model parameters are saved during training
Before:
```python
def train_loop(args, train_program, py_reader, loss, auc_var, batch_auc_var,
trainer_num, trainer_id):
    # ... omitted
for pass_id in range(args.num_passes):
pass_start = time.time()
batch_id = 0
py_reader.start()
try:
while True:
loss_val, auc_val, batch_auc_val = pe.run(fetch_list=[loss.name, auc_var.name, batch_auc_var.name])
loss_val = np.mean(loss_val)
auc_val = np.mean(auc_val)
batch_auc_val = np.mean(batch_auc_val)
logger.info("TRAIN --> pass: {} batch: {} loss: {} auc: {}, batch_auc: {}"
.format(pass_id, batch_id, loss_val/args.batch_size, auc_val, batch_auc_val))
if batch_id % 1000 == 0 and batch_id != 0:
model_dir = args.model_output_dir + '/batch-' + str(batch_id)
if args.trainer_id == 0:
fluid.io.save_persistables(executor=exe, dirname=model_dir,
main_program=fluid.default_main_program())
batch_id += 1
except fluid.core.EOFException:
py_reader.reset()
print("pass_id: %d, pass_time_cost: %f" % (pass_id, time.time() - pass_start))
    # ... omitted
```
After:
```python
def train_loop(args,
train_program,
train_feed_vars,
               inference_feed_vars,  # feed variable list used to prune the program
               fetch_vars,  # fetch target list used to prune the program
loss,
auc_var,
batch_auc_var,
trainer_num,
trainer_id):
    # py_reader has been removed, so use fluid's built-in DataFeeder here
dataset = reader.CriteoDataset(args.sparse_feature_dim)
train_reader = paddle.batch(
paddle.reader.shuffle(
dataset.train([args.train_data_path], trainer_num, trainer_id),
buf_size=args.batch_size * 100),
batch_size=args.batch_size)
inference_feed_var_names = [var.name for var in inference_feed_vars]
place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
total_time = 0
pass_id = 0
batch_id = 0
    feed_var_names = [var.name for var in train_feed_vars]
feeder = fluid.DataFeeder(feed_var_names, place)
for data in train_reader():
loss_val, auc_val, batch_auc_val = exe.run(fluid.default_main_program(),
feed = feeder.feed(data),
fetch_list=[loss.name, auc_var.name, batch_auc_var.name])
        model_dir = args.model_output_dir + "/inference_only"  # read back in section 2.4
        fluid.io.save_inference_model(model_dir,
inference_feed_var_names,
fetch_vars,
exe,
fluid.default_main_program())
        break  # we only need the pruned program, not the parameters, so stop after one batch
loss_val = np.mean(loss_val)
auc_val = np.mean(auc_val)
batch_auc_val = np.mean(batch_auc_val)
logger.info("TRAIN --> pass: {} batch: {} loss: {} auc: {}, batch_auc: {}"
.format(pass_id, batch_id, loss_val/args.batch_size, auc_val, batch_auc_val))
```
### 2.4 Post-process the inference program with python to remove the lookup_table OPs and the SparseFeatFactors variable
This step is required because the program pruned by `fluid.io.save_inference_model()` still contains the lookup_table OPs. If the `save_inference_model` interface is improved in the future, this section can be skipped
Main code:
```python
import six
from google.protobuf import text_format
from paddle.fluid.proto import framework_pb2  # proto of the Program desc

def prune_program():
    args = parse_args()
    # open the network configuration file from disk and deserialize it into a protobuf message
model_dir = args.model_output_dir + "/inference_only"
model_file = model_dir + "/__model__"
with open(model_file, "rb") as f:
protostr = f.read()
f.close()
proto = framework_pb2.ProgramDesc.FromString(six.binary_type(protostr))
    # remove the lookup_table OPs
block = proto.blocks[0]
kept_ops = [op for op in block.ops if op.type != "lookup_table"]
del block.ops[:]
block.ops.extend(kept_ops)
    # remove the SparseFeatFactors var
kept_vars = [var for var in block.vars if var.name != "SparseFeatFactors"]
del block.vars[:]
block.vars.extend(kept_vars)
    # write the results back to disk
with open(model_file + ".pruned", "wb") as f:
f.write(proto.SerializePartialToString())
f.close()
with open(model_file + ".prototxt.pruned", "w") as f:
f.write(text_format.MessageToString(proto))
f.close()
```
### 2.5 Putting the pruning steps together
We provide save_program.py, a complete script for pruning the CTR estimation model, released together with [CTR distributed training and Serving deployment](https://github.com/PaddlePaddle/Serving/blob/master/doc/DEPLOY.md); it can be found in the training script directory of the trainer and pserver containers, or downloaded [here](https://github.com/PaddlePaddle/Serving/tree/master/doc/resource).
## 3. The Whole Inference Flow
Client side:
1) Dense feature: read the 13 integer features of each dataset sample to form 1 dense feature
2) Sparse feature: read the 26 categorical features of each dataset sample and sign each with hash(str(feature_index) + feature_string) to obtain the feature ids, forming 26 sparse features (sketched below)
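A sketch of the signing in client step 2, assuming Python's built-in hash and the default sparse_feature_dim of 1000000:
```python
# Hypothetical helper: raw_features holds the 26 categorical strings of
# one sample; each is signed into an id in [0, hash_dim).
def sign_sparse_features(raw_features, hash_dim=1000000):
    return [hash(str(idx) + feat) % hash_dim
            for idx, feat in enumerate(raw_features)]
```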
Serving side:
1) Dense feature: the 13 floats of the dense feature are fed together into the LodTensor behind the network variable dense_input
2) Sparse feature: for the 26 sparse feature ids, the kv service is queried for the corresponding embedding vectors, which are fed into the output variables of the corresponding 26 embedding layers. In our pruned network these variables are named embedding_0.tmp_0, embedding_1.tmp_0, ... embedding_25.tmp_0
3) Run inference and fetch the prediction result.
# FAQ
## 1. How do I change the port configuration?
A service built with this framework needs to claim a port; the port number can be changed in the following ways:
- If port:xxx is specified in the inferservice_file, that port number is claimed;
- Otherwise, if --port:xxx is specified in gflags.conf, that port number is claimed (see the example below);
- Otherwise, the default port number specified in the program is used: 8010.
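For example, to move a serving instance to port 8011 via the second method, add one line to conf/gflags.conf, exactly as the cluster-building section above does for a new replica:
```shell
--port=8011
```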
## 2. Why does the response time of GPU inference fluctuate so much?
Paddle Serving relies on the PaddlePaddle inference library for its computation; on a GPU device, since a single GPU stream is currently shared within one process, the inference computations of multiple requests inside that process are strictly serialized. So if 2 requests arrive at a Serving instance at the same time, no matter how many worker threads the instance was started with, there is no speedup: the later request is queued until the earlier one has finished.
## 3. How do I make full use of the GPU's compute capacity?
As explained in question 2, due to the limits of the inference library, a single Serving process can bind only one GPU card and shares 1 GPU stream within the process, so all requests must be computed serially.
To raise GPU utilization, the approach currently available is to start multiple Serving processes on a single GPU card, each process binding one GPU stream, so that multiple streams compute in parallel. Whether this actually yields a speedup depends on several factors, mainly:
1. GPU compute used by a single stream: if a single stream already occupies more than 50% of the GPU compute, adding another stream is likely to make the jobs of the 2 streams queue against each other and slow both response times
2. GPU memory: a Serving process loads the model parameters into GPU memory and allocates temporary variables from the GPU memory pool during computation; if a single Serving process already uses more than 50% of GPU memory, adding another Serving process will run out of GPU memory and make the process exit with an error
The following test procedure can therefore be used:
1. When loading the model, choose FLUID_GPU_ANALYSIS or FLUID_GPU_ANALYSIS_DIR as the model type in model_toolkit.prototxt; the model is then statically analyzed and GPU memory is optimized to some extent
2. After step 1, start a single Serving process with the arguments `--gpuid=N --bthread_concurrency=4 --bthread_min_concurrency=4`; start one client and stress-test with concurrency 1, increasing the batch size step by step while recording the average latency; because of the compute limit, once the batch size grows large enough the response time should increase noticeably, or, even if it does not, it may no longer meet the system requirements
3. Start 1 more Serving process with almost the same arguments as in step 2: `--gpuid=N --bthread_concurrency=4 --bthread_min_concurrency=4 --port=8011`, where --port=8011 gives the new process a new service port; then stress-test both Serving processes at the same time and again observe how the average response time changes as the batch size grows, until a good trade-off between batch size and response time is found
4. Repeat steps 2-3
5. Use the tests of steps 2-4 to decide how many Serving processes can share one GPU card; in the actual deployment, start that many Serving processes on each GPU card to serve simultaneously
# HTTP Interface
Every Paddle Serving service can be accessed through an HTTP interface; the client only needs to construct a json string following the Request message format defined by the Service. The client builds the HTTP request and sends the json data to the serving side as a POST request; the serving side **automatically** converts the json data into a protobuf message according to the Protobuf message format defined by the Service.
This document describes how to access Serving's HTTP service interface from python and PHP.
## 1. Access Address
The HTTP service of a Serving node shares its port with the C++ service (e.g. 8010); the URL scheme is:
```
http://127.0.0.1:8010/ServiceName/inference
http://127.0.0.1:8010/ServiceName/debug
```
where ServiceName should match what is configured in Serving's configuration file `conf/services.prototxt`. Suppose the following 2 services are configured:
```protobuf
services {
name: "BuiltinTestEchoService"
workflows: "workflow3"
}
services {
name: "TextClassificationService"
workflows: "workflow6"
}
```
Then the HTTP URLs for accessing these 2 Serving services are:
```
http://127.0.0.1:8010/BuiltinTestEchoService/inference
http://127.0.0.1:8010/BuiltinTestEchoService/debug
http://127.0.0.1:8010/TextClassificationService/inference
http://127.0.0.1:8010/TextClassificationService/debug
```
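Equivalently, such an endpoint can be exercised from the command line; a hypothetical request against the text classification service might look like this:
```shell
curl -H "Content-Type: application/json" \
     -X POST \
     -d '{"instances": [{"ids": [190, 1, 70, 382, 914, 5146]}]}' \
     http://127.0.0.1:8010/TextClassificationService/inference
```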
## 2. Accessing HTTP Serving from Python
The key to accessing HTTP Serving from Python is constructing the request data in json format, which can be done as follows:
1) Build a python object following the Request message format defined by the Service
2) Convert the python object into a json string with functions such as `json.dump()` / `json.dumps()`
Taking TextClassificationService as an example, the key code is as follows:
```python
import json
import http.client  # named httplib in Python 2

# Connect to server
conn = http.client.HTTPConnection("127.0.0.1", 8010)

# samples is a list; each element is a list of ids:
# samples[0] = [190, 1, 70, 382, 914, 5146, 190...]
for i in range(0, len(samples) - BATCH_SIZE, BATCH_SIZE):
    # build one batch of prediction data
    batch = samples[i: i + BATCH_SIZE]
    ids = []
    for x in batch:
        ids.append({"ids": x})
    ids = {"instances": ids}

    # convert the python object to json
    request_json = json.dumps(ids)

    # send the HTTP request and print the response
    try:
        conn.request('POST', "/TextClassificationService/inference",
                     request_json, {"Content-Type": "application/json"})
        response = conn.getresponse()
        print(response.read())
    except http.client.HTTPException as e:
        print(e)
```
For a complete example, see [text_classification.py](https://github.com/PaddlePaddle/Serving/blob/develop/tools/cpp_examples/demo-client/python/text_classification.py)
## 3. Accessing HTTP Serving from PHP
The steps to build the json string in PHP are:
1) Build a PHP array following the Request message format defined by the Service
2) Convert the PHP array into a json string with `json_encode()`
Taking TextClassificationService as an example, the key code is as follows:
```PHP
function http_post(&$ch, $data) {
// array to json string
$data_string = json_encode($data);
    // wrap the post data
curl_setopt($ch, CURLOPT_POSTFIELDS, $data_string);
// set header
curl_setopt($ch,
CURLOPT_HTTPHEADER,
array(
'Content-Length: ' . strlen($data_string)
)
);
    // execute the request
$result = curl_exec($ch);
return $result;
}
// http_connect() is defined in the complete example linked below
$ch = &http_connect('http://127.0.0.1:8010/TextClassificationService/inference');
$count = 0;
// $samples is a 2-level array; each element is an array like:
// $samples[0] = array(
//     "ids" => array(
//         [0] => int(190),
//         [1] => int(1),
//         [2] => int(70),
//         [3] => int(382),
//         [4] => int(914),
//         [5] => int(5146),
//         [6] => int(190)...)
// )
for ($i = 0; $i < count($samples) - BATCH_SIZE; $i += BATCH_SIZE) {
$instances = array_slice($samples, $i, BATCH_SIZE);
echo http_post($ch, array("instances" => $instances)) . "\n";
}
curl_close($ch);
```
For the complete code, see [text_classification.php](https://github.com/PaddlePaddle/Serving/blob/develop/tools/cpp_examples/demo-client/php/text_classification.php)
# Model Zoo
This page lists the pre-trained models currently supported by Paddle Serving together with their download links
If you would like to contribute a new model to Paddle Serving, you can submit a [pull request](https://github.com/PaddlePaddle/Serving/pulls)
*Special thanks to [PaddlePaddle wholechain](https://www.paddlepaddle.org.cn/wholechain) and [PaddleHub](https://www.paddlepaddle.org.cn/hub) for providing some of the pre-trained models for Paddle Serving*
| Model | Type | Dataset | Size | Download | Sample Input |
| --- | --- | --- | --- | --- | --- |
| ResNet_V2_50 | PaddleClas | ImageNet | 90.78 MB | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageClassification/resnet_v2_50_imagenet.tar.gz) | [daisy.jpg](../examples/PaddleClas/resnet_v2_50/daisy.jpg) |
| MobileNet_v2 | PaddleClas | ImageNet | 8.06 MB | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageClassification/mobilenet_v2_imagenet.tar.gz) | [daisy.jpg](../examples/PaddleClas/mobilenet/daisy.jpg) |
| Bert | PaddleNLP | zhwiki | 361.96 MB | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SemanticModel/bert_chinese_L-12_H-768_A-12.tar.gz) | [data-c.txt](../examples/PaddleNLP/data-c.txt) |
| Senta | PaddleNLP | Baidu | 578.37 MB | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SentimentAnalysis/senta_bilstm.tar.gz) | |
Refer to the [examples](../examples) for more details on the above models.
@@ -38,7 +38,7 @@ do
     awk 'BEGIN {max = 0} {if(NR>1){if ($1 > max) max=$1}} END {print "MAX_GPU_MEMORY:", max}' gpu_use.log >> profile_log_$1
     awk 'BEGIN {max = 0} {if(NR>1){if ($1 > max) max=$1}} END {print "GPU_UTILIZATION:", max}' gpu_utilization.log >> profile_log_$1
     rm -rf gpu_use.log gpu_utilization.log
-    $PYTHONROOT/bin/python ../util/show_profile.py profile $thread_num >> profile_log
+    $PYTHONROOT/bin/python ../../../util/show_profile.py profile $thread_num >> profile_log
     tail -n 8 profile >> profile_log
     echo "" >> profile_log_$1
 done
......
@@ -46,7 +46,7 @@ do
     awk 'BEGIN {max = 0} {if(NR>1){if ($1 > max) max=$1}} END {print "MAX_GPU_MEMORY:", max}' gpu_memory_use.log >> profile_log_$1
     awk -F" " '{sum+=$1} END {print "GPU_UTILIZATION:", sum/NR, sum, NR }' gpu_utilization.log.tmp >> profile_log_$1
     rm -rf gpu_memory_use.log gpu_utilization.log gpu_utilization.log.tmp
-    python3.6 ../util/show_profile.py profile $thread_num >> profile_log_$1
+    python3.6 ../../../util/show_profile.py profile $thread_num >> profile_log_$1
     tail -n 10 profile >> profile_log_$1
     echo "" >> profile_log_$1
 done
......
@@ -4,11 +4,11 @@
 ### Introduction
-PaddleDetection flying paddle target detection development kit is designed to help developers complete the whole development process of detection model formation, training, optimization and deployment faster and better. For details, see [Github](https://github.com/PaddlePaddle/PaddleDetection/tree/master/dygraph)
+PaddleDetection flying paddle target detection development kit is designed to help developers complete the whole development process of detection model formation, training, optimization and deployment faster and better. For details, see [Github](https://github.com/PaddlePaddle/PaddleDetection/tree/master)
 This article mainly introduces the deployment of Paddle Detection's dynamic graph model on Serving.
-Paddle Detection provides a large number of [Model Zoo](https://github.com/PaddlePaddle/PaddleDetection/blob/master/dygraph/docs/MODEL_ZOO_cn.md), these model libraries can be used in Paddle Serving with export tools Model. For the export tutorial, please refer to [Paddle Detection Export Model Tutorial (Simplified Chinese)](https://github.com/PaddlePaddle/PaddleDetection/blob/master/dygraph/deploy/EXPORT_MODEL.md).
+Paddle Detection provides a large number of [Model Zoo](https://github.com/PaddlePaddle/PaddleDetection/blob/master/docs/MODEL_ZOO_cn.md), these model libraries can be used in Paddle Serving with export tools Model. For the export tutorial, please refer to [Paddle Detection Export Model Tutorial (Simplified Chinese)](https://github.com/PaddlePaddle/PaddleDetection/blob/master/deploy/EXPORT_MODEL.md).
 ### Serving example
 Several examples of PaddleDetection models used in Serving are given in this folder
......
@@ -4,13 +4,13 @@
 ### Introduction
-PaddleDetection, the PaddlePaddle object detection development kit, aims to help developers complete the whole development workflow of detection models, from composing and training to optimization and deployment, faster and better. For details, see [Github](https://github.com/PaddlePaddle/PaddleDetection/tree/master/dygraph)
+PaddleDetection, the PaddlePaddle object detection development kit, aims to help developers complete the whole development workflow of detection models, from composing and training to optimization and deployment, faster and better. For details, see [Github](https://github.com/PaddlePaddle/PaddleDetection/tree/master)
 This article mainly introduces the deployment of Paddle Detection's dynamic graph models on Serving.
 ### Exporting models
-Paddle Detection provides a large [model zoo](https://github.com/PaddlePaddle/PaddleDetection/blob/master/dygraph/docs/MODEL_ZOO_cn.md); combined with the export tools, these models can all be turned into models usable by Paddle Serving. For the export tutorial, see [Paddle Detection model export tutorial](https://github.com/PaddlePaddle/PaddleDetection/blob/master/dygraph/deploy/EXPORT_MODEL.md)
+Paddle Detection provides a large [model zoo](https://github.com/PaddlePaddle/PaddleDetection/blob/master/docs/MODEL_ZOO_cn.md); combined with the export tools, these models can all be turned into models usable by Paddle Serving. For the export tutorial, see [Paddle Detection model export tutorial](https://github.com/PaddlePaddle/PaddleDetection/blob/master/deploy/EXPORT_MODEL.md)
 ### Serving examples
 This folder provides several examples of PaddleDetection models used for Serving
......
@@ -19,11 +19,10 @@ from paddle_serving_app.reader import *
 import cv2
 preprocess = DetectionSequential([
-    DetectionFile2Image(),
-    DetectionResize((800, 1333), True, interpolation=2),
+    DetectionFile2Image(), DetectionResize(
+        (800, 1333), True, interpolation=2),
     DetectionNormalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225], True),
-    DetectionTranspose((2,0,1)),
-    DetectionPadStride(32)
+    DetectionTranspose((2, 0, 1)), DetectionPadStride(32)
 ])
 postprocess = RCNNPostprocess("label_list.txt", "output")
......
@@ -19,12 +19,11 @@ from paddle_serving_app.reader import *
 import cv2
 preprocess = DetectionSequential([
     DetectionFile2Image(),
     DetectionNormalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225], True),
     DetectionResize(
-        (800, 1333), True, interpolation=cv2.INTER_LINEAR),
-    DetectionTranspose((2,0,1)),
-    DetectionPadStride(128)
+        (800, 1333), True, interpolation=cv2.INTER_LINEAR), DetectionTranspose(
+            (2, 0, 1)), DetectionPadStride(128)
 ])
 postprocess = RCNNPostprocess("label_list.txt", "output")
......
@@ -19,11 +19,10 @@ from paddle_serving_app.reader import *
 import cv2
 preprocess = DetectionSequential([
     DetectionFile2Image(),
     DetectionNormalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225], True),
     DetectionResize(
-        (608, 608), False, interpolation=2),
-    DetectionTranspose((2,0,1))
+        (608, 608), False, interpolation=2), DetectionTranspose((2, 0, 1))
 ])
 postprocess = RCNNPostprocess("label_list.txt", "output")
......
@@ -19,11 +19,10 @@ from paddle_serving_app.reader import *
 import cv2
 preprocess = DetectionSequential([
     DetectionFile2Image(),
-    DetectionResize(
-        (300, 300), False, interpolation=cv2.INTER_LINEAR),
+    DetectionResize((300, 300), False, interpolation=cv2.INTER_LINEAR),
     DetectionNormalize([104.0, 117.0, 123.0], [1.0, 1.0, 1.0], False),
-    DetectionTranspose((2,0,1)),
+    DetectionTranspose((2, 0, 1)),
 ])
 postprocess = RCNNPostprocess("label_list.txt", "output")
......
@@ -18,11 +18,10 @@ from paddle_serving_app.reader import *
 import cv2
 preprocess = DetectionSequential([
-    DetectionFile2Image(),
-    DetectionResize(
+    DetectionFile2Image(), DetectionResize(
         (512, 512), False, interpolation=cv2.INTER_LINEAR),
-    DetectionNormalize([123.675, 116.28, 103.53], [58.395, 57.12, 57.375], False),
-    DetectionTranspose((2,0,1))
+    DetectionNormalize([123.675, 116.28, 103.53], [58.395, 57.12, 57.375],
+                       False), DetectionTranspose((2, 0, 1))
 ])
 postprocess = RCNNPostprocess("label_list.txt", "output")
@@ -33,7 +32,6 @@ client.connect(['127.0.0.1:9494'])
 im, im_info = preprocess(sys.argv[1])
-
 fetch_map = client.predict(
     feed={
         "image": im,
......
@@ -19,11 +19,11 @@ from paddle_serving_app.reader import *
 import cv2
 preprocess = DetectionSequential([
     DetectionFile2Image(),
     DetectionResize(
         (608, 608), False, interpolation=2),
     DetectionNormalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225], True),
-    DetectionTranspose((2,0,1)),
+    DetectionTranspose((2, 0, 1)),
 ])
 postprocess = RCNNPostprocess("label_list.txt", "output")
......
@@ -43,7 +43,7 @@ do
     awk 'BEGIN {max = 0} {if(NR>1){if ($1 > max) max=$1}} END {print "MAX_GPU_MEMORY:", max}' gpu_memory_use.log >> profile_log_$1
     awk 'BEGIN {max = 0} {if(NR>1){if ($1 > max) max=$1}} END {print "GPU_UTILIZATION:", max}' gpu_utilization.log >> profile_log_$1
     rm -rf gpu_use.log gpu_utilization.log
-    $PYTHONROOT/bin/python3 ../util/show_profile.py profile $thread_num >> profile_log_$1
+    $PYTHONROOT/bin/python3 ../../../util/show_profile.py profile $thread_num >> profile_log_$1
     tail -n 8 profile >> profile_log_$1
     echo "" >> profile_log_$1
 done
......
@@ -101,7 +101,7 @@ python3 rec_web_client.py
 ## C++ OCR Service
-**Notice:** If you need to concatenate det model and rec model, and do pre-processing and post-processing in Paddle Serving C++ framework, you need to use the C++ server compiled with WITH_OPENCV option, see the [COMPILE.md](../../../doc/COMPILE.md)
+**Notice:** If you need to concatenate det model and rec model, and do pre-processing and post-processing in Paddle Serving C++ framework, you need to use the C++ server compiled with WITH_OPENCV option, see the [COMPILE.md](../../../../doc/Compile_EN.md)
 ### Start Service
 Select a startup mode according to CPU / GPU device
......
@@ -100,7 +100,7 @@ python3 rec_web_client.py
 ```
 ## C++ OCR Service
-**Note:** If you need to chain the det model and the rec model and do pre/post-processing inside the Paddle Serving C++ framework, you need a C++ Server compiled with the WITH_OPENCV option; see [COMPILE.md](../../../doc/COMPILE.md)
+**Note:** If you need to chain the det model and the rec model and do pre/post-processing inside the Paddle Serving C++ framework, you need a C++ Server compiled with the WITH_OPENCV option; see [COMPILE.md](../../../../doc/Compile_CN.md)
 ### Starting the service
 Choose a startup mode according to your CPU/GPU device
......
@@ -31,14 +31,18 @@ client.connect(["127.0.0.1:9293"])
 import paddle
 test_img_dir = "imgs/"
+
+
 def cv2_to_base64(image):
     return base64.b64encode(image)  #data.tostring()).decode('utf8')
 for img_file in os.listdir(test_img_dir):
     with open(os.path.join(test_img_dir, img_file), 'rb') as file:
         image_data = file.read()
     image = cv2_to_base64(image_data)
     fetch_map = client.predict(
-        feed={"image": image}, fetch = ["ctc_greedy_decoder_0.tmp_0", "softmax_0.tmp_0"], batch=True)
+        feed={"image": image},
+        fetch=["ctc_greedy_decoder_0.tmp_0", "softmax_0.tmp_0"],
+        batch=True)
     #print("{} {}".format(fetch_map["price"][0], data[0][1][0]))
     print(fetch_map)
@@ -4,6 +4,6 @@ do
$PYTHONROOT/bin/python benchmark.py --thread $thread_num --model ctr_client_conf/serving_client_conf.prototxt --request rpc > profile 2>&1
echo "========================================"
echo "batch size : $batch_size" >> profile_log
-$PYTHONROOT/bin/python ../util/show_profile.py profile $thread_num >> profile_log
+$PYTHONROOT/bin/python ../../../util/show_profile.py profile $thread_num >> profile_log
tail -n 1 profile >> profile_log
done
@@ -6,7 +6,7 @@ do
$PYTHONROOT/bin/python benchmark_batch.py --thread $thread_num --batch_size $batch_size --model serving_client_conf/serving_client_conf.prototxt --request rpc > profile 2>&1
echo "========================================"
echo "batch size : $batch_size" >> profile_log
-$PYTHONROOT/bin/python ../util/show_profile.py profile $thread_num >> profile_log
+$PYTHONROOT/bin/python ../../../util/show_profile.py profile $thread_num >> profile_log
tail -n 1 profile >> profile_log
done
done
@@ -21,6 +21,7 @@ from paddle_serving_client.metric import auc
import numpy as np
import sys
class CriteoReader(object):
    def __init__(self, sparse_feature_dim):
        self.cont_min_ = [0, -3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
@@ -52,6 +53,7 @@ class CriteoReader(object):
        return sparse_feature
py_version = sys.version_info[0]
client = Client()
@@ -68,8 +70,8 @@ for ei in range(10):
    data = reader.process_line(f.readline())
    feed_dict = {}
    for i in range(1, 27):
-        feed_dict["sparse_{}".format(i - 1)] = np.array(data[i-1]).reshape(-1)
-        feed_dict["sparse_{}.lod".format(i - 1)] = [0, len(data[i-1])]
+        feed_dict["sparse_{}".format(i - 1)] = np.array(data[i - 1]).reshape(-1)
+        feed_dict["sparse_{}.lod".format(i - 1)] = [0, len(data[i - 1])]
    fetch_map = client.predict(feed=feed_dict, fetch=["prob"])
    print(fetch_map)
end = time.time()
......
@@ -65,8 +65,8 @@ bash benchmark.sh
the average latency of threads
-![avg cost](../../../doc/images/criteo-cube-benchmark-avgcost.png)
+![avg cost](../../../../doc/images/criteo-cube-benchmark-avgcost.png)
The QPS is
-![qps](../../../doc/images/criteo-cube-benchmark-qps.png)
+![qps](../../../../doc/images/criteo-cube-benchmark-qps.png)
@@ -63,8 +63,8 @@ bash benchmark.sh
The average latency per thread is shown below
-![avg cost](../../../doc/images/criteo-cube-benchmark-avgcost.png)
+![avg cost](../../../../doc/images/criteo-cube-benchmark-avgcost.png)
The QPS per thread is shown below
-![qps](../../../doc/images/criteo-cube-benchmark-qps.png)
+![qps](../../../../doc/images/criteo-cube-benchmark-qps.png)
@@ -25,6 +25,8 @@ from network_conf import dnn_model
dense_feature_dim = 13
paddle.enable_static()
def train():
    args = parse_args()
    sparse_only = args.sparse_only
......
@@ -44,14 +44,13 @@ for ei in range(100):
    feed_dict['dense_input'] = np.array(data[0][0]).reshape(1, len(data[0][0]))
    for i in range(1, 27):
-        feed_dict["embedding_{}.tmp_0".format(i - 1)] = np.array(data[0][i]).reshape(len(data[0][i]))
+        feed_dict["embedding_{}.tmp_0".format(i - 1)] = np.array(data[0][
+            i]).reshape(len(data[0][i]))
        feed_dict["embedding_{}.tmp_0.lod".format(i - 1)] = [0, len(data[0][i])]
-    fetch_map = client.predict(feed=feed_dict, fetch=["prob"],batch=True)
+    fetch_map = client.predict(feed=feed_dict, fetch=["prob"], batch=True)
    print(fetch_map)
    prob_list.append(fetch_map['prob'][0][1])
    label_list.append(data[0][-1][0])
end = time.time()
print(end - start)
@@ -43,7 +43,7 @@ do
awk 'BEGIN {max = 0} {if(NR>1){if ($1 > max) max=$1}} END {print "MAX_GPU_MEMORY:", max}' gpu_memory_use.log >> profile_log_$1
awk 'BEGIN {max = 0} {if(NR>1){if ($1 > max) max=$1}} END {print "GPU_UTILIZATION:", max}' gpu_utilization.log >> profile_log_$1
rm -rf gpu_use.log gpu_utilization.log
-$PYTHONROOT/bin/python3 ../util/show_profile.py profile $thread_num >> profile_log_$1
+$PYTHONROOT/bin/python3 ../../util/show_profile.py profile $thread_num >> profile_log_$1
tail -n 8 profile >> profile_log_$1
echo "" >> profile_log_$1
done
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@@ -34,10 +33,13 @@ with open('processed.data') as f:
        "words.lod": [0, word_len]
    }
    fetch = ["acc", "cost", "prediction"]
-    [fetch_map, tag] = client.predict(feed=feed, fetch=fetch, need_variant_tag=True,batch=True)
-    if (float(fetch_map["prediction"][0][1]) - 0.5) * (float(label[0]) - 0.5) > 0:
+    [fetch_map, tag] = client.predict(
+        feed=feed, fetch=fetch, need_variant_tag=True, batch=True)
+    if (float(fetch_map["prediction"][0][1]) - 0.5) * (float(label[0]) - 0.5
+                                                       ) > 0:
        cnt[tag]['acc'] += 1
    cnt[tag]['total'] += 1
for tag, data in cnt.items():
-    print('[{}](total: {}) acc: {}'.format(tag, data['total'], float(data['acc'])/float(data['total']) ))
+    print('[{}](total: {}) acc: {}'.format(tag, data[
+        'total'], float(data['acc']) / float(data['total'])))
@@ -20,4 +20,5 @@ with open('test_data/part-0') as fin:
    with open('processed.data', 'w') as fout:
        for line in fin:
            word_ids, label = imdb_dataset.get_words_and_label(line)
-            fout.write("{};{}\n".format(','.join([str(x) for x in word_ids]), label[0]))
+            fout.write("{};{}\n".format(','.join([str(x) for x in word_ids]),
+                                        label[0]))
@@ -30,7 +30,7 @@ do
echo "model_name:$1" >> profile_log_$1
echo "batch_size:$batch_size" >> profile_log_$1
job_et=`date '+%Y%m%d%H%M%S'`
-$PYTHONROOT/bin/python3 ../util/show_profile.py profile $thread_num >> profile_log_$1
+$PYTHONROOT/bin/python3 ../../util/show_profile.py profile $thread_num >> profile_log_$1
$PYTHONROOT/bin/python3 cpu_utilization.py >> profile_log_$1
tail -n 8 profile >> profile_log_$1
echo "" >> profile_log_$1
......
@@ -23,6 +23,7 @@ logger = logging.getLogger("fluid")
logger.setLevel(logging.INFO)
paddle.enable_static()
def load_vocab(filename):
    vocab = {}
    with open(filename) as f:
......
@@ -17,8 +17,7 @@ from paddle_serving_app.reader import Sequential, File2Image, Resize, CenterCrop
from paddle_serving_app.reader import RGB2BGR, Transpose, Div, Normalize
client = Client()
-client.load_client_config(
-    "serving_client/serving_client_conf.prototxt")
+client.load_client_config("serving_client/serving_client_conf.prototxt")
client.connect(["127.0.0.1:9393"])
seq = Sequential([
@@ -28,5 +27,6 @@ seq = Sequential([
image_file = "daisy.jpg"
img = seq(image_file)
-fetch_map = client.predict(feed={"image": img}, fetch=["save_infer_model/scale_0.tmp_0"])
+fetch_map = client.predict(
+    feed={"image": img}, fetch=["save_infer_model/scale_0.tmp_0"])
print(fetch_map["save_infer_model/scale_0.tmp_0"].reshape(-1))
@@ -49,9 +49,7 @@ class ChineseBertReader(BertBaseReader):
        self.cls_id = self.vocab["[CLS]"]
        self.sep_id = self.vocab["[SEP]"]
        self.mask_id = self.vocab["[MASK]"]
-        self.feed_keys = [
-            "input_ids", "token_type_ids"
-        ]
+        self.feed_keys = ["input_ids", "token_type_ids"]
    """
    inner function
@@ -90,7 +88,7 @@ class ChineseBertReader(BertBaseReader):
            batch_text_type_ids,
            max_seq_len=self.max_seq_len,
            pad_idx=self.pad_id)
        return padded_token_ids, padded_text_type_ids
    """
    process function deals with a raw Chinese string as a sentence
......
@@ -32,6 +32,6 @@ for line in sys.stdin:
    feed_dict = reader.process(line)
    for key in feed_dict.keys():
        feed_dict[key] = np.array(feed_dict[key]).reshape((128, 1))
    # print(feed_dict)
    result = client.predict(feed=feed_dict, fetch=fetch, batch=False)
    print(result)
@@ -18,7 +18,8 @@ from paddle_serving_app.local_predict import LocalPredictor
import sys
predictor = LocalPredictor()
-predictor.load_model_config(sys.argv[1], use_lite=True, use_xpu=True, ir_optim=True)
+predictor.load_model_config(
+    sys.argv[1], use_lite=True, use_xpu=True, ir_optim=True)
seq = Sequential([
    File2Image(), Resize(256), CenterCrop(224), RGB2BGR(), Transpose((2, 0, 1)),
......
@@ -17,8 +17,7 @@ from paddle_serving_app.reader import Sequential, File2Image, Resize, CenterCrop
from paddle_serving_app.reader import RGB2BGR, Transpose, Div, Normalize
client = Client()
-client.load_client_config(
-    "serving_client/serving_client_conf.prototxt")
+client.load_client_config("serving_client/serving_client_conf.prototxt")
client.connect(["127.0.0.1:7702"])
seq = Sequential([
@@ -28,6 +27,7 @@ seq = Sequential([
image_file = "daisy.jpg"
img = seq(image_file)
-fetch_map = client.predict(feed={"image": img}, fetch=["save_infer_model/scale_0"])
+fetch_map = client.predict(
+    feed={"image": img}, fetch=["save_infer_model/scale_0"])
#print(fetch_map)
print(fetch_map["save_infer_model/scale_0"].reshape(-1))
dag:
    #Op resource type: True for the thread model, False for the process model
    is_thread_op: false
    #Profiling: True generates Timeline performance data (with some overhead); False disables it
    tracer:
        interval_s: 30
#HTTP port. rpc_port and http_port must not both be empty; when rpc_port is available and http_port is empty, http_port is not generated automatically
http_port: 18082
op:
    faster_rcnn:
        #Concurrency: thread-level when is_thread_op=True, otherwise process-level
        concurrency: 2
        local_service_conf:
            #Client type: brpc, grpc or local_predictor; local_predictor does not start a Serving service and predicts in-process
            client_type: local_predictor
            # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
            device_type: 1
            #Compute device IDs: "" or unset means CPU prediction; "0" or "0,1,2" means GPU prediction on those cards
            devices: '2'
            #Fetch list, based on the alias_name of fetch_var in the model; all outputs are returned if unset
            fetch_list:
            - save_infer_model/scale_0.tmp_1
            #Model path
            model_config: serving_server/
#RPC port. rpc_port and http_port must not both be empty; when rpc_port is empty and http_port is not, rpc_port is automatically set to http_port+1
rpc_port: 9998
#worker_num: maximum concurrency. When build_dag_each_worker=True, the framework creates worker_num processes, each with its own gRPC server and DAG
#When build_dag_each_worker=False, the framework sets max_workers=worker_num for the main thread's gRPC thread pool
worker_num: 20
# PPYOLO model on Pipeline Paddle Serving
(Simplified Chinese|[English](./README_CN.md))

### Get the model
```
wget --no-check-certificate https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/ppyolo_mbv3_large_coco.tar
```

### Start the service
```
tar xf ppyolo_mbv3_large_coco.tar
python3 web_service.py
```

### Run prediction
```
python3 pipeline_http_client.py
```
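The `pipeline_http_client.py` script is not reproduced in this diff. As a minimal sketch of what such a client does, assuming the service exposes `http_port` 18082 and an op named `ppyolo_mbv3` as in the pipeline config elsewhere in this PR (the image filename and the exact URL path are assumptions):

```
import base64
import json
import requests

# Encode a local test image as base64; the filename is an assumption.
with open("demo.jpg", "rb") as f:
    image = base64.b64encode(f.read()).decode("utf8")

# Pipeline HTTP services accept a key/value JSON payload.
url = "http://127.0.0.1:18082/ppyolo_mbv3/prediction"
data = {"key": ["image"], "value": [image]}
resp = requests.post(url=url, data=json.dumps(data))
print(resp.json())
```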
dag:
    #Op resource type: True for the thread model, False for the process model
    is_thread_op: false
    #Profiling: True generates Timeline performance data (with some overhead); False disables it
    tracer:
        interval_s: 30
#HTTP port. rpc_port and http_port must not both be empty; when rpc_port is available and http_port is empty, http_port is not generated automatically
http_port: 18082
op:
    ppyolo_mbv3:
        #Concurrency: thread-level when is_thread_op=True, otherwise process-level
        concurrency: 10
        local_service_conf:
            #Client type: brpc, grpc or local_predictor; local_predictor does not start a Serving service and predicts in-process
            client_type: local_predictor
            # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
            device_type: 1
            #Compute device IDs: "" or unset means CPU prediction; "0" or "0,1,2" means GPU prediction on those cards
            devices: '2'
            #Fetch list, based on the alias_name of fetch_var in the model; all outputs are returned if unset
            fetch_list:
            - save_infer_model/scale_0.tmp_1
            #Model path
            model_config: serving_server/
#RPC port. rpc_port and http_port must not both be empty; when rpc_port is empty and http_port is not, rpc_port is automatically set to http_port+1
rpc_port: 9998
#worker_num: maximum concurrency. When build_dag_each_worker=True, the framework creates worker_num processes, each with its own gRPC server and DAG
#When build_dag_each_worker=False, the framework sets max_workers=worker_num for the main thread's gRPC thread pool
worker_num: 20
# YOLOv3 model on Pipeline Paddle Serving
(Simplified Chinese|[English](./README.md))

### Get the model
```
wget --no-check-certificate https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/yolov3_darknet53_270e_coco.tar
```

### Start the WebService
```
tar xf yolov3_darknet53_270e_coco.tar
python3 web_service.py
```

### Run prediction
```
python3 pipeline_http_client.py
```
dag:
    #Op resource type: True for the thread model, False for the process model
    is_thread_op: false
    #Profiling: True generates Timeline performance data (with some overhead); False disables it
    tracer:
        interval_s: 30
#HTTP port. rpc_port and http_port must not both be empty; when rpc_port is available and http_port is empty, http_port is not generated automatically
http_port: 18082
op:
    yolov3:
        #Concurrency: thread-level when is_thread_op=True, otherwise process-level
        concurrency: 10
        local_service_conf:
            #Client type: brpc, grpc or local_predictor; local_predictor does not start a Serving service and predicts in-process
            client_type: local_predictor
            # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
            device_type: 1
            #Compute device IDs: "" or unset means CPU prediction; "0" or "0,1,2" means GPU prediction on those cards
            devices: '2'
            #Fetch list, based on the alias_name of fetch_var in the model; all outputs are returned if unset
            fetch_list:
            - save_infer_model/scale_0.tmp_1
            #Model path
            model_config: serving_server/
#RPC port. rpc_port and http_port must not both be empty; when rpc_port is empty and http_port is not, rpc_port is automatically set to http_port+1
rpc_port: 9998
#worker_num: maximum concurrency. When build_dag_each_worker=True, the framework creates worker_num processes, each with its own gRPC server and DAG
#When build_dag_each_worker=False, the framework sets max_workers=worker_num for the main thread's gRPC thread pool
worker_num: 20
#worker_num: maximum concurrency. When build_dag_each_worker=True, the framework creates worker_num processes, each with its own gRPC server and DAG
#When build_dag_each_worker=False, the framework sets max_workers=worker_num for the main thread's gRPC thread pool
worker_num: 20
#build_dag_each_worker: False builds one DAG inside the process; True builds an independent DAG in each worker process
build_dag_each_worker: false
dag:
    #Op resource type: True for the thread model, False for the process model
    is_thread_op: false
    #Profiling: True generates Timeline performance data (with some overhead); False disables it
    tracer:
        interval_s: 10
#HTTP port. rpc_port and http_port must not both be empty; when rpc_port is available and http_port is empty, http_port is not generated automatically
http_port: 18082
#RPC port. rpc_port and http_port must not both be empty; when rpc_port is empty and http_port is not, rpc_port is automatically set to http_port+1
rpc_port: 9998
op:
    bert:
        #Concurrency: thread-level when is_thread_op=True, otherwise process-level
        concurrency: 2
        #When the op config has no server_endpoints, the local service config is read from local_service_conf
        local_service_conf:
            #Client type: brpc, grpc or local_predictor; local_predictor does not start a Serving service and predicts in-process
            client_type: local_predictor
            # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
            device_type: 1
            #Compute device IDs: "" or unset means CPU prediction; "0" or "0,1,2" means GPU prediction on those cards
            devices: '2'
            #Fetch list, based on the alias_name of fetch_var in bert_seq128_model; all outputs are returned if unset
            fetch_list:
            #bert model path
            model_config: bert_seq128_model/
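To illustrate how the ports above are consumed, a minimal RPC client sketch, assuming the bert service is running on `rpc_port` 9998 and its op accepts a raw sentence under a key such as `data` (both the key name and the sentence are assumptions):

```
from paddle_serving_server.pipeline import PipelineClient

client = PipelineClient()
client.connect(['127.0.0.1:9998'])

# The pipeline op is assumed to do the seq128 preprocessing server-side.
ret = client.predict(feed_dict={"data": "An illustrative input sentence."})
print(ret)
```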
@@ -34,7 +34,7 @@ do
awk -F' ' '{sum+=$1} END {print "GPU_UTILIZATION:", sum/NR, sum, NR }' gpu_utilization.log.tmp >> profile_log_$modelname
# Show profiles
-python3 ../../util/show_profile.py profile $thread_num >> profile_log_$modelname
+python3 ../../../util/show_profile.py profile $thread_num >> profile_log_$modelname
tail -n 8 profile >> profile_log_$modelname
echo '' >> profile_log_$modelname
done
@@ -78,7 +78,7 @@ do
awk -F" " '{sum+=$1} END {print "GPU_UTILIZATION:", sum/NR, sum, NR }' gpu_utilization.log.tmp >> profile_log_$modelname
# Show profiles
-python3 ../../util/show_profile.py profile $thread_num >> profile_log_$modelname
+python3 ../../../util/show_profile.py profile $thread_num >> profile_log_$modelname
tail -n 8 profile >> profile_log_$modelname
echo "" >> profile_log_$modelname
done
......
@@ -38,6 +38,9 @@ op:
            #Fetch list, based on the alias_name of fetch_var in client_config
            fetch_list: ["concat_1.tmp_0"]
+            # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
+            device_type: 0
            #Compute device IDs: "" or unset means CPU prediction; "0" or "0,1,2" means GPU prediction on those cards
            devices: ""
@@ -71,6 +74,8 @@ op:
            #Fetch list, based on the alias_name of fetch_var in client_config
            fetch_list: ["ctc_greedy_decoder_0.tmp_0", "softmax_0.tmp_0"]
+            # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
+            device_type: 0
            #Compute device IDs: "" or unset means CPU prediction; "0" or "0,1,2" means GPU prediction on those cards
            devices: ""
......
@@ -28,4 +28,4 @@ Specific operation: Open the chrome browser, enter `chrome://tracing/` in the ad
The data visualization output is shown as follows. It uses the [bert as service example](https://github.com/PaddlePaddle/Serving/tree/develop/python/examples/bert) GPU inference service: the server runs prediction on 4 GPUs, the client starts 4 processes, and the batch size is 1. `bert_pre` represents the data preprocessing stage of the client, `client_infer` represents the stage where the client sends prediction requests and receives results, `process` is the process number of the client, and the second line of each process shows the timeline of each op of the server.
-![timeline](../../../doc/images/timeline-example.png)
+![timeline](../../doc/images/timeline-example.png)
@@ -28,4 +28,4 @@ python3 timeline_trace.py profile trace
The effect is shown below, using the [bert example](https://github.com/PaddlePaddle/Serving/tree/develop/python/examples/bert) GPU inference service, with the server predicting on 4 GPUs and the client running 4 processes at batch size 1. bert_pre is the client-side data preprocessing stage, client_infer is the stage where the client sends prediction requests and receives results, process is the client process number, and the second line of each process shows the timeline of each server op.
-![timeline](../../../doc/images/timeline-example.png)
+![timeline](../../doc/images/timeline-example.png)
@@ -73,7 +73,7 @@ java -cp paddle-serving-sdk-java-examples-0.0.1-jar-with-dependencies.jar Pipeli
1. In the examples, all models (not pipeline) need `--use_multilang` to start gRPC multi-language support, and the port number is 9393. If you need another port, you need to modify it in the java file.
-2. Currently Serving has launched the Pipeline mode (see [Pipeline Serving](../doc/PIPELINE_SERVING.md) for details). The Pipeline Serving Client for Java is released.
+2. Currently Serving has launched the Pipeline mode (see [Pipeline Serving](../doc/Python_Pipeline/Pipeline_Design_EN.md) for details). The Pipeline Serving Client for Java is released.
3. The parameters `ip` and `port` in PipelineClientExample.java (path: java/examples/src/main/java/[PipelineClientExample.java](./examples/src/main/java/PipelineClientExample.java)) need to match the `ip` and `port` of the corresponding pipeline server defined in config.yaml (taking the IMDB model ensemble as an example, path: python/examples/pipeline/imdb_model_ensemble/[config.yaml](../python/examples/pipeline/imdb_model_ensemble/config.yml)).
......
@@ -100,7 +100,7 @@ java -cp paddle-serving-sdk-java-examples-0.0.1-jar-with-dependencies.jar Pipeli
1. In the examples the port number is 9393 and the ip defaults to 127.0.0.1 for the local machine; note that ip and port must match the server side.
-2. Serving now provides the Pipeline mode (see [Pipeline Serving](../doc/PIPELINE_SERVING_CN.md) for details), and the Pipeline Serving Client for Java has been released.
+2. Serving now provides the Pipeline mode (see [Pipeline Serving](../doc/Python_Pipeline/Pipeline_Design_CN.md) for details), and the Pipeline Serving Client for Java has been released.
3. Note that the ip and port in PipelineClientExample.java (at java/examples/src/main/java/[PipelineClientExample.java](./examples/src/main/java/PipelineClientExample.java)) must match the ip and port configured in the corresponding Pipeline server's config.yaml (taking the IMDB model ensemble as an example, at python/examples/pipeline/imdb_model_ensemble/[config.yaml](../python/examples/pipeline/imdb_model_ensemble/config.yml)).
......
@@ -73,7 +73,7 @@ if (SERVER)
        set(VERSION_SUFFIX 101)
    elseif(CUDA_VERSION EQUAL 10.2)
        if(CUDNN_MAJOR_VERSION EQUAL 7)
-            set(VERSION_SUFFIX 1027)
+            set(VERSION_SUFFIX 102)
        elseif(CUDNN_MAJOR_VERSION EQUAL 8)
            set(VERSION_SUFFIX 1028)
        endif()
......
## Examples

### Support `--use_trt`

The following models support `--use_trt`, which means TensorRT can be used to accelerate inference with CUDA 10.1 or higher.

- imagenet ResNet50/ResNet101
- detection faster_rcnn/yolov3/pp-yolo/ttf-net
## Serving model examples

### Models supporting TensorRT (`--use_trt`)

The following models support TensorRT; `--use_trt` can be enabled to accelerate online prediction, while other models cannot enable it. A hedged launch sketch follows the list.

- imagenet ResNet50/ResNet101
- detection faster_rcnn/yolov3/pp-yolo/ttf-net
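As an illustration only, a typical launch command for one of the models above; the model directory, port, and GPU id are placeholders:

```
python3 -m paddle_serving_server.serve --model ResNet50_vd_serving --port 9393 --gpu_ids 0 --use_trt
```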
@@ -67,7 +67,7 @@ Preprocessing for Chinese word segmentation task.
- words(str): Original text input.
- crf_decode(np.array): CRF code predicted by model.
-[example](../examples/lac/lac_web_service.py)
+[example](../examples/lac/lac_http_client.py)
- class SentaReader
......
@@ -60,7 +60,7 @@ paddle_serving_app provides a variety of common preprocessing methods for CV and NLP model tasks
- words(str): original text
- crf_decode(np.array): the CRF code in the model's prediction result
-[example](../examples/lac/lac_web_service.py)
+[example](../examples/lac/lac_http_client.py)
- class SentaReader
......
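A minimal sketch of the reader described in both hunks above, assuming the `LACReader` class from `paddle_serving_app.reader`; the sample sentence and the commented postprocess call are illustrative:

```
from paddle_serving_app.reader import LACReader

reader = LACReader()
# Preprocess: turn raw Chinese text into the model's feed ids.
words = "我爱北京天安门"
feed_ids = reader.process(words)
# Postprocess (after prediction): parse the CRF decoding back into segments, e.g.
# segments = reader.parse_result(words, fetch_map["crf_decode"])
print(feed_ids)
```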
@@ -19,7 +19,7 @@ from paddle.fluid.framework import core
from paddle.fluid.framework import default_main_program
from paddle.fluid.framework import Program
from paddle.fluid import CPUPlace
-from paddle.fluid.io import save_inference_model
+from .paddle_io import save_inference_model, normalize_program
import paddle.fluid as fluid
from paddle.fluid.core import CipherUtils
from paddle.fluid.core import CipherFactory
@@ -191,12 +191,14 @@ def save_model(server_model_folder,
    executor = Executor(place=CPUPlace())
    feed_var_names = [feed_var_dict[x].name for x in feed_var_dict]
+    feed_vars = [feed_var_dict[x] for x in feed_var_dict]
    target_vars = []
    target_var_names = []
    for key in sorted(fetch_var_dict.keys()):
        target_vars.append(fetch_var_dict[key])
        target_var_names.append(key)
+    main_program = normalize_program(main_program, feed_vars, target_vars)
    if not encryption and not show_proto:
        if not os.path.exists(server_model_folder):
            os.makedirs(server_model_folder)
@@ -209,7 +211,7 @@ def save_model(server_model_folder,
        new_params_path = os.path.join(server_model_folder, params_filename)
        with open(new_model_path, "wb") as new_model_file:
-            new_model_file.write(main_program.desc.serialize_to_string())
+            new_model_file.write(main_program._remove_training_info(False).desc.serialize_to_string())
        paddle.static.save_vars(
            executor=executor,
@@ -229,7 +231,7 @@ def save_model(server_model_folder,
            key = CipherUtils.gen_key_to_file(128, "key")
            params = fluid.io.save_persistables(
                executor=executor, dirname=None, main_program=main_program)
-            model = main_program.desc.serialize_to_string()
+            model = main_program._remove_training_info(False).desc.serialize_to_string()
            if not os.path.exists(server_model_folder):
                os.makedirs(server_model_folder)
            os.chdir(server_model_folder)
......
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import errno
import inspect
import logging
import os
import warnings
import six
import numpy as np
import paddle
from paddle.fluid import (
core,
Variable,
CompiledProgram,
default_main_program,
Program,
layers,
unique_name,
program_guard, )
from paddle.fluid.io import prepend_feed_ops, append_fetch_ops
from paddle.fluid.framework import static_only, Parameter
from paddle.fluid.executor import Executor, global_scope
from paddle.fluid.log_helper import get_logger
__all__ = []
_logger = get_logger(
__name__, logging.INFO, fmt='%(asctime)s-%(levelname)s: %(message)s')
def _check_args(caller, args, supported_args=None, deprecated_args=None):
supported_args = [] if supported_args is None else supported_args
deprecated_args = [] if deprecated_args is None else deprecated_args
for arg in args:
if arg in deprecated_args:
raise ValueError(
"argument '{}' in function '{}' is deprecated, only {} are supported.".
format(arg, caller, supported_args))
elif arg not in supported_args:
raise ValueError(
"function '{}' doesn't support argument '{}',\n only {} are supported.".
format(caller, arg, supported_args))
def _check_vars(name, var_list):
if not isinstance(var_list, list):
var_list = [var_list]
if not var_list or not all([isinstance(var, Variable) for var in var_list]):
raise ValueError(
"'{}' should be a Variable or a list of Variable.".format(name))
def _normalize_path_prefix(path_prefix):
"""
convert path_prefix to absolute path.
"""
if not isinstance(path_prefix, six.string_types):
raise ValueError("'path_prefix' should be a string.")
if path_prefix.endswith("/"):
raise ValueError("'path_prefix' should not be a directory")
path_prefix = os.path.normpath(path_prefix)
path_prefix = os.path.abspath(path_prefix)
return path_prefix
def _get_valid_program(program=None):
"""
return default main program if program is None.
"""
if program is None:
program = default_main_program()
elif isinstance(program, CompiledProgram):
program = program._program
if program is None:
raise TypeError(
"The type of input program is invalid, expected tyep is Program, but received None"
)
warnings.warn(
"The input is a CompiledProgram, this is not recommended.")
if not isinstance(program, Program):
raise TypeError(
"The type of input program is invalid, expected type is fluid.Program, but received %s"
% type(program))
return program
def _clone_var_in_block(block, var):
assert isinstance(var, Variable)
if var.desc.type() == core.VarDesc.VarType.LOD_TENSOR:
return block.create_var(
name=var.name,
shape=var.shape,
dtype=var.dtype,
type=var.type,
lod_level=var.lod_level,
persistable=True)
else:
return block.create_var(
name=var.name,
shape=var.shape,
dtype=var.dtype,
type=var.type,
persistable=True)
def normalize_program(program, feed_vars, fetch_vars):
"""
:api_attr: Static Graph
Normalize/Optimize a program according to feed_vars and fetch_vars.
Args:
program(Program): Specify a program you want to optimize.
feed_vars(Variable | list[Variable]): Variables needed by inference.
fetch_vars(Variable | list[Variable]): Variables returned by inference.
Returns:
Program: Normalized/Optimized program.
Raises:
TypeError: If `program` is not a Program, an exception is thrown.
TypeError: If `feed_vars` is not a Variable or a list of Variable, an exception is thrown.
TypeError: If `fetch_vars` is not a Variable or a list of Variable, an exception is thrown.
Examples:
.. code-block:: python
import paddle
paddle.enable_static()
path_prefix = "./infer_model"
# User defined network, here a softmax regression example
image = paddle.static.data(name='img', shape=[None, 28, 28], dtype='float32')
label = paddle.static.data(name='label', shape=[None, 1], dtype='int64')
predict = paddle.static.nn.fc(image, 10, activation='softmax')
loss = paddle.nn.functional.cross_entropy(predict, label)
exe = paddle.static.Executor(paddle.CPUPlace())
exe.run(paddle.static.default_startup_program())
# normalize main program.
program = paddle.static.default_main_program()
normalized_program = paddle.static.normalize_program(program, [image], [predict])
"""
if not isinstance(program, Program):
raise TypeError(
"program type must be `fluid.Program`, but received `%s`" %
type(program))
if not isinstance(feed_vars, list):
feed_vars = [feed_vars]
if not all(isinstance(v, Variable) for v in feed_vars):
raise TypeError(
"feed_vars type must be a Variable or a list of Variable.")
if not isinstance(fetch_vars, list):
fetch_vars = [fetch_vars]
if not all(isinstance(v, Variable) for v in fetch_vars):
raise TypeError(
"fetch_vars type must be a Variable or a list of Variable.")
# remind users to set auc_states to 0 if auc op were found.
for op in program.global_block().ops:
# clear device of Op
device_attr_name = core.op_proto_and_checker_maker.kOpDeviceAttrName()
op._set_attr(device_attr_name, "")
if op.type == 'auc':
warnings.warn("Be sure that you have set auc states to 0 "
"before saving inference model.")
break
# fix the bug that the activation op's output as target will be pruned.
# will affect the inference performance.
# TODO(Superjomn) add an IR pass to remove 1-scale op.
#with program_guard(program):
# uniq_fetch_vars = []
# for i, var in enumerate(fetch_vars):
# if var.dtype != paddle.bool:
# var = layers.scale(
# var, 1., name="save_infer_model/scale_{}".format(i))
# uniq_fetch_vars.append(var)
# fetch_vars = uniq_fetch_vars
# serialize program
copy_program = program.clone()
global_block = copy_program.global_block()
remove_op_idx = []
for i, op in enumerate(global_block.ops):
op.desc.set_is_target(False)
if op.type == "feed" or op.type == "fetch":
remove_op_idx.append(i)
for idx in remove_op_idx[::-1]:
global_block._remove_op(idx)
copy_program.desc.flush()
feed_var_names = [var.name for var in feed_vars]
copy_program = copy_program._prune_with_input(
feeded_var_names=feed_var_names, targets=fetch_vars)
copy_program = copy_program._inference_optimize(prune_read_op=True)
fetch_var_names = [var.name for var in fetch_vars]
prepend_feed_ops(copy_program, feed_var_names)
append_fetch_ops(copy_program, fetch_var_names)
copy_program.desc._set_version()
return copy_program
def is_persistable(var):
"""
Check whether the given variable is persistable.
Args:
var(Variable): The variable to be checked.
Returns:
bool: True if the given `var` is persistable
False if not.
Examples:
.. code-block:: python
import paddle
import paddle.fluid as fluid
paddle.enable_static()
param = fluid.default_main_program().global_block().var('fc.b')
res = fluid.io.is_persistable(param)
"""
if var.desc.type() == core.VarDesc.VarType.FEED_MINIBATCH or \
var.desc.type() == core.VarDesc.VarType.FETCH_LIST or \
var.desc.type() == core.VarDesc.VarType.READER:
return False
return var.persistable
@static_only
def serialize_program(feed_vars, fetch_vars, **kwargs):
"""
:api_attr: Static Graph
Serialize default main program according to feed_vars and fetch_vars.
Args:
feed_vars(Variable | list[Variable]): Variables needed by inference.
fetch_vars(Variable | list[Variable]): Variables returned by inference.
kwargs: Supported keys include 'program'. Note that kwargs is mainly used for backward compatibility.
- program(Program): specify a program if you don't want to use default main program.
Returns:
bytes: serialized program.
Raises:
ValueError: If `feed_vars` is not a Variable or a list of Variable, an exception is thrown.
ValueError: If `fetch_vars` is not a Variable or a list of Variable, an exception is thrown.
Examples:
.. code-block:: python
import paddle
paddle.enable_static()
path_prefix = "./infer_model"
# User defined network, here a softmax regression example
image = paddle.static.data(name='img', shape=[None, 28, 28], dtype='float32')
label = paddle.static.data(name='label', shape=[None, 1], dtype='int64')
predict = paddle.static.nn.fc(image, 10, activation='softmax')
loss = paddle.nn.functional.cross_entropy(predict, label)
exe = paddle.static.Executor(paddle.CPUPlace())
exe.run(paddle.static.default_startup_program())
# serialize the default main program to bytes.
serialized_program = paddle.static.serialize_program([image], [predict])
# deserialize bytes to program
deserialized_program = paddle.static.deserialize_program(serialized_program)
"""
# verify feed_vars
_check_vars('feed_vars', feed_vars)
# verify fetch_vars
_check_vars('fetch_vars', fetch_vars)
program = _get_valid_program(kwargs.get('program', None))
program = normalize_program(program, feed_vars, fetch_vars)
return _serialize_program(program)
def _serialize_program(program):
"""
serialize given program to bytes.
"""
return program.desc.serialize_to_string()
@static_only
def serialize_persistables(feed_vars, fetch_vars, executor, **kwargs):
"""
:api_attr: Static Graph
Serialize parameters using given executor and default main program according to feed_vars and fetch_vars.
Args:
feed_vars(Variable | list[Variable]): Variables needed by inference.
fetch_vars(Variable | list[Variable]): Variables returned by inference.
kwargs: Supported keys include 'program'. Note that kwargs is mainly used for backward compatibility.
- program(Program): specify a program if you don't want to use default main program.
Returns:
bytes: serialized parameters.
Raises:
ValueError: If `feed_vars` is not a Variable or a list of Variable, an exception is thrown.
ValueError: If `fetch_vars` is not a Variable or a list of Variable, an exception is thrown.
Examples:
.. code-block:: python
import paddle
paddle.enable_static()
path_prefix = "./infer_model"
# User defined network, here a softmax regression example
image = paddle.static.data(name='img', shape=[None, 28, 28], dtype='float32')
label = paddle.static.data(name='label', shape=[None, 1], dtype='int64')
predict = paddle.static.nn.fc(image, 10, activation='softmax')
loss = paddle.nn.functional.cross_entropy(predict, label)
exe = paddle.static.Executor(paddle.CPUPlace())
exe.run(paddle.static.default_startup_program())
# serialize parameters to bytes.
serialized_params = paddle.static.serialize_persistables([image], [predict], exe)
# deserialize bytes to parameters.
main_program = paddle.static.default_main_program()
deserialized_params = paddle.static.deserialize_persistables(main_program, serialized_params, exe)
"""
# verify feed_vars
_check_vars('feed_vars', feed_vars)
# verify fetch_vars
_check_vars('fetch_vars', fetch_vars)
program = _get_valid_program(kwargs.get('program', None))
program = normalize_program(program, feed_vars, fetch_vars)
return _serialize_persistables(program, executor)
def _serialize_persistables(program, executor):
"""
Serialize parameters using given program and executor.
"""
vars_ = list(filter(is_persistable, program.list_vars()))
# warn if no variable found in model
if len(vars_) == 0:
warnings.warn("no variable in your model, please ensure there are any "
"variables in your model to save")
return None
# create a new program and clone persistable vars to it
save_program = Program()
save_block = save_program.global_block()
save_var_map = {}
for var in vars_:
if var.type != core.VarDesc.VarType.RAW:
var_copy = _clone_var_in_block(save_block, var)
save_var_map[var_copy.name] = var
# create in_vars and out_var, then append a save_combine op to save_program
in_vars = []
for name in sorted(save_var_map.keys()):
in_vars.append(save_var_map[name])
out_var_name = unique_name.generate("out_var")
out_var = save_block.create_var(
type=core.VarDesc.VarType.RAW, name=out_var_name)
out_var.desc.set_persistable(True)
save_block.append_op(
type='save_combine',
inputs={'X': in_vars},
outputs={'Y': out_var},
attrs={'file_path': '',
'save_to_memory': True})
# run save_program to save vars
# NOTE(zhiqiu): save op will add variable kLookupTablePath to save_program.desc,
# which leads to diff between save_program and its desc. Call _sync_with_cpp
# to keep consistency.
save_program._sync_with_cpp()
executor.run(save_program)
# return serialized bytes in out_var
return global_scope().find_var(out_var_name).get_bytes()
def save_to_file(path, content):
"""
Save content to given path.
Args:
path(str): Path to write content to.
content(bytes): Content to write.
Returns:
None
"""
if not isinstance(content, bytes):
raise ValueError("'content' type should be bytes.")
with open(path, "wb") as f:
f.write(content)
@static_only
def save_inference_model(path_prefix, feed_vars, fetch_vars, executor,
**kwargs):
"""
:api_attr: Static Graph
Save current model and its parameters to given path. i.e.
Given path_prefix = "/path/to/modelname", after invoking
save_inference_model(path_prefix, feed_vars, fetch_vars, executor),
you will find two files named modelname.pdmodel and modelname.pdiparams
under "/path/to", which represent your model and parameters respectively.
Args:
path_prefix(str): Directory path to save model + model name without suffix.
feed_vars(Variable | list[Variable]): Variables needed by inference.
fetch_vars(Variable | list[Variable]): Variables returned by inference.
executor(Executor): The executor that saves the inference model. You can refer
to :ref:`api_guide_executor_en` for more details.
kwargs: Supported keys include 'program' and 'clip_extra'. Note that kwargs is mainly used for backward compatibility.
- program(Program): specify a program if you don't want to use default main program.
- clip_extra(bool): set to True if you want to clip extra information for every operator.
Returns:
None
Raises:
ValueError: If `feed_vars` is not a Variable or a list of Variable, an exception is thrown.
ValueError: If `fetch_vars` is not a Variable or a list of Variable, an exception is thrown.
Examples:
.. code-block:: python
import paddle
paddle.enable_static()
path_prefix = "./infer_model"
# User defined network, here a softmax regression example
image = paddle.static.data(name='img', shape=[None, 28, 28], dtype='float32')
label = paddle.static.data(name='label', shape=[None, 1], dtype='int64')
predict = paddle.static.nn.fc(image, 10, activation='softmax')
loss = paddle.nn.functional.cross_entropy(predict, label)
exe = paddle.static.Executor(paddle.CPUPlace())
exe.run(paddle.static.default_startup_program())
# Feed data and train process
# Save inference model. Note we don't save label and loss in this example
paddle.static.save_inference_model(path_prefix, [image], [predict], exe)
# In this example, save_inference_model will prune the default
# main program according to the network's input node (img) and output node(predict).
# The pruned inference program is going to be saved in file "./infer_model.pdmodel"
# and parameters are going to be saved in file "./infer_model.pdiparams".
"""
# check path_prefix, set model_path and params_path
path_prefix = _normalize_path_prefix(path_prefix)
try:
# mkdir may conflict if pserver and trainer are running on the same machine
dirname = os.path.dirname(path_prefix)
os.makedirs(dirname)
except OSError as e:
if e.errno != errno.EEXIST:
raise
model_path = path_prefix + ".pdmodel"
params_path = path_prefix + ".pdiparams"
if os.path.isdir(model_path):
raise ValueError("'{}' is an existing directory.".format(model_path))
if os.path.isdir(params_path):
raise ValueError("'{}' is an existing directory.".format(params_path))
# verify feed_vars
_check_vars('feed_vars', feed_vars)
# verify fetch_vars
_check_vars('fetch_vars', fetch_vars)
program = _get_valid_program(kwargs.get('program', None))
clip_extra = kwargs.get('clip_extra', False)
program = normalize_program(program, feed_vars, fetch_vars)
# serialize and save program
program_bytes = _serialize_program(
program._remove_training_info(clip_extra=clip_extra))
save_to_file(model_path, program_bytes)
# serialize and save params
params_bytes = _serialize_persistables(program, executor)
save_to_file(params_path, params_bytes)
@@ -429,7 +429,7 @@ class Server(object):
        if device_type == "0":
            device_version = self.get_device_version()
        elif device_type == "1":
-            if version_suffix == "101" or version_suffix == "1027" or version_suffix == "1028" or version_suffix == "112":
+            if version_suffix == "101" or version_suffix == "102" or version_suffix == "1028" or version_suffix == "112":
                device_version = "gpu-" + version_suffix
            else:
                device_version = "gpu-cuda" + version_suffix
......
@@ -16,6 +16,7 @@ import signal
import os
import time
import json
+import platform
from paddle_serving_server.env import CONF_HOME
@@ -91,7 +92,10 @@ def dump_pid_file(portList, model):
        dump_pid_file([9494, 10082], 'serve')
    '''
    pid = os.getpid()
-    gid = os.getpgid(pid)
+    if platform.system() == "Windows":
+        gid = pid
+    else:
+        gid = os.getpgid(pid)
    pidInfoList = []
    filepath = os.path.join(CONF_HOME, "ProcessInfo.json")
    if os.path.exists(filepath):
......
@@ -108,7 +108,7 @@ class PipelineClient(object):
            one_tensor.name = key
            if isinstance(value, str):
-                one_tensor.string_data.add(value)
+                one_tensor.str_data.append(value)
                one_tensor.elem_type = 12  #12 => string in proto
                continue
......
@@ -83,7 +83,7 @@ RUN ln -sf /usr/local/bin/python3.6 /usr/local/bin/python3 && ln -sf /usr/local/
RUN rm -r /root/python_build
# Install Go and glide
-RUN wget -qO- https://dl.google.com/go/go1.14.linux-amd64.tar.gz | \
+RUN wget -qO- https://paddle-ci.cdn.bcebos.com/go1.17.2.linux-amd64.tar.gz | \
    tar -xz -C /usr/local && \
    mkdir /root/go && \
    mkdir /root/go/bin && \
......
@@ -83,7 +83,7 @@ RUN ln -sf /usr/local/bin/python3.6 /usr/local/bin/python3 && ln -sf /usr/local/
RUN rm -r /root/python_build
# Install Go and glide
-RUN wget -qO- https://dl.google.com/go/go1.14.linux-amd64.tar.gz | \
+RUN wget -qO- https://paddle-ci.cdn.bcebos.com/go1.17.2.linux-amd64.tar.gz | \
    tar -xz -C /usr/local && \
    mkdir /root/go && \
    mkdir /root/go/bin && \
......
@@ -83,7 +83,7 @@ RUN ln -sf /usr/local/bin/python3.6 /usr/local/bin/python3 && ln -sf /usr/local/
RUN rm -r /root/python_build
# Install Go and glide
-RUN wget -qO- https://dl.google.com/go/go1.14.linux-amd64.tar.gz | \
+RUN wget -qO- https://paddle-ci.cdn.bcebos.com/go1.17.2.linux-amd64.tar.gz | \
    tar -xz -C /usr/local && \
    mkdir /root/go && \
    mkdir /root/go/bin && \
......
# A image for building paddle binaries
# Use cuda devel base image for both cpu and gpu environment
# When you modify it, please be aware of cudnn-runtime version
FROM nvidia/cuda:11.2.0-cudnn8-devel-ubuntu16.04
MAINTAINER PaddlePaddle Authors <paddle-dev@baidu.com>
# ENV variables
ARG WITH_GPU
ARG WITH_AVX
ENV WITH_GPU=${WITH_GPU:-ON}
ENV WITH_AVX=${WITH_AVX:-ON}
ENV HOME /root
# Add bash enhancements
COPY tools/dockerfiles/root/ /root/
# Prepare packages for Python
RUN apt-get update && \
apt-get install -y make build-essential libssl-dev zlib1g-dev libbz2-dev \
libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev libncursesw5-dev \
xz-utils tk-dev libffi-dev liblzma-dev
RUN apt-get update && \
apt-get install -y --allow-downgrades --allow-change-held-packages \
patchelf git python-pip python-dev python-opencv openssh-server bison \
wget unzip unrar tar xz-utils bzip2 gzip coreutils ntp \
curl sed grep graphviz libjpeg-dev zlib1g-dev \
python-matplotlib unzip \
automake locales clang-format swig \
liblapack-dev liblapacke-dev libcurl4-openssl-dev \
net-tools libtool module-init-tools vim && \
apt-get clean -y
RUN ln -s /usr/lib/x86_64-linux-gnu/libssl.so /usr/lib/libssl.so.10 && \
ln -s /usr/lib/x86_64-linux-gnu/libcrypto.so /usr/lib/libcrypto.so.10
RUN wget https://github.com/koalaman/shellcheck/releases/download/v0.7.1/shellcheck-v0.7.1.linux.x86_64.tar.xz -O shellcheck-v0.7.1.linux.x86_64.tar.xz && \
tar -xf shellcheck-v0.7.1.linux.x86_64.tar.xz && cp shellcheck-v0.7.1/shellcheck /usr/bin/shellcheck && \
rm -rf shellcheck-v0.7.1.linux.x86_64.tar.xz shellcheck-v0.7.1
# Downgrade gcc&&g++
WORKDIR /usr/bin
COPY tools/dockerfiles/build_scripts /build_scripts
RUN bash /build_scripts/install_gcc.sh gcc82 && rm -rf /build_scripts
RUN cp gcc gcc.bak && cp g++ g++.bak && rm gcc && rm g++
RUN ln -s /usr/local/gcc-8.2/bin/gcc /usr/local/bin/gcc
RUN ln -s /usr/local/gcc-8.2/bin/g++ /usr/local/bin/g++
RUN ln -s /usr/local/gcc-8.2/bin/gcc /usr/bin/gcc
RUN ln -s /usr/local/gcc-8.2/bin/g++ /usr/bin/g++
ENV PATH=/usr/local/gcc-8.2/bin:$PATH
# install cmake
WORKDIR /home
RUN wget -q https://cmake.org/files/v3.16/cmake-3.16.0-Linux-x86_64.tar.gz && tar -zxvf cmake-3.16.0-Linux-x86_64.tar.gz && rm cmake-3.16.0-Linux-x86_64.tar.gz
ENV PATH=/home/cmake-3.16.0-Linux-x86_64/bin:$PATH
# Install Python3.6
RUN mkdir -p /root/python_build/ && wget -q https://www.sqlite.org/2018/sqlite-autoconf-3250300.tar.gz && \
tar -zxf sqlite-autoconf-3250300.tar.gz && cd sqlite-autoconf-3250300 && \
./configure -prefix=/usr/local && make -j8 && make install && cd ../ && rm sqlite-autoconf-3250300.tar.gz
RUN wget -q https://www.python.org/ftp/python/3.6.0/Python-3.6.0.tgz && \
tar -xzf Python-3.6.0.tgz && cd Python-3.6.0 && \
CFLAGS="-Wformat" ./configure --prefix=/usr/local/ --enable-shared > /dev/null && \
make -j8 > /dev/null && make altinstall > /dev/null && ldconfig && cd .. && rm -rf Python-3.6.0*
# Install Python3.7
RUN wget -q https://www.python.org/ftp/python/3.7.0/Python-3.7.0.tgz && \
tar -xzf Python-3.7.0.tgz && cd Python-3.7.0 && \
CFLAGS="-Wformat" ./configure --prefix=/usr/local/ --enable-shared > /dev/null && \
make -j8 > /dev/null && make altinstall > /dev/null && ldconfig && cd .. && rm -rf Python-3.7.0*
# Install Python3.8
RUN wget -q https://www.python.org/ftp/python/3.8.0/Python-3.8.0.tgz && \
tar -xzf Python-3.8.0.tgz && cd Python-3.8.0 && \
CFLAGS="-Wformat" ./configure --prefix=/usr/local/ --enable-shared > /dev/null && \
make -j8 > /dev/null && make altinstall > /dev/null && ldconfig && cd .. && rm -rf Python-3.8.0*
ENV LD_LIBRARY_PATH=/usr/local/lib:${LD_LIBRARY_PATH}
RUN ln -sf /usr/local/bin/python3.6 /usr/local/bin/python3 && ln -sf /usr/local/bin/python3.6 /usr/bin/python3 && ln -sf /usr/local/bin/pip3.6 /usr/local/bin/pip3 && ln -sf /usr/local/bin/pip3.6 /usr/bin/pip3
RUN rm -r /root/python_build
# Install Go and glide
RUN wget -qO- https://paddle-ci.cdn.bcebos.com/go1.17.2.linux-amd64.tar.gz | \
tar -xz -C /usr/local && \
mkdir /root/go && \
mkdir /root/go/bin && \
mkdir /root/go/src && \
echo "GOROOT=/usr/local/go" >> /root/.bashrc && \
echo "GOPATH=/root/go" >> /root/.bashrc && \
echo "PATH=/usr/local/go/bin:/root/go/bin:$PATH" >> /root/.bashrc
ENV GOROOT=/usr/local/go GOPATH=/root/go
# should not be in the same line with GOROOT definition, otherwise docker build could not find GOROOT.
ENV PATH=/usr/local/go/bin:/root/go/bin:${PATH}
# Install TensorRT
# The following TensorRT.tar.gz is not the default official one; we make two minor changes:
# 1. Remove the unnecessary files to make the library small. TensorRT.tar.gz only contains include and lib now,
# and its size is only one-third of the official one.
# 2. Manually add ~IPluginFactory() in IPluginFactory class of NvInfer.h, otherwise, it couldn't work in paddle.
# See https://github.com/PaddlePaddle/Paddle/issues/10129 for details.
# Downgrade TensorRT
COPY tools/dockerfiles/build_scripts /build_scripts
RUN bash /build_scripts/install_trt.sh cuda11.2
RUN rm -rf /build_scripts
# git credential to skip password typing
RUN git config --global credential.helper store
# Fix locales to en_US.UTF-8
RUN localedef -i en_US -f UTF-8 en_US.UTF-8
RUN apt-get install libprotobuf-dev -y
# Older versions of patchelf limited the size of the files being processed;
# this was fixed in the commit below, so we install a newer version here.
# https://github.com/NixOS/patchelf/commit/ba2695a8110abbc8cc6baf0eea819922ee5007fa
RUN wget -q https://paddle-ci.cdn.bcebos.com/patchelf_0.10-2_amd64.deb && \
dpkg -i patchelf_0.10-2_amd64.deb
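# Illustrative use of patchelf (a sketch with a hypothetical library path; the image
# only needs the binary available for later post-processing of bundled .so files):
#   patchelf --set-rpath '$ORIGIN' /path/to/some_bundled_lib.so
RUN patchelf --version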
# Configure OpenSSH server. c.f. https://docs.docker.com/engine/examples/running_ssh_service
RUN mkdir /var/run/sshd && echo 'root:root' | chpasswd && sed -ri 's/^PermitRootLogin\s+.*/PermitRootLogin yes/' /etc/ssh/sshd_config && sed -ri 's/UsePAM yes/#UsePAM yes/g' /etc/ssh/sshd_config
CMD source ~/.bashrc
# ccache 3.7.9
RUN wget https://paddle-ci.gz.bcebos.com/ccache-3.7.9.tar.gz && \
tar xf ccache-3.7.9.tar.gz && mkdir /usr/local/ccache-3.7.9 && cd ccache-3.7.9 && \
./configure -prefix=/usr/local/ccache-3.7.9 && \
make -j8 && make install && \
ln -s /usr/local/ccache-3.7.9/bin/ccache /usr/local/bin/ccache
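# ccache is installed but not enabled by default; to route builds through it
# (optional sketch), point the compiler variables at the wrapper:
#   export CC="ccache gcc" CXX="ccache g++"
RUN ccache --version | grep -q "3\.7\.9"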
RUN python3.8 -m pip install --upgrade pip==21.1.1 requests && \
python3.7 -m pip install --upgrade pip==21.1.1 requests && \
python3.6 -m pip install --upgrade pip==21.1.1 requests
RUN wget https://paddle-serving.bj.bcebos.com/others/centos_ssl.tar && \
tar xf centos_ssl.tar && rm -rf centos_ssl.tar && \
mv libcrypto.so.1.0.2k /usr/lib/libcrypto.so.1.0.2k && mv libssl.so.1.0.2k /usr/lib/libssl.so.1.0.2k && \
ln -sf /usr/lib/libcrypto.so.1.0.2k /usr/lib/libcrypto.so.10 && \
ln -sf /usr/lib/libssl.so.1.0.2k /usr/lib/libssl.so.10 && \
ln -sf /usr/lib/libcrypto.so.10 /usr/lib/libcrypto.so && \
ln -sf /usr/lib/libssl.so.10 /usr/lib/libssl.so
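# Note: the 1.0.2k libraries and *.so.10 symlinks above satisfy prebuilt binaries
# that expect the CentOS-style libssl.so.10/libcrypto.so.10 sonames.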
EXPOSE 22
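# Example usage once built (a sketch; the image tag 'serving-dev' is hypothetical):
#   docker build -t serving-dev .
#   docker run -d -p 2222:22 serving-dev /usr/sbin/sshd -D
#   ssh -p 2222 root@localhost    # password 'root', set via chpasswd above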
@@ -83,7 +83,7 @@ RUN ln -sf /usr/local/bin/python3.6 /usr/local/bin/python3 && ln -sf /usr/local/
 RUN rm -r /root/python_build
 # Install Go and glide
-RUN wget -qO- https://dl.google.com/go/go1.14.linux-amd64.tar.gz | \
+RUN wget -qO- https://paddle-ci.cdn.bcebos.com/go1.17.2.linux-amd64.tar.gz | \
     tar -xz -C /usr/local && \
     mkdir /root/go && \
     mkdir /root/go/bin && \
...
unset GREP_OPTIONS
function install_trt(){
    CUDA_VERSION=$(nvcc --version | egrep -o "V[0-9]+\.[0-9]+" | cut -c2-)
    if [ "$CUDA_VERSION" == "10.2" ]; then
        wget https://paddle-ci.gz.bcebos.com/TRT/TensorRT6-cuda10.2-cudnn7.tar.gz --no-check-certificate
        tar -zxf TensorRT6-cuda10.2-cudnn7.tar.gz -C /usr/local
        cp -rf /usr/local/TensorRT-6.0.1.8/include/* /usr/include/ && cp -rf /usr/local/TensorRT-6.0.1.8/lib/* /usr/lib/
        rm -rf TensorRT6-cuda10.2-cudnn7.tar.gz
    elif [ "$CUDA_VERSION" == "11.2" ]; then
        wget https://paddle-ci.gz.bcebos.com/TRT/TensorRT-8.0.3.4.Linux.x86_64-gnu.cuda-11.3.cudnn8.2.tar.gz --no-check-certificate
        tar -zxf TensorRT-8.0.3.4.Linux.x86_64-gnu.cuda-11.3.cudnn8.2.tar.gz -C /usr/local
        cp -rf /usr/local/TensorRT-8.0.3.4/include/* /usr/include/ && cp -rf /usr/local/TensorRT-8.0.3.4/lib/* /usr/lib/
        rm -rf TensorRT-8.0.3.4.Linux.x86_64-gnu.cuda-11.3.cudnn8.2.tar.gz
    else
        echo "No supported CUDA version found; skipping TensorRT installation"
    fi
}
function env_install()
{
    apt install -y libcurl4-openssl-dev libbz2-dev
    wget https://paddle-serving.bj.bcebos.com/others/centos_ssl.tar && \
        tar xf centos_ssl.tar && rm -rf centos_ssl.tar && \
        mv libcrypto.so.1.0.2k /usr/lib/libcrypto.so.1.0.2k && \
        mv libssl.so.1.0.2k /usr/lib/libssl.so.1.0.2k && \
        ln -sf /usr/lib/libcrypto.so.1.0.2k /usr/lib/libcrypto.so.10 && \
        ln -sf /usr/lib/libssl.so.1.0.2k /usr/lib/libssl.so.10 && \
        ln -sf /usr/lib/libcrypto.so.10 /usr/lib/libcrypto.so && \
        ln -sf /usr/lib/libssl.so.10 /usr/lib/libssl.so
    rm -rf /usr/local/go && wget -qO- https://paddle-ci.gz.bcebos.com/go1.15.12.linux-amd64.tar.gz | \
        tar -xz -C /usr/local && \
        mkdir /root/go && \
        mkdir /root/go/bin && \
        mkdir /root/go/src && \
        echo "GOROOT=/usr/local/go" >> /root/.bashrc && \
        echo "GOPATH=/root/go" >> /root/.bashrc && \
        echo "PATH=/usr/local/go/bin:/root/go/bin:$PATH" >> /root/.bashrc
    install_trt
}
env_install
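# Usage sketch (assumptions: the functions above are saved as a standalone script,
# here hypothetically named env_install.sh, and nvcc is on PATH inside the target
# container so install_trt can detect the CUDA version):
#   bash env_install.sh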