@@ -28,7 +28,7 @@ The goal of Paddle Serving is to provide high-performance, flexible and easy-to-
- Integrate high-performance server-side inference engine paddle Inference and mobile-side engine paddle Lite. Models of other machine learning platforms (Caffe/TensorFlow/ONNX/PyTorch) can be migrated to paddle through [x2paddle](https://github.com/PaddlePaddle/X2Paddle).
- There are two frameworks, namely high-performance C++ Serving and high-easy-to-use Python pipeline.The C++ Serving is based on the bRPC network framework to create a high-throughput, low-latency inference service, and its performance indicators are ahead of competing products. The Python pipeline is based on the gRPC/gRPC-Gateway network framework and the Python language to build a highly easy-to-use and high-throughput inference service. How to choose which one please see [Techinical Selection]()
- There are two frameworks, namely high-performance C++ Serving and high-easy-to-use Python pipeline.The C++ Serving is based on the bRPC network framework to create a high-throughput, low-latency inference service, and its performance indicators are ahead of competing products. The Python pipeline is based on the gRPC/gRPC-Gateway network framework and the Python language to build a highly easy-to-use and high-throughput inference service. How to choose which one please see [Techinical Selection](doc/Serving_Design_EN.md)
- Support multiple [protocols]() such as HTTP, gRPC, bRPC, and provide C++, Python, Java language SDK.
- Design and implement a high-performance inference service framework for asynchronous pipelines based on directed acyclic graph (DAG), with features such as multi-model combination, asynchronous scheduling, concurrent inference, dynamic batch, multi-card multi-stream inference, etc.- Adapt to a variety of commonly used computing hardwares, such as x86 (Intel) CPU, ARM CPU, Nvidia GPU, Kunlun XPU, etc.; Integrate acceleration libraries of Intel MKLDNN and Nvidia TensorRT, and low-precision and quantitative inference.
- Provide a model security deployment solution, including encryption model deployment, and authentication mechanism, HTTPs security gateway, which is used in practice.
...
...
@@ -54,46 +54,46 @@ The goal of Paddle Serving is to provide high-performance, flexible and easy-to-
This chapter guides you through the installation and deployment steps. It is strongly recommended to use Docker to deploy Paddle Serving. If you do not use docker, ignore the docker-related steps. Paddle Serving can be deployed on cloud servers using Kubernetes, running on many commonly hardwares such as ARM CPU, Intel CPU, Nvidia GPU, Kunlun XPU. The latest development kit of the develop branch is compiled and generated every day for developers to use.
-[Install Paddle Serving using docker](doc/Install.md)
-[Build Paddle Serving from Source with Docker](doc/COMPILE.md)
-[Deploy Paddle Serving on Kubernetes](doc/PADDLE_SERVING_ON_KUBERNETES.md)
-[Deploy Paddle Serving with Security gateway](doc/SERVING_AUTH_DOCKER.md)
-[Deploy Paddle Serving on more hardwares](doc/BAIDU_KUNLUN_XPU_SERVING.md)
-[Latest Wheel packages](doc/LATEST_PACKAGES.md)(Update everyday on branch develop)
-[Install Paddle Serving using docker](doc/Install_EN.md)
-[Build Paddle Serving from Source with Docker](doc/Compile_EN.md)
-[Deploy Paddle Serving on Kubernetes](doc/Run_On_Kubernetes_EN.md)
-[Deploy Paddle Serving with Security gateway](doc/Serving_Auth_Docker.md)
-[Deploy Paddle Serving on more hardwares](doc/Run_On_XPU_EN.md)
-[Latest Wheel packages](doc/Latest_Packages_CN.md)(Update everyday on branch develop)
> Use
The first step is to call the model save interface to generate a model parameter configuration file (.prototxt), which will be used on the client and server. The second step, read the configuration and startup parameters and start the service. According to API documents and your case, the third step is to write client requests based on the SDK, and test the inference service.
-[Quick Start](doc/QuickStart.md)
-[Save a servable model](doc/SAVE.md)
-[Description of configuration and startup parameters](doc/SERVING_CONFIGURE.md)
-[Guide for RESTful/gRPC/bRPC APIs](doc/HTTP_SERVICE_CN.md)
-[Infer on quantizative models](doc/LOW_PRECISION_DEPLOYMENT_CN.md)
-[Data format of classic models](doc/PROCESS_DATA.md)
@@ -132,7 +132,7 @@ If you want to communicate with developers and other users? Welcome to join us,
> Contribution
If you want to contribute code to Paddle Serving, please reference [Contribution Guidelines](doc/CONTRIBUTE.md)
If you want to contribute code to Paddle Serving, please reference [Contribution Guidelines](doc/Contribute_EN.md)
- Special Thanks to [@BeyondYourself](https://github.com/BeyondYourself) in complementing the gRPC tutorial, updating the FAQ doc and modifying the mdkir command
- Special Thanks to [@mcl-stone](https://github.com/mcl-stone) in updating faster_rcnn benchmark
It is recommended to use Docker for compilation. We have prepared the Paddle Serving compilation environment for you, see [this document](DOCKER_IMAGES.md).
It is recommended to use Docker for compilation. We have prepared the Paddle Serving compilation environment for you, see [this document](Docker_Images_EN.md).
**Note:** After the compilation is successful, you need to set the `SERVING_BIN` path, see the following [Notes](https://github.com/PaddlePaddle/Serving/blob/develop/doc/COMPILE.md#Notes).
**Note:** After the compilation is successful, you need to set the `SERVING_BIN` path, see the following [Notes](Compile_EN.md#Notes).
# Cube: Sparse Parameter Indexing Service (Local Mode)
([简体中文](./CUBE_LOCAL_CN.md)|English)
([简体中文](./Cube_Local_CN.md)|English)
## Overview
There are two examples on CTR under python / examples, they are criteo_ctr, criteo_ctr_with_cube. The former is to save the entire model during training, including sparse parameters. The latter is to cut out the sparse parameters and save them into two parts, one is the sparse parameter and the other is the dense parameter. Because the scale of sparse parameters is very large in industrial cases, reaching the order of 10 ^ 9. Therefore, it is not practical to start large-scale sparse parameter prediction on one machine. Therefore, we introduced Baidu's industrial-grade product Cube to provide the sparse parameter service for many years to provide distributed sparse parameter services.
The local mode of Cube is different from distributed Cube, which is designed to be convenient for developers to use in experiments and demos.
<!--If there is a demand for distributed sparse parameter service, please continue reading [Quantization Storage on Cube Sparse Parameter Indexing](./CUBE_QUANT.md) after reading this document (still developing).-->
<!--If there is a demand for distributed sparse parameter service, please continue reading [Quantization Storage on Cube Sparse Parameter Indexing](./Cube_Quant_EN.md) after reading this document (still developing).-->
This document uses the original model without any compression algorithm. If there is a need for a quantitative model to go online, please read the [Quantization Storage on Cube Sparse Parameter Indexing](./CUBE_QUANT.md)
This document uses the original model without any compression algorithm. If there is a need for a quantitative model to go online, please read the [Quantization Storage on Cube Sparse Parameter Indexing](./Cube_Quant_EN.md)
## Example
in directory python/example/criteo_ctr_with_cube, run
This document shows how to use Go as your client language. For Go client in Paddle Serving, a simple client package is provided https://github.com/PaddlePaddle/Serving/tree/develop/go/serving_client, a user can import this package as needed. Here is a simple example of sentiment analysis task based on IMDB dataset.
We **highly recommend** you to **run Paddle Serving in Docker**, please visit [Run in Docker](doc/RUN_IN_DOCKER.md). See the [document](doc/DOCKER_IMAGES.md) for more docker images.
We **highly recommend** you to **run Paddle Serving in Docker**, please visit [Run in Docker](Run_In_Docker_EN.md). See the [document](Docker_Images_EN.md) for more docker images.
**Attention:**: Currently, the default GPU environment of paddlepaddle 2.1 is Cuda 10.2, so the sample code of GPU Docker is based on Cuda 10.2. We also provides docker images and whl packages for other GPU environments. If users use other environments, they need to carefully check and select the appropriate version.
You may need to use a domestic mirror source(in China, you can use the Tsinghua mirror source, add `-i https://pypi.tuna.tsinghua.edu.cn/simple` to pip command) to speed up the download.
If you need install modules compiled with develop branch, please download packages from [latest packages list](./doc/LATEST_PACKAGES.md) and install with `pip install` command. If you want to compile by yourself, please refer to [How to compile Paddle Serving?](./doc/COMPILE.md)
If you need install modules compiled with develop branch, please download packages from [latest packages list](Latest_Package_CN.md) and install with `pip install` command. If you want to compile by yourself, please refer to [How to compile Paddle Serving?](Compile_EN.md)
Packages of paddle-serving-server and paddle-serving-server-gpu support Centos 6/7, Ubuntu 16/18, Windows 10.
...
...
@@ -69,11 +68,11 @@ The url corresponding to `cuda10.1-cudnn7-mkl-gcc8.2-avx-trt6.0.1.5`, copy it an
the default `paddlepaddle-gpu==2.1.0` is Cuda 10.2 with no TensorRT. If you want to install PaddlePaddle with TensorRT. please also check the documentation-multi-version whl package list and find key word `cuda10.2-cudnn8.0-trt7.1.3`. More info please check [Paddle Serving uses TensorRT](./doc/TENSOR_RT.md)
the default `paddlepaddle-gpu==2.1.0` is Cuda 10.2 with no TensorRT. If you want to install PaddlePaddle with TensorRT. please also check the documentation-multi-version whl package list and find key word `cuda10.2-cudnn8.0-trt7.1.3`.
If it is other environment and Python version, please find the corresponding link in the table and install it with pip.
For **Windows Users**, please read the document [Paddle Serving for Windows Users](./doc/WINDOWS_TUTORIAL.md)
For **Windows Users**, please read the document [Paddle Serving for Windows Users](Windows_Tutorial_EN.md)
Intel CPU supports int8 and bfloat16 models, NVIDIA TensorRT supports int8 and float16 models.
## Obtain the quantized model through PaddleSlim tool
### Obtain the quantized model through PaddleSlim tool
Train the low-precision models please refer to [PaddleSlim](https://paddleslim.readthedocs.io/zh_CN/latest/tutorials/quant/overview.html).
## Deploy the quantized model from PaddleSlim using Paddle Serving with Nvidia TensorRT int8 mode
### Deploy the quantized model from PaddleSlim using Paddle Serving with Nvidia TensorRT int8 mode
Firstly, download the [Resnet50 int8 model](https://paddle-inference-dist.bj.bcebos.com/inference_demo/python/resnet50/ResNet50_quant.tar.gz) and convert to Paddle Serving's saved model。
-[Performance Analysis And Optimization](Pipeline_Design_EN.md#6performance-analysis-and-optimization)
In many deep learning frameworks, Serving is usually used for the deployment of single model.but in the context of AI industrial, the end-to-end deep learning model can not solve all the problems at present. Usually, it is necessary to use multiple deep learning models to solve practical problems.However, the design of multi-model applications is complicated. In order to reduce the difficulty of development and maintenance, and to ensure the availability of services, serial or simple parallel methods are usually used. In general, the throughput only reaches the usable state and the GPU utilization rate is low.
This quick start example is mainly for those users who already have a model to deploy, and we also provide a model that can be used for deployment. in case if you want to know how to complete the process from offline training to online service, please refer to the AiStudio tutorial above.
One of the biggest benefits of Docker is portability, which can be deployed on multiple operating systems and mainstream cloud computing platforms. The Paddle Serving Docker image can be deployed on Linux, Mac and Windows platforms.
...
...
@@ -14,7 +14,7 @@ This document takes Python2 as an example to show how to run Paddle Serving in d
### Get docker image
Refer to [this document](DOCKER_IMAGES.md) for a docker image:
Refer to [this document](Docker_Images_EN.md) for a docker image:
@@ -67,9 +67,9 @@ The `-p` option is to map the `9292` port of the container to the `9292` port of
The mirror comes with `paddle_serving_server_gpu`, `paddle_serving_client`, and `paddle_serving_app` corresponding to the mirror tag version. If users don’t need to change the version, they can use it directly, which is suitable for environments without extranet services.
If you need to change the version, please refer to the instructions on the homepage to download the pip package of the corresponding version. [LATEST_PACKAGES](./LATEST_PACKAGES.md)
If you need to change the version, please refer to the instructions on the homepage to download the pip package of the corresponding version. [LATEST_PACKAGES](./Latest_Packages_CN.md)
## Precautious
- Runtime images cannot be used for compilation. If you want to compile from source, refer to [COMPILE](COMPILE.md).
- Runtime images cannot be used for compilation. If you want to compile from source, refer to [COMPILE](Compile_EN.md).
- If you use Cuda9 and Cuda10 docker images, you cannot use `paddle_serving_server` CPU version at the same time, due to the limitation of gcc version. If you want to use both in one docker image, please choose images of Cuda10.1, Cuda10.2 and Cuda11.
After the compilations stages above, the whl package will be generated in ```python/dist/``` under the specific temporary directories.
For example, after the Server Compiation step,the whl package will be produced under the server-build-arm/python/dist directory, and you can run ```pip install -u python/dist/*.whl``` to install the package.
# Request parameters description
## Request parameters description
In order to deploy serving
service on the arm server with Baidu Kunlun xpu chips and use the acceleration capability of Paddle-Lite,please specify the following parameters during deployment.
| param | param description | about |
...
...
@@ -64,13 +65,13 @@ In order to deploy serving
| use_lite | using Paddle-Lite Engine | use the inference capability of Paddle-Lite |
| use_xpu | using Baidu Kunlun for inference | need to be used with the use_lite option |
| ir_optim | open the graph optimization | refer to[Paddle-Lite](https://github.com/PaddlePaddle/Paddle-Lite) |
Some examples are provided below, and other models can be modifed with reference to these examples.
| sample name | sample links |
...
...
@@ -113,6 +114,6 @@ Some examples are provided below, and other models can be modifed with reference
Note:Supported model lists refer to [doc](https://paddlelite.paddlepaddle.org.cn/introduction/support_model_list.html). There are differences in the adaptation of different models, and there may be some unsupported cases. If you have any problem,please submit [Github issue](https://github.com/PaddlePaddle/Serving/issues), and we will follow up in real time.
## Kunlun chip related reference materials
### Kunlun chip related reference materials
*[PaddlePaddle on Baidu Kunlun xpu chips](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/xpu_docs/index_cn.html)
*[Deployment on Baidu Kunlun xpu chips using PaddleLite](https://paddlelite.paddlepaddle.org.cn/demo_guides/baidu_xpu.html)
@@ -12,7 +12,7 @@ This guide focuses on Paddle C++ Serving and Python Pipeline configuration:
## Model Configuration
The model configuration is generated by converting PaddleServing model and named serving_client_conf.prototxt/serving_server_conf.prototxt. It specifies the info of input/output so that users can fill parameters easily. The model configuration file should not be modified. See the [Saving guide](SAVE.md) for model converting. The model configuration file provided must be a `core/configure/proto/general_model_config.proto`.
The model configuration is generated by converting PaddleServing model and named serving_client_conf.prototxt/serving_server_conf.prototxt. It specifies the info of input/output so that users can fill parameters easily. The model configuration file should not be modified. See the [Saving guide](Save_EN.md) for model converting. The model configuration file provided must be a `core/configure/proto/general_model_config.proto`.
Example:
...
...
@@ -39,7 +39,7 @@ fetch_var {
- fetch_var:model output
- name:node name
- alias_name:alias name
- is_lod_tensor:lod tensor, ref to [Lod Introduction](LOD.md)
- is_lod_tensor:lod tensor, ref to [Lod Introduction](LOD_EN.md)
@@ -53,16 +53,16 @@ Paddle Serving takes into account a series of issues such as different operating
Cross-platform is not dependent on the operating system, nor on the hardware environment. Applications developed under one operating system can still run under another operating system. Therefore, the design should consider not only the development language and the cross-platform components, but also the interpretation differences of the compilers on different systems.
Docker is an open source application container engine that allows developers to package their applications and dependencies into a portable container, and then publish it to any popular Linux machine or Windows machine. We have packaged a variety of Docker images for the Paddle Serving framework. Refer to the image list《[Docker Images](DOCKER_IMAGES.md)》, Select mirrors according to user's usage. We provide Docker usage documentation《[How to run PaddleServing in Docker](RUN_IN_DOCKER.md)》.Currently, the Python webservice mode can be deployed and run on the native Linux and Windows dual systems.《[Paddle Serving for Windows Users](WINDOWS_TUTORIAL.md)》
Docker is an open source application container engine that allows developers to package their applications and dependencies into a portable container, and then publish it to any popular Linux machine or Windows machine. We have packaged a variety of Docker images for the Paddle Serving framework. Refer to the image list《[Docker Images](Docker_Images_EN.md)》, Select mirrors according to user's usage. We provide Docker usage documentation《[How to run PaddleServing in Docker](Run_In_Docker_EN.md)》.Currently, the Python webservice mode can be deployed and run on the native Linux and Windows dual systems.《[Paddle Serving for Windows Users](Windows_Tutorial_EN.md)》
> Support multiple development languages client SDKs
Paddle Serving provides 4 development language client SDKs, including Python, C++, Java, and Golang. Golang SDK is under construction, We hope that interested open source developers can help submit PR.
Paddle Serving provides 3 development language client SDKs, including Python, C++, Java, we hope that interested open source developers can help submit PR.
+ Python, Refer to the client example under python/examples or 4.2 web service example.
@@ -72,7 +72,7 @@ The inference framework of the well-known deep learning platform only supports C
Models trained on other deep learning platforms can be passed《[PaddlePaddle/X2Paddle工具](https://github.com/PaddlePaddle/X2Paddle)》.We convert multiple mainstream CV models to Paddle models. TensorFlow, Caffe, ONNX, PyTorch model conversion is tested.《[AIStudio教程-Paddle Serving服务化部署框架](https://www.paddlepaddle.org.cn/tutorials/projectdetail/1555945)》
Because it is impossible to directly view the feed and fetch parameter information in the model file, it is not convenient for users to assemble the parameters. Therefore, Paddle Serving developed a tool to convert the Paddle model into Serving format and generate a prototxt file containing feed and fetch parameter information. The following figure is the generated prototxt file of the uci_housing example. For more conversion methods, refer to the document《[How to save a servable model of Paddle Serving?](SAVE.md)》.
Because it is impossible to directly view the feed and fetch parameter information in the model file, it is not convenient for users to assemble the parameters. Therefore, Paddle Serving developed a tool to convert the Paddle model into Serving format and generate a prototxt file containing feed and fetch parameter information. The following figure is the generated prototxt file of the uci_housing example. For more conversion methods, refer to the document《[How to save a servable model of Paddle Serving?](Save_EN.md)》.
```
feed_var {
name: "x"
...
...
@@ -121,14 +121,14 @@ The core execution engine of Paddle Serving is a Directed acyclic graph(DAG). In
<p>
### 3.3 Model Management and Hot Reloading
C++ Serving supports model management functions, including management of multiple models and multiple model versions.In order to ensure the availability of services, the model needs to be hot loaded without service interruption. Paddle Serving supports this feature and provides a tool for monitoring output models to update local models. Please refer to [Hot loading in Paddle Serving](HOT_LOADING_IN_SERVING.md) for specific examples.
C++ Serving supports model management functions, including management of multiple models and multiple model versions.In order to ensure the availability of services, the model needs to be hot loaded without service interruption. Paddle Serving supports this feature and provides a tool for monitoring output models to update local models. Please refer to [Hot loading in Paddle Serving](C++_Serving/Hot_Loading_EN.md) for specific examples.
### 3.4 MOEDL ENCRYPTION INFERENCE
Paddle Serving uses a symmetric encryption algorithm to encrypt the model, and decrypts it in memory during the service loading model. At present, providing basic model security capabilities does not guarantee absolute model security. Users can improve them according to our design to achieve a higher level of security. Documentation reference《[MOEDL ENCRYPTION INFERENCE](ENCRYPTION.md)》
Paddle Serving uses a symmetric encryption algorithm to encrypt the model, and decrypts it in memory during the service loading model. At present, providing basic model security capabilities does not guarantee absolute model security. Users can improve them according to our design to achieve a higher level of security. Documentation reference《[MOEDL ENCRYPTION INFERENCE](C++_Serving/Encryption_EN.md)》
### 3.5 A/B Test
After sufficient offline evaluation of the model, online A/B test is usually needed to decide whether to enable the service on a large scale. The following figure shows the basic structure of A/B test with Paddle Serving. After the client is configured with the corresponding configuration, the traffic will be automatically distributed to different servers to achieve A/B test. Please refer to [ABTEST in Paddle Serving](ABTEST_IN_PADDLE_SERVING.md) for specific examples.
After sufficient offline evaluation of the model, online A/B test is usually needed to decide whether to enable the service on a large scale. The following figure shows the basic structure of A/B test with Paddle Serving. After the client is configured with the corresponding configuration, the traffic will be automatically distributed to different servers to achieve A/B test. Please refer to [ABTEST in Paddle Serving](C++_Serving/ABTEST_EN.md) for specific examples.
<palign="center">
<br>
...
...
@@ -193,7 +193,7 @@ The network framework of Pipeline Serving uses gRPC and gPRC gateway. The gRPC s
### 5.2 Core Design And Use Cases
The core design of Pipeline Serving is a graph execution engine, and the basic processing units are OP and Channel. A set of directed acyclic graphs can be realized through combination. Reference for design and use documents《[Pipeline Serving](PIPELINE_SERVING.md)》
The core design of Pipeline Serving is a graph execution engine, and the basic processing units are OP and Channel. A set of directed acyclic graphs can be realized through combination. Reference for design and use documents《[Pipeline Serving](Python_Pipeline/Pipeline_Design_EN.md)》
Deploying models trained on mainstream frameworks through the tensorRT tool launched by Nvidia can greatly increase the speed of model inference, which is often at least 1 times faster than the original framework, and it also takes up more device memory. less. Therefore, it is very useful for all users who need to deploy models to master the method of deploying deep learning models with tensorRT. Paddle Serving provides comprehensive TensorRT ecological support.
### surroundings
Serving Cuda10.1 Cuda10.2 and Cuda11 versions support TensorRT.
#### Install Paddle
In [Development using Docker environment](./RUN_IN_DOCKER.md) and [Docker image list](./DOCKER_IMAGES.md), we give the development image of TensorRT. After using the mirror to start, you need to install the Paddle whl package that supports TensorRT, refer to the documentation on the home page
```
# GPU Cuda10.2 environment please execute
pip install paddlepaddle-gpu==2.0.0
```
**Note**: If your Cuda version is not 10.2, please do not execute the above commands directly, you need to refer to [Paddle official documentation-multi-version whl package list
Select the URL link of the corresponding GPU environment and install it. For example, for Python2.7 users of Cuda 10.1, please select `cp27-cp27mu` and
`cuda10.1-cudnn7.6-trt6.0.1.5` corresponding url, copy it and execute
Since the default `paddlepaddle-gpu==2.0.0` is Cuda 10.2 and TensorRT is not built, if you need to use TensorRT on `paddlepaddle-gpu`, you need to find `cuda10 in the above multi-version whl package list .2-cudnn8.0-trt7.1.3`, download the corresponding Python version.
In [Serving model example](../python/examples), we have given models that can be accelerated using TensorRT, such as [Faster_RCNN model](../python/examples/detection/faster_rcnn_r50_fpn_1x_coco) under detection
The TensorRT version of the faster_rcnn model server is started
#### Local Predictor mode
In [local_predictor](../python/paddle_serving_app/local_predict.py#L52), users can explicitly specify `use_trt=True` and pass it to `load_model_config`.
Other methods are no different from other Local Predictor methods, and you need to pay attention to the compatibility of the model with TensorRT.
#### Pipeline Mode
In [Pipeline mode](./PIPELINE_SERVING.md), our [imagenet example](../python/examples/pipeline/imagenet/config.yml#L23) gives the way to set TensorRT.
@@ -97,7 +97,7 @@ r = requests.post(url=url, headers=headers, data=json.dumps(data))
print(r.json())
```
The user only needs to follow the above instructions and implement the relevant content in the corresponding function. For more information, please refer to [How to develop a new Web Service? ](./NEW_WEB_SERVICE.md)
The user only needs to follow the above instructions and implement the relevant content in the corresponding function. For more information, please refer to [How to develop a new Web Service? ](./C++_Serving/Http_Service_EN.md)