If you want to customize your Serving based on source code, use the version with the suffix -devel.
**The cuda10.1-cudnn7-gcc54 image is not ready yet; if you need it, please build it from the Dockerfile.**
| Description | OS | TAG | Dockerfile |
| :---------: | :-: | :-: | :--------: |
If you need to develop and compile based on the source code, please use the version with the suffix -devel.
**In the TAG column, 0.7.0 can also be replaced with the corresponding version number, such as 0.5.0 or 0.4.1. Note, however, that some development environments were only introduced in a later release, so not every environment has a corresponding image for every version.**
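For example, pulling a pinned development image might look like the following sketch; the registry is the one used elsewhere in this document, and the exact tag for your environment should be verified against the TAG column above.

```bash
# Pull a pinned development image; swap 0.7.0 for the release you need.
docker pull registry.baidubce.com/paddlepaddle/serving:0.7.0-devel
```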
Running Images are lighter than Develop Images: they are made up of the Serving whl and bin, but without develop tools like cmake, in order to keep the image size down. If you want to know more, please check the document [Paddle Serving on Kubernetes](./Run_On_Kubernetes_CN.md).
**Tips:** If you want to use the CPU server and the GPU server (version>=0.5.0) at the same time, you should check the gcc version: only Cuda10.1/10.2/11 can run together with the CPU server, owing to their shared gcc version (8.2).
**Strongly recommend** you build **Paddle Serving** in Docker. For more images, please refer to [Docker Image List](Docker_Images_CN.md).
**Tip-1**: This project only supports <mark>**Python3.6/3.7/3.8**</mark>; all subsequent Python/Pip operations must select the matching Python version.
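Because several Python versions may coexist in an image, it is safest to route pip through the interpreter explicitly. A minimal sketch, assuming `python3.7` is the version you want:

```bash
# Verify the interpreter, then run all pip operations through it.
python3.7 --version
python3.7 -m pip install --upgrade pip
```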
## Client
...
...
Kunlun users who use arm-xpu or x86-xpu can download the corresponding wheel packages.
Paddle Serving provides a user-friendly programming framework for multi-model composed services.
The Server side is built on the <b>RPC Service</b> and the <b>graph execution engine</b>. The relationship between them is shown in the following figure.
The graph execution engine consists of OPs and Channels, and the connected OPs share one Channel.
- For cases where large data needs to be transferred between OPs, consider using RAM DB as external global storage and transferring the data by passing index keys through the Channel.
- The following illustration shows the design of the Channel in the graph execution engine: an input buffer and an output buffer align data between multiple OP inputs and multiple OP outputs, with a queue in the middle for buffering.
Here, we build a simple IMDB model ensemble example to show how to use Pipeline Serving; to try it yourself, see the sketch after this paragraph. The relevant code can be found in the `Serving/examples/Pipeline/imdb_model_ensemble` folder. The Server-side structure in the example is shown in the following figure:
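A minimal sketch of running the example follows; the helper script and entry-point names (`get_data.sh`, `test_pipeline_server.py`, `test_pipeline_client.py`) are assumptions about this example's layout, so check the folder first.

```bash
cd Serving/examples/Pipeline/imdb_model_ensemble
sh get_data.sh                       # fetch the pre-trained models (assumed helper)
python3 test_pipeline_server.py &    # start the Pipeline server in the background
python3 test_pipeline_client.py      # send a test request
```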
One of the biggest benefits of Docker is portability: a containerized service can be deployed on multiple operating systems and mainstream cloud computing platforms. The Paddle Serving Docker image can be deployed on Linux, Mac and Windows.
## Requirements
Docker (GPU version requires nvidia-docker to be installed on the GPU machine)
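A quick sanity check of these requirements (the second command applies only to GPU machines with nvidia-docker installed):

```bash
docker --version
nvidia-docker version   # GPU machines only
```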
This document takes Python 3 as an example to show how to run Paddle Serving in Docker. Replace `python3` with the specific interpreter (3.6, 3.7 or 3.8) used in your environment if necessary.
## CPU
### Get docker image
Refer to [this document](Docker_Images_EN.md) for a docker image:
```bash
docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:latest-devel
docker exec -it test bash
```
The `-p` option maps port `9292` of the container to port `9292` of the host.
### Install PaddleServing
Please refer to the instructions on the homepage to download the pip package of the corresponding version.
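As a sketch, installing the CPU packages inside the container might look like this; pin the versions that match your image tag:

```bash
# Install the CPU server together with the client and app utilities.
python3 -m pip install paddle-serving-server paddle-serving-client paddle-serving-app
```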
## GPU
The GPU version is basically the same as the CPU version, with only some differences in interface naming (GPU version requires nvidia-docker to be installed on the GPU machine).
### Get docker image
Refer to [this document](Docker_Images_EN.md) for a docker image; the following is an example using a `cuda10.2-cudnn8` image:
```bash
nvidia-docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:latest-cuda10.2-cudnn8-devel
nvidia-docker exec -it test bash
```
or
```bash
docker run --gpus all -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:latest-cuda10.2-cudnn8-devel
docker exec -it test bash
```
The `-p` option maps port `9292` of the container to port `9292` of the host.
### Install PaddleServing
The image comes with `paddle_serving_server_gpu`, `paddle_serving_client`, and `paddle_serving_app` matching the image tag version. If you do not need to change the version, you can use them directly, which is convenient for environments without internet access.
If you need to change the version, please refer to the instructions on the homepage to download the pip package of the corresponding version; see [LATEST_PACKAGES](./Latest_Packages_CN.md).
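For example, switching to a specific GPU build could look like the following; the `post102` suffix for CUDA 10.2 is an assumption to confirm against [LATEST_PACKAGES](./Latest_Packages_CN.md) for your CUDA version.

```bash
# Hypothetical pin: 0.7.0 GPU server built against CUDA 10.2.
python3 -m pip install paddle-serving-server-gpu==0.7.0.post102
```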
## Precautions
- Runtime images cannot be used for compilation. If you want to compile from source, refer to [COMPILE](Compile_EN.md).
- If you use the Cuda9 or Cuda10 docker images, you cannot use the `paddle_serving_server` CPU version at the same time, due to the limitation of the gcc version. If you want to use both in one docker image, please choose the Cuda10.1, Cuda10.2, or Cuda11 images; a gcc check is sketched below.
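A quick way to check which gcc an image ships with before mixing packages in it (the image tag here is only an example):

```bash
docker run --rm registry.baidubce.com/paddlepaddle/serving:latest-cuda10.2-cudnn8-devel gcc --version
```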
The script converts the time-dot information in the log into a json format and saves it as a trace file.
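A sketch of invoking such a conversion script; the script name `timeline_trace.py` and the `profile`/`trace` file names are assumptions based on the profiling examples, so adjust them to your setup.

```bash
# Convert the profiling log into a json file that chrome://tracing/ can load.
python3 timeline_trace.py profile trace
```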
Specific operation: open the Chrome browser, enter `chrome://tracing/` in the address bar to jump to the tracing page, click the `load` button, and open the saved trace file to visualize the time information of each stage of the prediction service.
The data visualization output is shown below; it uses the [bert as service example](../C++/PaddleNLP/bert) GPU inference service. The server runs prediction on 4 GPUs, the client starts 4 `processes`, and the batch size is 1; the figure shows the timeline of each stage. Among them, `bert_pre` represents the data preprocessing stage of the client, and `client_infer` represents the stage in which the client sends prediction requests and receives the results. `process` represents the process number of the client, and the second line of each process shows the timeline of each op of the server.
3. Note that the ip and port in PipelineClientExample.java (located at java/examples/src/main/java/[PipelineClientExample.java](./examples/src/main/java/PipelineClientExample.java)) must match the ip and port configured in the corresponding Pipeline server's config.yaml file (taking the IMDB model ensemble model as an example, located at python/examples/pipeline/imdb_model_ensemble/[config.yaml](../examples/Pipeline/imdb_model_ensemble/config.yml)); a quick check is sketched below.
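A quick, hypothetical way to double-check the server side before editing the Java client (the path is assumed relative to the repository root):

```bash
# Show the port settings the Pipeline server will use.
grep -n "port" Serving/examples/Pipeline/imdb_model_ensemble/config.yml
```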
- The image preprocessing method is more flexible than the above method; it can be composed from the following classes. [Example](../../examples/C++/PaddleClas/imagenet/resnet50_rpc_client.py)
- class Sequential
...
...
This tool is convenient for analyzing the proportion of time occupied by each stage of the prediction service.
Load the trace file generated in the previous step through the load button; you can then visualize the time information of each stage of the prediction service.
The next figure shows the timeline of a GPU prediction service using the [bert example](../../examples/C++/PaddleNLP/bert).
The server side starts the service with 4 GPU cards, the client side starts 4 processes to send requests, and the batch size is 1.
In the figure, bert_pre represents the data pre-processing stage of the client, and client_infer represents the stage from the client sending the prediction request to receiving the result.
The process in the figure represents the process number of the client, and the second line of each process shows the timeline of each op of the server.
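To collect the raw timing data behind such a timeline, profiling can be switched on before starting the server and client. A minimal sketch, assuming the `FLAGS_profile_*` environment variables are the profiling switches:

```bash
# Enable profiling on both sides before launching them (assumed switches).
export FLAGS_profile_server=1   # server side: record op timestamps
export FLAGS_profile_client=1   # client side: record request timestamps
```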
...
...
The inference op of Paddle Serving is implemented based on the Paddle inference lib.
Before deploying the prediction service, you may need to check the input and output of the prediction service or check the resource consumption.
Therefore, a local prediction tool is built into the paddle_serving_app, which is used in the same way as sending a request to the server through the client.
Taking [fit_a_line prediction service](../../examples/C++/fit_a_line) as an example, the following code can be used to run local prediction.
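A minimal sketch of such a local prediction, assuming the `LocalPredictor` API from `paddle_serving_app` and a `uci_housing_model` directory downloaded from the fit_a_line example; the class name, method signature, and feed/fetch names are assumptions to verify against your installed version.

```bash
python3 - <<'EOF'
import numpy as np
from paddle_serving_app.local_predict import LocalPredictor  # assumed API

predictor = LocalPredictor()
predictor.load_model_config("uci_housing_model")  # path to the downloaded model

# One normalized 13-feature sample from the UCI housing dataset.
x = np.array([[0.0137, -0.1136, 0.2553, -0.0692, 0.0335, -0.0345, -0.0898,
               -0.0406, -0.0668, -0.0286, -0.0616, -0.0638, -0.0501]],
             dtype=np.float32)

fetch_map = predictor.predict(feed={"x": x}, fetch=["price"], batch=True)
print(fetch_map)
EOF
```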