Unverified commit 2003fdd3, authored by TeslaZhao, committed by GitHub

Merge pull request #1499 from TeslaZhao/develop

V0.7 Version Update Docs
......@@ -75,7 +75,7 @@ service ImageClassifyService {
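#### 2.2.2 Example configuration
For details on Serving-side configuration, see [Serving configuration](../SERVING_CONFIGURE_CN.md)
For details on Serving-side configuration, see [Serving configuration](../Serving_Configure_CN.md)
The following configuration file chains ReaderOP, ClassifyOP and WriteJsonOP into a single workflow (for the concepts of OP and workflow, see [OP introduction](OP_CN.md) and [DAG introduction](DAG_CN.md))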
......
......@@ -32,7 +32,7 @@ Server端的核心是一个由项目代码编译产生的名称为serving的二
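To help users start the C++ Serving server quickly, besides editing the configuration files yourself and running the serving binary with command-line arguments, we also provide a way to start it through a Python script. The Python script still runs the serving binary underneath, but it does two things automatically: 1. it generates the configuration files; 2. it builds the command line from the parameters to be configured, passes them in, and runs the serving binary.
For more details and examples, see [Detailed guide to C++ Serving configuration and startup](../SERVING_CONFIGURE_CN.md)
For more details and examples, see [Detailed guide to C++ Serving configuration and startup](../Serving_Configure_CN.md)
### 3.2 Synchronous/asynchronous mode
Synchronous mode is simple and direct. It suits cases where model inference time is short and the batch in a single request is already fairly large.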
......
# How to compile PaddleServing
(简体中文|[English](./COMPILE.md))
(简体中文|[English](./Compile_EN.md))
## Compilation environment setup
......
# How to compile PaddleServing
([简体中文](./COMPILE_CN.md)|English)
([简体中文](./Compile_CN.md)|English)
## Compilation environment requirements
......@@ -23,7 +23,7 @@
| libSM | 1.2.2 |
| libXrender | 0.9.10 |
It is recommended to use Docker for compilation. We have prepared the Paddle Serving compilation environment for you, see [this document](DOCKER_IMAGES.md).
It is recommended to use Docker for compilation. We have prepared the Paddle Serving compilation environment for you, see [this document](Docker_Images_EN.md).
## Get Code
......@@ -159,8 +159,7 @@ cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR/ \
-DSERVER=ON ..
make -j10
```
**Note:** After the compilation is successful, you need to set the `SERVING_BIN` path, see the following [Notes](https://github.com/PaddlePaddle/Serving/blob/develop/doc/COMPILE.md#Notes).
**Note:** After the compilation is successful, you need to set the `SERVING_BIN` path, see the following [Notes](Compile_EN.md#Notes).
## Compile Client
......
......@@ -68,7 +68,7 @@ Paddle Serving uses this [Git branching model](http://nvie.com/posts/a-successfu
1. Build and test
Users can build Paddle Serving natively on Linux, see the [BUILD steps](https://github.com/PaddlePaddle/Serving/blob/develop/doc/COMPILE.md).
Users can build Paddle Serving natively on Linux, see the [BUILD steps](Compile_EN.md).
1. Keep pulling
......
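# Guide to the Local Mode of Cube, the Sparse Parameter Indexing Service
(简体中文|[English](./CUBE_LOCAL.md))
(简体中文|[English](./Cube_Local_EN.md))
## Introduction
There are two CTR examples under python/examples: criteo_ctr and criteo_ctr_with_cube. The former saves the entire model during training, including the sparse parameters. The latter cuts the sparse parameters out and saves the model in two parts, the sparse parameters and the dense parameters. In industrial scenarios the scale of sparse parameters is very large, on the order of 10^9, so serving large-scale sparse parameter prediction on a single machine is impractical. We therefore introduce Cube, Baidu's industrial-grade product built on years of work in sparse parameter indexing, to provide a distributed sparse parameter service.
<!--The local mode of Cube is a reduced version of distributed Cube, intended to make experiments and demos convenient for developers. If you need a distributed sparse parameter service, continue with the [Cube sparse parameter indexing service guide](CUBE_LOCAL_CN.md) (under construction) after reading this document.-->
<!--The local mode of Cube is a reduced version of distributed Cube, intended to make experiments and demos convenient for developers. If you need a distributed sparse parameter service, continue with the [Cube sparse parameter indexing service guide](Cube_Local_CN.md) (under construction) after reading this document.-->
This document uses only original models that have not been processed by any compression algorithm. If you need to deploy a quantized model, read the [Guide to quantized storage for Cube sparse parameter indexing](./CUBE_QUANT_CN.md)
This document uses only original models that have not been processed by any compression algorithm. If you need to deploy a quantized model, read the [Guide to quantized storage for Cube sparse parameter indexing](./Cube_Quant_CN.md)
## Example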
......
# Cube: Sparse Parameter Indexing Service (Local Mode)
([简体中文](./CUBE_LOCAL_CN.md)|English)
([简体中文](./Cube_Local_CN.md)|English)
## Overview
There are two CTR examples under python/examples: criteo_ctr and criteo_ctr_with_cube. The former saves the entire model during training, including the sparse parameters. The latter cuts the sparse parameters out and saves the model in two parts, the sparse parameters and the dense parameters. In industrial scenarios the scale of sparse parameters is very large, on the order of 10^9, so it is not practical to serve large-scale sparse parameter prediction on one machine. We therefore introduce Cube, Baidu's industrial-grade product built on years of work in sparse parameter indexing, to provide a distributed sparse parameter service.
The local mode of Cube is a reduced version of distributed Cube, designed to be convenient for developers to use in experiments and demos.
<!--If there is a demand for distributed sparse parameter service, please continue reading [Quantization Storage on Cube Sparse Parameter Indexing](./CUBE_QUANT.md) after reading this document (still developing).-->
<!--If there is a demand for distributed sparse parameter service, please continue reading [Quantization Storage on Cube Sparse Parameter Indexing](./Cube_Quant_EN.md) after reading this document (still developing).-->
This document uses the original model without any compression algorithm. If you need to deploy a quantized model, please read [Quantization Storage on Cube Sparse Parameter Indexing](./CUBE_QUANT.md)
This document uses the original model without any compression algorithm. If you need to deploy a quantized model, please read [Quantization Storage on Cube Sparse Parameter Indexing](./Cube_Quant_EN.md)
## Example
In the directory python/examples/criteo_ctr_with_cube, run
......
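# Guide to Quantized Storage for Cube Sparse Parameter Indexing
(简体中文|[English](./CUBE_QUANT.md))
(简体中文|[English](./Cube_Quant_EN.md))
## Overview
......@@ -9,7 +9,7 @@
## Prerequisites
Please first read the [Guide to the Local Mode of Cube, the Sparse Parameter Indexing Service](./CUBE_LOCAL_CN.md)
Please first read the [Guide to the Local Mode of Cube, the Sparse Parameter Indexing Service](./Cube_Local_CN.md)
## Components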
......
# Quantization Storage on Cube Sparse Parameter Indexing
([简体中文](./CUBE_QUANT_CN.md)|English)
([简体中文](./Cube_Quant_CN.md)|English)
## Overview
......@@ -8,7 +8,7 @@ In our previous article, we know that the sparse parameter is a series of floati
## Precondition
Please Read [Cube: Sparse Parameter Indexing Service (Local Mode)](./CUBE_LOCAL_CN.md)
Please Read [Cube: Sparse Parameter Indexing Service (Local Mode)](./Cube_Local_EN.md)
## Components
......
......@@ -13,7 +13,7 @@
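### Prerequisites
- You need to know how to compile Paddle Serving; see the [compilation guide](./COMPILE.md)
- You need to know how to compile Paddle Serving; see the [compilation guide](./Compile_EN.md)
### Usage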
......
# Docker Images
(简体中文|[English](DOCKER_IMAGES.md))
(简体中文|[English](Docker_Images_EN.md))
This document maintains the list of images provided by Paddle Serving.
......
# Docker Images
([简体中文](DOCKER_IMAGES_CN.md)|English)
([简体中文](Docker_Images_CN.md)|English)
This document maintains a list of docker images provided by Paddle Serving.
......
......@@ -142,7 +142,7 @@ make: *** [all] Error 2
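#### Q: A CXXABI error occurs during use.
This happens because the gcc version used by Python does not match the gcc version required by Serving. For Docker users, we recommend the [Docker container](./RUN_IN_DOCKER_CN.md): the Python version inside the container is adapted to Serving before release, so this kind of error does not occur. In other development environments, first make sure GCC 8.2 is available. If it is not, install it as follows
This happens because the gcc version used by Python does not match the gcc version required by Serving. For Docker users, we recommend the [Docker container](./Run_In_Docker_CN.md): the Python version inside the container is adapted to Serving before release, so this kind of error does not occur. In other development environments, first make sure GCC 8.2 is available. If it is not, install it as follows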
```bash
wget -q https://paddle-ci.gz.bcebos.com/gcc-8.2.0.tar.xz
......@@ -198,7 +198,7 @@ wget https://paddle-serving.bj.bcebos.com/others/centos_ssl.tar && \
(1) Cuda GPU driver: the file name is usually `libcuda.so.$DRIVER_VERSION`. For example, with driver version 440.10.15 the file name is `libcuda.so.440.10.15`.
(2) Cuda and Cudnn shared libraries: the file names are usually `libcudart.so.$CUDA_VERSION` and `libcudnn.so.$CUDNN_VERSION`. For example, Cuda9 is `libcudart.so.9.0` and Cudnn7 is `libcudnn.so.7`. For the Cuda and Cudnn versions that match each Serving version, see the [list of all Serving images](DOCKER_IMAGES_CN.md#%E9%99%84%E5%BD%95%E6%89%80%E6%9C%89%E9%95%9C%E5%83%8F%E5%88%97%E8%A1%A8).
(2) Cuda and Cudnn shared libraries: the file names are usually `libcudart.so.$CUDA_VERSION` and `libcudnn.so.$CUDNN_VERSION`. For example, Cuda9 is `libcudart.so.9.0` and Cudnn7 is `libcudnn.so.7`. For the Cuda and Cudnn versions that match each Serving version, see the [list of all Serving images](Docker_Images_CN.md#%E9%99%84%E5%BD%95%E6%89%80%E6%9C%89%E9%95%9C%E5%83%8F%E5%88%97%E8%A1%A8).
(3) Cuda10.1 and later require TensorRT. For a script that installs the TensorRT files, see [install_trt.sh](../tools/dockerfiles/build_scripts/install_trt.sh).
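The naming pattern described in items (1) and (2) can be written down as a small helper. This is only an illustrative sketch: the function name `expected_library_names` is hypothetical and not part of Paddle Serving, and it only formats file names, it does not check that the files exist.

```python
def expected_library_names(driver_version, cuda_version, cudnn_version):
    """Return the shared-library file names implied by the versions given."""
    return {
        "driver": f"libcuda.so.{driver_version}",    # e.g. libcuda.so.440.10.15
        "cudart": f"libcudart.so.{cuda_version}",    # e.g. libcudart.so.9.0
        "cudnn": f"libcudnn.so.{cudnn_version}",     # e.g. libcudnn.so.7
    }


# Versions taken from the examples in the text above.
names = expected_library_names("440.10.15", "9.0", "7")
for role, file_name in names.items():
    print(role, file_name)
```

These names can then be compared against what is actually present in the library directories of your environment.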
......@@ -232,15 +232,15 @@ InvalidArgumentError: Device id must be less than GPU count, but received id is:
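#### Q: Which image environments does Paddle Serving currently support?
**A:** Currently (0.4.0) only CentOS is supported. See [here](https://github.com/PaddlePaddle/Serving/blob/develop/doc/DOCKER_IMAGES.md) for the full list
**A:** Currently (0.4.0) only CentOS is supported. See [here](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Docker_Images_CN.md) for the full list
#### Q: The GCC version Python was compiled with does not match the version Serving requires
**A:** 1) Use [GPU docker](https://github.com/PaddlePaddle/Serving/blob/develop/doc/RUN_IN_DOCKER.md#gpunvidia-docker) to solve the environment problem; 2) change the gcc version of the Python installed in the anaconda virtual environment, see [changing Python's GCC build environment](https://www.jianshu.com/p/c498b3d86f77)
**A:** 1) Use [GPU docker](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Run_In_Docker_CN.md#gpunvidia-docker) to solve the environment problem; 2) change the gcc version of the Python installed in the anaconda virtual environment, see [changing Python's GCC build environment](https://www.jianshu.com/p/c498b3d86f77)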
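To see which compiler the current Python interpreter was built with, the standard library's `platform.python_compiler()` can be used. The sketch below is a rough diagnostic only; the helper name `python_compiler_matches` is hypothetical, and a matching prefix does not by itself guarantee that the CXXABI problem is gone.

```python
import platform


def python_compiler_matches(required_prefix, compiler=None):
    """Check whether the compiler string that built Python starts with required_prefix."""
    # platform.python_compiler() returns a string such as "GCC 8.2.0"
    # on interpreters built with GCC.
    compiler = platform.python_compiler() if compiler is None else compiler
    return compiler.startswith(required_prefix)


print("Python was built with:", platform.python_compiler())
print("Matches GCC 8.2:", python_compiler_matches("GCC 8.2"))
```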
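#### Q: Does paddle-serving support local offline installation?
**A:** Offline deployment is supported. You need to prepare and install the relevant [dependencies](https://github.com/PaddlePaddle/Serving/blob/develop/doc/COMPILE.md) in advance
**A:** Offline deployment is supported. You need to prepare and install the relevant [dependencies](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Compile_CN.md) in advance
#### Q: What is the difference between the IP addresses 127.0.0.1 and 0.0.0.0 when starting a server in Docker?
**A:** You must set the container's main process to bind to the special 0.0.0.0 "all interfaces" address, otherwise it cannot be reached from outside the container. In Docker, 127.0.0.1 means "this container", not "this machine". If you make an outbound connection to 127.0.0.1 from a container, it returns to the same container; if you bind the server to 127.0.0.1, it cannot receive connections from outside.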
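The binding behaviour in the answer above can be explored with plain sockets. This is a minimal sketch using only the Python standard library, not Paddle Serving code; `can_connect` is a hypothetical helper. On a single machine it can only show that a listener bound to 0.0.0.0 also accepts loopback connections. The failure case (a listener bound to 127.0.0.1 being unreachable from outside the container) needs a second network namespace to reproduce.

```python
import socket


def can_connect(bind_host, connect_host):
    """Bind a listener on bind_host, then try to reach it via connect_host."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        server.bind((bind_host, 0))  # port 0 lets the OS choose a free port
        server.listen(1)
        port = server.getsockname()[1]
        try:
            with socket.create_connection((connect_host, port), timeout=1):
                return True
        except OSError:
            return False
    finally:
        server.close()


# A listener on all interfaces is reachable through loopback as well.
print(can_connect("0.0.0.0", "127.0.0.1"))
# A listener on loopback is reachable only through loopback.
print(can_connect("127.0.0.1", "127.0.0.1"))
```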
......@@ -280,7 +280,7 @@ client.connect(["127.0.0.1:9393"])
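#### Q: How do I use the multi-language client?
**A:** The multi-language client must be used together with the multi-language server. In the current version (0.4.0), on the server side change Server to MultiLangServer (if you start from the command line, just add the --use_multilang argument); on the Python client side change Client to MultiLangClient and drop the load_client_config step. [Java client reference](https://github.com/PaddlePaddle/Serving/blob/develop/doc/JAVA_SDK_CN.md)
**A:** The multi-language client must be used together with the multi-language server. In the current version (0.4.0), on the server side change Server to MultiLangServer (if you start from the command line, just add the --use_multilang argument); on the Python client side change Client to MultiLangClient and drop the load_client_config step. [Java client reference](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Java_SDK_CN.md)
#### Q: How do I use Paddle Serving on Windows?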
......
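# How to use the Go Client of Paddle Serving
(简体中文|[English](./IMDB_GO_CLIENT.md))
(简体中文|[English](./Imdb_GO_Client_EN.md))
This document explains how to use Go as the client language. For the Go client in Paddle Serving, a simple client package is provided at https://github.com/PaddlePaddle/Serving/tree/develop/go/serving_client, which users can import as needed. Below is a simple example of a sentiment analysis task based on the IMDB dataset.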
......
# How to use Go Client of Paddle Serving
([简体中文](./IMDB_GO_CLIENT_CN.md)|English)
([简体中文](./Imdb_GO_Client_CN.md)|English)
This document shows how to use Go as your client language. For the Go client in Paddle Serving, a simple client package is provided at https://github.com/PaddlePaddle/Serving/tree/develop/go/serving_client, which users can import as needed. Here is a simple example of a sentiment analysis task based on the IMDB dataset.
......
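# Install Paddle Serving with Docker
(简体中文|[English](./Install_EN.md))
We **strongly recommend** that you **build** Paddle Serving **inside Docker**; see [How to run PaddleServing in Docker](Run_In_Docker_CN.md). For more images, see the [Docker image list](Docker_Images_CN.md)
**Note**: The default GPU environment of paddlepaddle 2.1 is currently Cuda 10.2, so the GPU Docker sample code is based on Cuda 10.2. Images and pip packages are also provided for other GPU environments. If you use another environment, check carefully and choose the matching version.
**Note**: This project supports only Python3.6/3.7/3.8. All Python/pip operations below must use the correct Python version.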
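Before installing, it can help to confirm that the interpreter behind `python3`/`pip3` is one of the supported versions. A minimal sketch, assuming only the version list stated above; the names `SUPPORTED` and `is_supported` are illustrative:

```python
import sys

# Versions listed as supported in this document.
SUPPORTED = {(3, 6), (3, 7), (3, 8)}


def is_supported(version_info=None):
    """Return True if the (major, minor) version is in the supported set."""
    version_info = sys.version_info if version_info is None else version_info
    return tuple(version_info[:2]) in SUPPORTED


print("Running Python", ".".join(map(str, sys.version_info[:3])))
print("Supported by this release:", is_supported())
```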
```
# Start CPU Docker
docker pull registry.baidubce.com/paddlepaddle/serving:0.6.2-devel
docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:0.6.2-devel bash
docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
```
```
# Start GPU Docker
nvidia-docker pull registry.baidubce.com/paddlepaddle/serving:0.6.2-cuda10.2-cudnn8-devel
nvidia-docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:0.6.2-cuda10.2-cudnn8-devel bash
nvidia-docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
```
Install the required pip dependencies
```
cd Serving
pip3 install -r python/requirements.txt
```
```shell
pip3 install paddle-serving-client==0.6.2
pip3 install paddle-serving-server==0.6.2 # CPU
pip3 install paddle-serving-app==0.6.2
pip3 install paddle-serving-server-gpu==0.6.2.post102 #GPU with CUDA10.2 + TensorRT7
# For other GPU environments, check your environment before choosing which command to run
pip3 install paddle-serving-server-gpu==0.6.2.post101 # GPU with CUDA10.1 + TensorRT6
pip3 install paddle-serving-server-gpu==0.6.2.post11 # GPU with CUDA10.1 + TensorRT7
```
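You may need to use a mirror source in China (for example the Tsinghua mirror; add `-i https://pypi.tuna.tsinghua.edu.cn/simple` to the pip command) to speed up downloads.
If you need packages built from the develop branch, get the download URL from the [latest packages list](Latest_Packages_CN.md) and install with `pip install`. If you want to compile it yourself, see the [Paddle Serving compilation guide](Compile_CN.md)
The paddle-serving-server and paddle-serving-server-gpu packages support Centos 6/7, Ubuntu 16/18 and Windows 10.
The paddle-serving-client and paddle-serving-app packages support Linux and Windows; paddle-serving-client supports only python3.6/3.7/3.8.
**The latest version, 0.6.2, no longer supports Cuda 9.0 and Cuda 10.0, and no longer supports Python 2.7 and 3.5.**
We recommend installing paddle 2.1.0 or later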
```
# For a CPU environment, run
pip3 install paddlepaddle==2.1.0
# For a GPU Cuda10.2 environment, run
pip3 install paddlepaddle-gpu==2.1.0
```
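**Note**: If your Cuda version is not 10.2, do not run the commands above directly. Refer to the [Paddle official documentation: multi-version whl package list](https://www.paddlepaddle.org.cn/documentation/docs/zh/install/Tables.html#whl-release)
Select the URL for your GPU environment and install it. For example, Python3.6 users on Cuda 10.1 should select the URL corresponding to `cp36-cp36m` and `cuda10.1-cudnn7-mkl-gcc8.2-avx-trt6.0.1.5` in the table, copy it, and run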
```
pip3 install https://paddle-wheel.bj.bcebos.com/with-trt/2.1.0-gpu-cuda10.1-cudnn7-mkl-gcc8.2/paddlepaddle_gpu-2.1.0.post101-cp36-cp36m-linux_x86_64.whl
```
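The default `paddlepaddle-gpu==2.1.0` targets Cuda 10.2 and is not built with TensorRT. To use TensorRT with `paddlepaddle-gpu`, find `cuda10.2-cudnn8.0-trt7.1.3` in the multi-version whl package list above and download the package for your Python version
For other environments and Python versions, find the corresponding link in the table and install it with pip.
**Windows 10 users**: see the [Guide to using Paddle Serving on Windows](Windows_Tutorial_CN.md)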
# Install Paddle Serving with Docker
([简体中文](Install_CN.md)|English)
We **highly recommend** you to **run Paddle Serving in Docker**, please visit [Run in Docker](Run_In_Docker_EN.md). See the [document](Docker_Images_EN.md) for more docker images.
**Attention:** Currently, the default GPU environment of paddlepaddle 2.1 is Cuda 10.2, so the sample code for GPU Docker is based on Cuda 10.2. We also provide docker images and whl packages for other GPU environments. If you use another environment, check carefully and select the appropriate version.
**Attention:** in the following, 'python' and 'pip' stand for one of Python 3.6/3.7/3.8.
```
# Run CPU Docker
docker pull registry.baidubce.com/paddlepaddle/serving:0.6.0-devel
docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:0.6.0-devel bash
docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
```
```
# Run GPU Docker
nvidia-docker pull registry.baidubce.com/paddlepaddle/serving:0.6.0-cuda10.2-cudnn8-devel
nvidia-docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:0.6.0-cuda10.2-cudnn8-devel bash
nvidia-docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
```
Install the Python dependencies
```
cd Serving
pip install -r python/requirements.txt
```
```shell
pip install paddle-serving-client==0.6.0
pip install paddle-serving-server==0.6.0 # CPU
pip install paddle-serving-app==0.6.0
pip install paddle-serving-server-gpu==0.6.0.post102 #GPU with CUDA10.2 + TensorRT7
# DO NOT RUN ALL COMMANDS! check your GPU env and select the right one
pip install paddle-serving-server-gpu==0.6.0.post101 # GPU with CUDA10.1 + TensorRT6
pip install paddle-serving-server-gpu==0.6.0.post11 # GPU with CUDA10.1 + TensorRT7
```
You may need to use a domestic mirror source (in China, you can use the Tsinghua mirror source, add `-i https://pypi.tuna.tsinghua.edu.cn/simple` to pip command) to speed up the download.
If you need packages compiled from the develop branch, please download them from the [latest packages list](Latest_Packages_CN.md) and install them with the `pip install` command. If you want to compile by yourself, please refer to [How to compile Paddle Serving?](Compile_EN.md)
Packages of paddle-serving-server and paddle-serving-server-gpu support Centos 6/7, Ubuntu 16/18, Windows 10.
Packages of paddle-serving-client and paddle-serving-app support Linux and Windows, but paddle-serving-client only supports python3.6/3.7/3.8.
**In the latest version, Cuda 9.0 and Cuda 10.0 are no longer supported, and Python2.7/3.5 is no longer supported.**
We recommend installing paddle >= 2.1.0
```
# CPU users, please run
pip install paddlepaddle==2.1.0
# GPU Cuda10.2 please run
pip install paddlepaddle-gpu==2.1.0
```
**Note**: If your Cuda version is not 10.2, please do not execute the above commands directly. Instead, refer to the [Paddle official documentation: multi-version whl package list](https://www.paddlepaddle.org.cn/documentation/docs/en/install/Tables_en.html#multi-version-whl-package-list-release)
Select the URL of the corresponding GPU environment and install it. For example, Python3.6 users on Cuda 10.1 should select the URL corresponding to `cp36-cp36m` and `cuda10.1-cudnn7-mkl-gcc8.2-avx-trt6.0.1.5`, copy it and run
```
pip install https://paddle-wheel.bj.bcebos.com/with-trt/2.1.0-gpu-cuda10.1-cudnn7-mkl-gcc8.2/paddlepaddle_gpu-2.1.0.post101-cp36-cp36m-linux_x86_64.whl
```
The default `paddlepaddle-gpu==2.1.0` targets Cuda 10.2 and is built without TensorRT. If you want to install PaddlePaddle with TensorRT, check the multi-version whl package list mentioned above, find the keyword `cuda10.2-cudnn8.0-trt7.1.3`, and download the package for your Python version.
For other environments and Python versions, please find the corresponding link in the table and install it with pip.
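When picking a row from the whl table, the Python part of the file name (such as `cp36-cp36m`) follows the standard CPython wheel naming. The sketch below derives it from a version number; the helper name `wheel_python_tag` is hypothetical, and it covers CPython only. The rule it encodes is that CPython dropped the `m` ABI suffix starting with 3.8.

```python
import sys


def wheel_python_tag(major, minor):
    """Return the '<python tag>-<abi tag>' part of a CPython wheel file name."""
    tag = f"cp{major}{minor}"
    # The "m" (pymalloc) ABI flag was removed in CPython 3.8.
    abi = tag + "m" if (major, minor) < (3, 8) else tag
    return f"{tag}-{abi}"


print("Look for this tag in the table:", wheel_python_tag(*sys.version_info[:2]))
```

The CUDA/cuDNN/TensorRT part of the row still has to be matched by hand against your GPU environment.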
For **Windows Users**, please read the document [Paddle Serving for Windows Users](Windows_Tutorial_EN.md)
<h2 align="center">Quick Start Example</h2>
This quick start example is mainly for users who already have a model to deploy, and we also provide a model that can be used for deployment. If you want to know how to complete the whole process from offline training to online service, please refer to the AiStudio tutorial above.
# Paddle Serving Client Java SDK
(简体中文|[English](JAVA_SDK.md))
(简体中文|[English](Java_SDK_EN.md))
Paddle Serving provides a Java SDK that lets the client side run prediction in Java. This document explains how to use the Java SDK.
......
# Paddle Serving Client Java SDK
([简体中文](JAVA_SDK_CN.md)|English)
([简体中文](Java_SDK_CN.md)|English)
Paddle Serving provides a Java SDK, which supports prediction on the client side in Java. This document shows how to use the Java SDK.
......
# Description of the Lod Field
(简体中文|[English](LOD.md))
(简体中文|[English](LOD_EN.md))
## Concepts
......
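# Low-Precision Deployment for Paddle Serving
(简体中文|[English](./LOW_PRECISION_DEPLOYMENT.md))
## Low-Precision Deployment for Paddle Serving
(简体中文|[English](./Low_Precision_EN.md))
Low-precision deployment supports int8 and bfloat16 models on Intel CPU, and int8 and float16 models with Nvidia TensorRT.
## Generate a low-precision model through PaddleSlim quantization
### Generate a low-precision model through PaddleSlim quantization
For details, see [PaddleSlim quantization](https://paddleslim.readthedocs.io/zh_CN/latest/tutorials/quant/overview.html)
## Deploy a PaddleSlim Int8 quantized model with TensorRT int8
### Deploy a PaddleSlim Int8 quantized model with TensorRT int8
First download the Resnet50 [PaddleSlim quantized model](https://paddle-inference-dist.bj.bcebos.com/inference_demo/python/resnet50/ResNet50_quant.tar.gz) and convert it to the deployment model format supported by Paddle Serving.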
```
wget https://paddle-inference-dist.bj.bcebos.com/inference_demo/python/resnet50/ResNet50_quant.tar.gz
......@@ -40,7 +41,7 @@ fetch_map = client.predict(feed={"image": img}, fetch=["score"])
print(fetch_map["score"].reshape(-1))
```
## Reference
### Reference
* [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim)
* [Documentation](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_x86_cpu_int8.html) on deploying quantized models with PaddleInference on Intel CPU
* [Documentation](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_trt.html) on deploying quantized models with PaddleInference on NV GPU
# Low-Precision Deployment for Paddle Serving
(English|[简体中文](./LOW_PRECISION_DEPLOYMENT_CN.md))
## Low-Precision Deployment for Paddle Serving
(English|[简体中文](./Low_Precision_CN.md))
Intel CPU supports int8 and bfloat16 models, NVIDIA TensorRT supports int8 and float16 models.
## Obtain the quantized model through PaddleSlim tool
### Obtain the quantized model through PaddleSlim tool
To train low-precision models, please refer to [PaddleSlim](https://paddleslim.readthedocs.io/zh_CN/latest/tutorials/quant/overview.html).
## Deploy the quantized model from PaddleSlim using Paddle Serving with Nvidia TensorRT int8 mode
### Deploy the quantized model from PaddleSlim using Paddle Serving with Nvidia TensorRT int8 mode
First, download the [Resnet50 int8 model](https://paddle-inference-dist.bj.bcebos.com/inference_demo/python/resnet50/ResNet50_quant.tar.gz) and convert it to the model format that Paddle Serving can deploy.
```
......@@ -41,7 +42,7 @@ fetch_map = client.predict(feed={"image": img}, fetch=["save_infer_model/scale_0
print(fetch_map["save_infer_model/scale_0.tmp_0"].reshape(-1))
```
## Reference
### Reference
* [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim)
* [Deploy the quantized model Using Paddle Inference on Intel CPU](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_x86_cpu_int8.html)
* [Deploy the quantized model Using Paddle Inference on Nvidia GPU](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_trt.html)
# Model Zoo
([English](./Model_Zoo.md)|简体中文)
([English](./Model_Zoo_EN.md)|简体中文)
This page lists the pretrained models currently supported by Paddle Serving, with download links
To contribute a new model to Paddle Serving, submit a PR via [pull request](https://github.com/PaddlePaddle/Serving/pulls)
......
# Pipeline Serving
(简体中文|[English](PIPELINE_SERVING.md))
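- [Architecture design](PIPELINE_SERVING_CN.md#1架构设计)
- [Detailed design](PIPELINE_SERVING_CN.md#2详细设计)
- [Typical examples](PIPELINE_SERVING_CN.md#3典型示例)
- [Advanced usage](PIPELINE_SERVING_CN.md#4高阶用法)
- [Log tracing](PIPELINE_SERVING_CN.md#5日志追踪)
- [Performance analysis and optimization](PIPELINE_SERVING_CN.md#6性能分析与优化)
(简体中文|[English](Pipeline_Design_EN.md))
- [Architecture design](Pipeline_Design_CN.md#1架构设计)
- [Detailed design](Pipeline_Design_CN.md#2详细设计)
- [Typical examples](Pipeline_Design_CN.md#3典型示例)
- [Advanced usage](Pipeline_Design_CN.md#4高阶用法)
- [Log tracing](Pipeline_Design_CN.md#5日志追踪)
- [Performance analysis and optimization](Pipeline_Design_CN.md#6性能分析与优化)
In many deep learning frameworks, Serving is usually used for one-click deployment of a single model. In large-scale industrial AI production, end-to-end deep learning models cannot yet solve every problem, and combining several deep learning models remains the usual way to solve real-world problems. However, multi-model applications are complex to design. To reduce development and maintenance effort while keeping the service available, serial or simple parallel approaches are typically adopted, but in that case throughput usually only reaches a usable level and GPU utilization is low.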
......
# Pipeline Serving
([简体中文](PIPELINE_SERVING_CN.md)|English)
- [Architecture Design](PIPELINE_SERVING.md#1architecture-design)
- [Detailed Design](PIPELINE_SERVING.md#2detailed-design)
- [Classic Examples](PIPELINE_SERVING.md#3classic-examples)
- [Advanced Usages](PIPELINE_SERVING.md#4advanced-usages)
- [Log Tracing](PIPELINE_SERVING.md#5log-tracing)
- [Performance Analysis And Optimization](PIPELINE_SERVING.md#6performance-analysis-and-optimization)
([简体中文](Pipeline_Design_CN.md)|English)
- [Architecture Design](Pipeline_Design_EN.md#1architecture-design)
- [Detailed Design](Pipeline_Design_EN.md#2detailed-design)
- [Classic Examples](Pipeline_Design_EN.md#3classic-examples)
- [Advanced Usages](Pipeline_Design_EN.md#4advanced-usages)
- [Log Tracing](Pipeline_Design_EN.md#5log-tracing)
- [Performance Analysis And Optimization](Pipeline_Design_EN.md#6performance-analysis-and-optimization)
In many deep learning frameworks, Serving is usually used for the deployment of a single model. In industrial AI, however, end-to-end deep learning models cannot yet solve all problems, and it is usually necessary to combine multiple deep learning models to solve practical problems. The design of multi-model applications is complicated. In order to reduce the difficulty of development and maintenance and to ensure the availability of services, serial or simple parallel methods are usually used. In that case the throughput generally only reaches a usable level and the GPU utilization rate is low.
......
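## Paddle Serving Quick Start Example
([English](./Quick_Start_EN.md)|简体中文)
This quick start example is mainly for users who already have a model to deploy, and we also provide a model that can be used for deployment. If you want to know how to go through the whole process from offline training to online service, see the AiStudio tutorial mentioned earlier.
<h3 align="center">Boston House Price Prediction</h3>
Enter the Serving git directory and go to the `fit_a_line` example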
``` shell
cd Serving/python/examples/fit_a_line
sh get_data.sh
```
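Paddle Serving provides HTTP-based and RPC-based services
<h3 align="center">RPC service</h3>
Users can also start an RPC service with `paddle_serving_server.serve`. Although users need to do some development against Paddle Serving's Python client API, an RPC service is usually faster than an HTTP service. Note that `--name` is not specified here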
``` shell
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292
```
<center>
| Argument | Type | Default | Description |
| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
| `thread` | int | `2` | Number of brpc service thread |
| `op_num` | int[]| `0` | Thread Number for each model in asynchronous mode |
| `op_max_batch` | int[]| `32` | Batch Number for each model in asynchronous mode |
| `gpu_ids` | str[]| `"-1"` | Gpu card id for each model |
| `port` | int | `9292` | Exposed port of current service to users |
| `model` | str[]| `""` | Path of paddle model directory to be served |
| `mem_optim_off` | - | - | Disable memory / graphic memory optimization |
| `ir_optim` | bool | False | Enable analysis and optimization of calculation graph |
| `use_mkl` (Only for cpu version) | - | - | Run inference with MKL |
| `use_trt` (Only for trt version) | - | - | Run inference with TensorRT |
| `use_lite` (Only for Intel x86 CPU or ARM CPU) | - | - | Run PaddleLite inference |
| `use_xpu` | - | - | Run PaddleLite inference with Baidu Kunlun XPU |
| `precision` | str | FP32 | Precision Mode, support FP32, FP16, INT8 |
| `use_calib` | bool | False | Use TRT int8 calibration |
| `gpu_multi_stream` | bool | False | EnableGpuMultiStream to get larger QPS |
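#### Notes on asynchronous mode
Asynchronous mode suits two cases: 1. the number of requests is very large; 2. several models are chained and you want to set the concurrency of each model separately.
Asynchronous mode helps raise the throughput (QPS) of the service, but the latency of a single request increases slightly.
In asynchronous mode, each model starts the N threads you specify, and each thread holds one model instance. In other words, each model is equivalent to a thread pool of N threads that take tasks from the pool's task queue and execute them.
In asynchronous mode, each RPC Server thread is only responsible for putting requests into the task queue of the model thread pool. Once a task has been executed, the completed task is taken back out of the task queue.
In the table above, --thread 10 sets the number of RPC Server threads (default 2), and --op_num sets the number of threads N in each model's thread pool (default 0, which means asynchronous mode is not used).
--op_max_batch sets the batch size for each model (default 32). It takes effect only when --op_num is not 0.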
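The thread-pool behaviour described above can be illustrated with a toy model in plain Python. This is a conceptual sketch only, not the Serving implementation: `ModelThreadPool` and `predict_fn` are hypothetical names, the "model" is a simple function, and batching (`--op_max_batch`) is not modelled.

```python
import queue
import threading


class ModelThreadPool:
    """Toy stand-in for one model's thread pool: N workers share one task queue."""

    def __init__(self, predict_fn, num_threads):
        self.tasks = queue.Queue()
        self.workers = [
            threading.Thread(target=self._run, args=(predict_fn,), daemon=True)
            for _ in range(num_threads)
        ]
        for worker in self.workers:
            worker.start()

    def _run(self, predict_fn):
        # Each worker plays the role of one model instance.
        while True:
            request, done, result = self.tasks.get()
            result.append(predict_fn(request))
            done.set()
            self.tasks.task_done()

    def submit(self, request):
        """Called by an RPC-server thread: enqueue the request, return a handle."""
        done, result = threading.Event(), []
        self.tasks.put((request, done, result))
        return done, result


# Four workers, comparable in spirit to --op_num 4 for one model.
pool = ModelThreadPool(predict_fn=lambda x: x * 2, num_threads=4)
handles = [pool.submit(i) for i in range(8)]

outputs = []
for done, result in handles:
    done.wait()  # the submitting side waits until its task is finished
    outputs.append(result[0])
print(outputs)
```

The submitting threads do no inference work themselves, which is why the number of RPC threads and the number of model threads can be tuned independently.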
#### When one model should be deployed on multiple GPU cards
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --gpu_ids 0,1,2
#### When one service deploys two models
python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292
#### When one service contains two models and each model needs multiple GPU cards
python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 --gpu_ids 0,1 1,2
#### When one service contains two models, each needs multiple GPU cards, and asynchronous mode should give each model a different concurrency
python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 --gpu_ids 0,1 1,2 --op_num 4 8
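When these commands are generated from a script, the per-model arguments have to stay aligned: one `--gpu_ids` entry and one `--op_num` entry per model, in the same order as `--model`. The helper below is a hypothetical sketch that only assembles the command line from the flags shown above; it does not start the server.

```python
def build_serve_command(models, thread, port, gpu_ids=None, op_num=None):
    """Assemble the paddle_serving_server.serve command line shown above."""
    cmd = ["python3", "-m", "paddle_serving_server.serve",
           "--model", *models,
           "--thread", str(thread),
           "--port", str(port)]
    if gpu_ids:
        cmd += ["--gpu_ids", *gpu_ids]           # one entry per model, e.g. "0,1"
    if op_num:
        cmd += ["--op_num", *map(str, op_num)]   # one entry per model
    return cmd


cmd = build_serve_command(
    models=["uci_housing_model_1", "uci_housing_model_2"],
    thread=10,
    port=9292,
    gpu_ids=["0,1", "1,2"],
    op_num=[4, 8],
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` if you want to launch the service from Python.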
</center>
``` python
# A user can visit rpc service through paddle_serving_client API
from paddle_serving_client import Client
import numpy as np
client = Client()
client.load_client_config("uci_housing_client/serving_client_conf.prototxt")
client.connect(["127.0.0.1:9292"])
data = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727,
-0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]
fetch_map = client.predict(feed={"x": np.array(data).reshape(1,13,1)}, fetch=["price"])
print(fetch_map)
```
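Here the `client.predict` function takes two arguments. `feed` is a `python dict` of model input variable alias names and values. `fetch` specifies the prediction variables to be returned from the server. In this example, the names `"x"` and `"price"` were assigned when the servable model was saved during training.
<h3 align="center">HTTP service</h3>
Users can also put the data format processing logic on the server side, so the service can be accessed directly with curl. See the following example, in the directory `python/examples/fit_a_line`.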
```
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --name uci
```
Client input
```
curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"x": [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]}], "fetch":["price"]}' http://127.0.0.1:9292/uci/prediction
```
Response
```
{"result":{"price":[[18.901151657104492]]}}
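```
<h3 align="center">Pipeline service</h3>
Paddle Serving provides industry-leading multi-model chaining services that strongly support the real production scenarios of major companies. See the [OCR text recognition example](python/examples/pipeline/ocr), in the directory `python/examples/pipeline/ocr`
First we fetch the two models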
```
python3 -m paddle_serving_app.package --get_model ocr_rec
tar -xzvf ocr_rec.tar.gz
python3 -m paddle_serving_app.package --get_model ocr_det
tar -xzvf ocr_det.tar.gz
```
Then start the server program, serving the two chained models as one service.
```
python3 web_service.py
```
Finally, send a request over HTTP
```
python3 pipeline_http_client.py
```
RPC is also supported
```
python3 pipeline_rpc_client.py
```
Output
```
{'err_no': 0, 'err_msg': '', 'key': ['res'], 'value': ["['土地整治与土壤修复研究中心', '华南农业大学1素图']"]}
```
## Paddle Serving Quick Start Examples
(English|[简体中文](./Quick_Start_CN.md))
This quick start example is mainly for users who already have a model to deploy, and we also provide a model that can be used for deployment. If you want to know how to complete the whole process from offline training to online service, please refer to the AiStudio tutorial above.
### Boston House Price Prediction model
Go to the Serving git directory, then change to the `fit_a_line` example directory
``` shell
cd Serving/python/examples/fit_a_line
sh get_data.sh
```
Paddle Serving provides HTTP and RPC based service for users to access
### RPC service
A user can also start an RPC service with `paddle_serving_server.serve`. An RPC service is usually faster than an HTTP service, although a user needs to do some coding based on Paddle Serving's python client API. Note that we do not specify `--name` here.
``` shell
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292
```
<center>
| Argument | Type | Default | Description |
| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
| `thread` | int | `4` | Concurrency of current service |
| `port` | int | `9292` | Exposed port of current service to users |
| `model` | str | `""` | Path of paddle model directory to be served |
| `mem_optim_off` | - | - | Disable memory / graphic memory optimization |
| `ir_optim` | bool | False | Enable analysis and optimization of calculation graph |
| `use_mkl` (Only for cpu version) | - | - | Run inference with MKL |
| `use_trt` (Only for trt version) | - | - | Run inference with TensorRT |
| `use_lite` (Only for Intel x86 CPU or ARM CPU) | - | - | Run PaddleLite inference |
| `use_xpu` | - | - | Run PaddleLite inference with Baidu Kunlun XPU |
| `precision` | str | FP32 | Precision Mode, support FP32, FP16, INT8 |
| `use_calib` | bool | False | Only for deployment with TensorRT |
</center>
```python
# A user can visit rpc service through paddle_serving_client API
from paddle_serving_client import Client
import numpy as np
client = Client()
client.load_client_config("uci_housing_client/serving_client_conf.prototxt")
client.connect(["127.0.0.1:9292"])
data = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727,
-0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]
fetch_map = client.predict(feed={"x": np.array(data).reshape(1,13,1)}, fetch=["price"])
print(fetch_map)
```
Here, `client.predict` function has two arguments. `feed` is a `python dict` with model input variable alias name and values. `fetch` assigns the prediction variables to be returned from servers. In the example, the name of `"x"` and `"price"` are assigned when the servable model is saved during training.
### WEB service
Users can also put the data format processing logic on the server side, so that they can directly use curl to access the service, refer to the following case whose path is `python/examples/fit_a_line`
```
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --name uci
```
for client side,
```
curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"x": [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]}], "fetch":["price"]}' http://127.0.0.1:9292/uci/prediction
```
the response is
```
{"result":{"price":[[18.901151657104492]]}}
```
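The same request can be sent without curl. The sketch below builds the identical JSON body with the Python standard library; the URL and field names are those of the curl command above, while the helper name `build_prediction_request` is hypothetical. Actually sending the request is left commented out because it requires the server started above to be running.

```python
import json
import urllib.request


def build_prediction_request(url, features, fetch):
    """Build a POST request with the same JSON body as the curl example."""
    body = json.dumps({"feed": [{"x": features}], "fetch": fetch}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


features = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727,
            -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]
request = build_prediction_request(
    "http://127.0.0.1:9292/uci/prediction", features, ["price"])

# Requires the HTTP service from this section to be running:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```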
<h3 align="center">Pipeline Service</h3>
Paddle Serving provides industry-leading multi-model tandem services, which strongly supports the actual operating business scenarios of major companies, please refer to [OCR word recognition](./python/examples/pipeline/ocr).
we get two models
```
python3 -m paddle_serving_app.package --get_model ocr_rec
tar -xzvf ocr_rec.tar.gz
python3 -m paddle_serving_app.package --get_model ocr_det
tar -xzvf ocr_det.tar.gz
```
then we start server side, launch two models as one standalone web service
```
python3 web_service.py
```
http request
```
python3 pipeline_http_client.py
```
grpc request
```
python3 pipeline_rpc_client.py
```
output
```
{'err_no': 0, 'err_msg': '', 'key': ['res'], 'value': ["['土地整治与土壤修复研究中心', '华南农业大学1素图']"]}
```
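In the sample output, each entry of `value` is a string that looks like a Python list literal. Assuming that format holds (it is inferred from the sample above, not from a documented contract), the recognised texts can be recovered with `ast.literal_eval`. The function name `parse_pipeline_response` is hypothetical.

```python
import ast

# The sample output shown above.
response = {'err_no': 0, 'err_msg': '', 'key': ['res'],
            'value': ["['土地整治与土壤修复研究中心', '华南农业大学1素图']"]}


def parse_pipeline_response(resp):
    """Map each key to its value, decoded from its Python-literal string form."""
    if resp["err_no"] != 0:
        raise RuntimeError(resp["err_msg"])
    return {key: ast.literal_eval(value)
            for key, value in zip(resp["key"], resp["value"])}


texts = parse_pipeline_response(response)["res"]
for text in texts:
    print(text)
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for data that comes back from a service.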
\ No newline at end of file
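# How to run PaddleServing in Docker
(简体中文|[English](RUN_IN_DOCKER.md))
(简体中文|[English](Run_In_Docker_EN.md))
One of the biggest benefits of Docker is portability: it can be deployed on many operating systems and mainstream cloud platforms. The Paddle Serving Docker image can be deployed on Linux, Mac and Windows.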
......@@ -14,7 +14,7 @@ Docker(GPU版本需要在GPU机器上安装nvidia-docker)
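### Get the image
Refer to [this document](DOCKER_IMAGES_CN.md) to get an image:
Refer to [this document](Docker_Images_CN.md) to get an image:
Taking the CPU compilation image as an example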
......@@ -59,9 +59,9 @@ docker exec -it test bash
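### Install PaddleServing
Follow the instructions on the homepage to download the pip package for the corresponding version. [Latest package collection](LATEST_PACKAGES.md)
Follow the instructions on the homepage to download the pip package for the corresponding version. [Latest package collection](Latest_Packages_CN.md)
## Notes
- Runtime images cannot be used for development or compilation. To compile from source, see [How to compile PaddleServing](COMPILE.md)
- Runtime images cannot be used for development or compilation. To compile from source, see [How to compile PaddleServing](Compile_CN.md)
- Because the Cuda10 and Cuda9 environments are restricted by the GCC version, the CPU version of `paddle_serving_server` cannot run alongside them. To also use the CPU version of `paddle_serving_server` in a GPU environment, choose an image for Cuda10.1, Cuda10.2 or Cuda11.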
# How to run PaddleServing in Docker
([简体中文](RUN_IN_DOCKER_CN.md)|English)
([简体中文](Run_In_Docker_CN.md)|English)
One of the biggest benefits of Docker is portability: containers can be deployed on multiple operating systems and mainstream cloud computing platforms. The Paddle Serving Docker image can be deployed on Linux, Mac and Windows platforms.
......@@ -14,7 +14,7 @@ This document takes Python2 as an example to show how to run Paddle Serving in d
### Get docker image
Refer to [this document](DOCKER_IMAGES.md) to get a docker image:
Refer to [this document](Docker_Images_EN.md) to get a docker image:
```shell
docker pull registry.baidubce.com/paddlepaddle/serving:latest-devel
......@@ -41,7 +41,7 @@ The GPU version is basically the same as the CPU version, with only some differe
### Get docker image
Refer to [this document](DOCKER_IMAGES.md) to get a docker image. The following is an example of a `cuda10.2-cudnn8` image:
Refer to [this document](Docker_Images_EN.md) to get a docker image. The following is an example of a `cuda10.2-cudnn8` image:
```shell
docker pull registry.baidubce.com/paddlepaddle/serving:latest-cuda10.2-cudnn8-devel
......@@ -67,9 +67,9 @@ The `-p` option is to map the `9292` port of the container to the `9292` port of
The image comes with `paddle_serving_server_gpu`, `paddle_serving_client`, and `paddle_serving_app` at the version matching the image tag. If you do not need to change the version, you can use them directly, which suits environments without external network access.
If you need to change the version, please refer to the instructions on the homepage to download the pip package of the corresponding version. [LATEST_PACKAGES](./LATEST_PACKAGES.md)
If you need to change the version, please refer to the instructions on the homepage to download the pip package of the corresponding version. [LATEST_PACKAGES](./Latest_Packages_CN.md)
## Precautions
- Runtime images cannot be used for compilation. If you want to compile from source, refer to [COMPILE](COMPILE.md).
- Runtime images cannot be used for compilation. If you want to compile from source, refer to [COMPILE](Compile_EN.md).
- If you use Cuda9 or Cuda10 docker images, you cannot use the CPU version of `paddle_serving_server` at the same time because of the gcc version limitation. If you want to use both in one docker image, choose an image for Cuda10.1, Cuda10.2 or Cuda11.
# Deploying Paddle Serving on Baidu Kunlun chips
(简体中文|[English](./BAIDU_KUNLUN_XPU_SERVING.md))
## Deploying Paddle Serving on Baidu Kunlun chips
(简体中文|[English](./Run_On_XPU_EN.md))
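Paddle Serving supports inference deployment on Baidu Kunlun chips. Deployment is currently supported on servers that pair Baidu Kunlun chips with ARM CPUs (such as Phytium FT-2000+/64) or with Intel CPUs; support for other heterogeneous hardware servers will be added later.
# Compilation and installation
Refer to [this document](COMPILE_CN.md) to set up the basic environment. The following uses a Phytium FT-2000+/64 machine as an example.
## Compilation
## Compilation and installation
Refer to [this document](Compile_CN.md) to set up the basic environment. The following uses a Phytium FT-2000+/64 machine as an example.
### Compilation
* Compile the server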
```
cd Serving
......@@ -50,23 +50,23 @@ cmake -DPYTHON_INCLUDE_DIR=/usr/include/python3.7m/ \
make -j10
```
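## Install the wheel package
### Install the wheel package
After the compilation steps above, whl packages are generated in `$build_dir/python/dist` under each build directory; install each of them. For example, the server step generates its whl package under server-build-arm/python/dist, which can be installed with ```pip install -U xxx.whl```.
# Request parameters
## Request parameters
To support arm+xpu service deployment and use Paddle-Lite acceleration, the following parameters are required.
| Parameter | Description | Remarks |
| :------- | :-------------------------- | :--------------------------------------------------------------- |
| use_lite | Use the Paddle-Lite engine | Uses Paddle-Lite CPU inference |
| use_xpu | Use Baidu Kunlun for inference | Must be used together with use_lite |
| ir_optim | Enable Paddle-Lite computation subgraph optimization | See [Paddle-Lite](https://github.com/PaddlePaddle/Paddle-Lite) |
# Deployment example
## Download the model
## Deployment example
### Download the model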
```
wget --no-check-certificate https://paddle-serving.bj.bcebos.com/uci_housing.tar.gz
tar -xzf uci_housing.tar.gz
```
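## Start the RPC service
### Start the RPC service
There are three main startup configurations:
* Deploy with cpu+xpu, using Paddle-Lite xpu optimization and acceleration;
* Deploy with cpu only, using Paddle-Lite optimization and acceleration;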
......@@ -86,7 +86,7 @@ python3 -m paddle_serving_server.serve --model uci_housing_model --thread 6 --po
```
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 6 --port 9292
```
## Client call
### Client call
```
from paddle_serving_client import Client
import numpy as np
......@@ -98,9 +98,9 @@ data = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727,
fetch_map = client.predict(feed={"x": np.array(data).reshape(1,13,1)}, fetch=["price"])
print(fetch_map)
```
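# Other notes
## Other notes
## Model examples and notes
### Model examples and notes
The following examples are provided; other models can be adapted from them.
| Example name | Example link |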
| :--------- | :---------------------------------------------------------- |
......@@ -108,6 +108,7 @@ print(fetch_map)
| resnet | [resnet_v2_50_xpu](../python/examples/xpu/resnet_v2_50_xpu) |
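Note: for the list of models supported on Kunlun chips, see [this link](https://paddlelite.paddlepaddle.org.cn/introduction/support_model_list.html). Adaptation varies between models and some may not be supported. If you run into problems during deployment, please open a [Github issue](https://github.com/PaddlePaddle/Serving/issues) and we will follow up promptly.
## Kunlun chip reference materials
### Kunlun chip reference materials
* [Running PaddlePaddle on Kunlun XPU chips](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/xpu_docs/index_cn.html)
* [Inference deployment on Baidu XPU with PaddleLite](https://paddlelite.paddlepaddle.org.cn/demo_guides/baidu_xpu.html)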
# Paddle Serving Using Baidu Kunlun Chips
(English|[简体中文](./BAIDU_KUNLUN_XPU_SERVING_CN.md))
## Paddle Serving Using Baidu Kunlun Chips
(English|[简体中文](./Run_On_XPU_CN.md))
Paddle Serving supports deployment on Baidu Kunlun chips. Currently, it supports deployment on ARM CPU servers with Baidu Kunlun chips
(such as Phytium FT-2000+/64) or on Intel CPU servers with Baidu Kunlun chips. We will improve
the deployment capability on other heterogeneous hardware servers in the future.
# Compilation and installation
Refer to the [compile](COMPILE.md) document to set up the compilation environment. The following is based on the Phytium FT-2000+/64 platform.
## Compilation
## Compilation and installation
Refer to the [compile](Compile_EN.md) document to set up the compilation environment. The following is based on the Phytium FT-2000+/64 platform.
### Compilation
* Compile the Serving Server
```
cd Serving
......@@ -52,11 +53,11 @@ cmake -DPYTHON_INCLUDE_DIR=/usr/include/python3.7m/ \
make -j10
```
## Install the wheel package
### Install the wheel package
After the compilation stages above, the whl package is generated in ```python/dist/``` under each build directory.
For example, after the server compilation step, the whl package is produced under the server-build-arm/python/dist directory, and you can run ```pip install -U python/dist/*.whl``` to install it.
# Request parameters description
## Request parameters description
To deploy a serving
service on an ARM server with Baidu Kunlun xpu chips and use the acceleration capability of Paddle-Lite, specify the following parameters during deployment.
| Parameter | Description | Remarks |
......@@ -64,13 +65,13 @@ In order to deploy serving
| use_lite | use the Paddle-Lite engine | uses the inference capability of Paddle-Lite |
| use_xpu | use Baidu Kunlun for inference | must be used together with the use_lite option |
| ir_optim | enable graph optimization | refer to [Paddle-Lite](https://github.com/PaddlePaddle/Paddle-Lite) |
# Deployment examples
## Download the model
## Deployment examples
### Download the model
```
wget --no-check-certificate https://paddle-serving.bj.bcebos.com/uci_housing.tar.gz
tar -xzf uci_housing.tar.gz
```
## Start RPC service
### Start RPC service
There are mainly three deployment methods:
* deploy on the cpu server with Baidu xpu using the acceleration capability of Paddle-Lite and xpu;
* deploy on the cpu server standalone with Paddle-Lite;
......@@ -90,7 +91,7 @@ Start the rpc service, deploying on cpu server.
```
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 6 --port 9292
```
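The deployment methods differ only in which of the parameters from the table above are enabled. The sketch below assembles the start command for each case. It rests on an assumption: that `use_lite`, `use_xpu` and `ir_optim` are passed to `paddle_serving_server.serve` as command-line flags of the same name. `build_serve_command` is a hypothetical helper, not part of Paddle Serving.

```python
def build_serve_command(model, port=9292, thread=6,
                        use_lite=False, use_xpu=False, ir_optim=False):
    """Return the argv list for starting the RPC service.

    Assumption: each table parameter maps to a flag with the same name.
    """
    # The parameter table states that use_xpu must be used with use_lite.
    if use_xpu and not use_lite:
        raise ValueError("use_xpu must be combined with use_lite")
    cmd = ["python3", "-m", "paddle_serving_server.serve",
           "--model", model, "--thread", str(thread), "--port", str(port)]
    if use_lite:
        cmd.append("--use_lite")
    if use_xpu:
        cmd.append("--use_xpu")
    if ir_optim:
        cmd.append("--ir_optim")
    return cmd


# cpu + xpu with Paddle-Lite, cpu with Paddle-Lite, and plain cpu.
for options in (
    {"use_lite": True, "use_xpu": True, "ir_optim": True},
    {"use_lite": True, "ir_optim": True},
    {},
):
    print(" ".join(build_serve_command("uci_housing_model", **options)))
```

The returned list can be handed to `subprocess.run` as-is; building a list rather than a string avoids shell quoting issues.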
## Client call
### Client call
```
from paddle_serving_client import Client
import numpy as np
......@@ -102,8 +103,8 @@ data = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727,
fetch_map = client.predict(feed={"x": np.array(data).reshape(1,13,1)}, fetch=["price"])
print(fetch_map)
```
# Others
## Model example and explanation
## Others
### Model example and explanation
Some examples are provided below, and other models can be modified with reference to these examples.
| sample name | sample links |
......@@ -113,6 +114,6 @@ Some examples are provided below, and other models can be modifed with reference
Note: for the list of supported models, refer to [doc](https://paddlelite.paddlepaddle.org.cn/introduction/support_model_list.html). Adaptation differs between models, and some cases may be unsupported. If you have any problem, please submit a [Github issue](https://github.com/PaddlePaddle/Serving/issues), and we will follow up promptly.
## Kunlun chip related reference materials
### Kunlun chip related reference materials
* [PaddlePaddle on Baidu Kunlun xpu chips](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/xpu_docs/index_cn.html)
* [Deployment on Baidu Kunlun xpu chips using PaddleLite](https://paddlelite.paddlepaddle.org.cn/demo_guides/baidu_xpu.html)
# How to save a model for Paddle Serving?
(简体中文|[English](./SAVE.md))
(简体中文|[English](./Save_EN.md))
## Export from saved model files
If you have already saved an inference model with Paddle's `save_inference_model` interface, you can convert it with the built-in module `paddle_serving_client.convert` provided by Paddle Serving.
......
# How to save a servable model of Paddle Serving?
([简体中文](./SAVE_CN.md)|English)
([简体中文](./Save_CN.md)|English)
## Export from saved model files
......
......@@ -63,7 +63,7 @@ ee59a3dd4806 registry.baidubce.com/serving_dev/serving-runtime:cpu-py36
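### Step 1: Start the Serving service
We again use the [Uci house price prediction](../python/examples/fit_a_line) service as the example. The image-building process is omitted here; for details, see [Deploying Paddle Serving on a Kubernetes cluster](./PADDLE_SERVING_ON_KUBERNETES.md)
We again use the [Uci house price prediction](../python/examples/fit_a_line) service as the example. The image-building process is omitted here; for details, see [Deploying Paddle Serving on a Kubernetes cluster](./Run_On_Kubernetes.md)
Here we run directly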
```
......
# Serving Configuration
(简体中文|[English](SERVING_CONFIGURE.md))
(简体中文|[English](Serving_Configure_EN.md))
## Overview
......
# Serving Configuration
([简体中文](SERVING_CONFIGURE_CN.md)|English)
([简体中文](Serving_Configure_CN.md)|English)
## Overview
......@@ -12,7 +12,7 @@ This guide focuses on Paddle C++ Serving and Python Pipeline configuration:
## Model Configuration
The model configuration is generated when a PaddleServing model is converted and is named serving_client_conf.prototxt/serving_server_conf.prototxt. It specifies the input/output information so that users can fill in parameters easily. The model configuration file should not be modified. See the [Saving guide](SAVE.md) for model conversion. The model configuration file must conform to `core/configure/proto/general_model_config.proto`.
The model configuration is generated when a PaddleServing model is converted and is named serving_client_conf.prototxt/serving_server_conf.prototxt. It specifies the input/output information so that users can fill in parameters easily. The model configuration file should not be modified. See the [Saving guide](Save_EN.md) for model conversion. The model configuration file must conform to `core/configure/proto/general_model_config.proto`.
Example:
......@@ -39,7 +39,7 @@ fetch_var {
- fetch_var: model output
- name: node name
- alias_name: alias name
- is_lod_tensor: LoD tensor; refer to [Lod Introduction](LOD.md)
- is_lod_tensor: LoD tensor; refer to [Lod Introduction](LOD_EN.md)
- feed_type/fetch_type: data type
|feed_type|Type|
......
# Paddle Serving Design Document
(简体中文|[English](./DESIGN_DOC.md))
(简体中文|[English](./Serving_Design_EN.md))
## 1. Design Objectives
......@@ -55,15 +55,15 @@ Paddle Serving从做顶层设计时考虑到不同团队在工业级场景中会
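> Cross-platform operation
Cross-platform means independence from both the operating system and the hardware environment: an application developed under one operating system can still run under another. The design therefore has to ensure that the development languages and components are cross-platform, and also account for differences in how compilers on different systems interpret the code.
Docker is an open source application container engine that lets developers package their applications and dependencies into a portable container and then publish it to any popular Linux or Windows machine. We have packaged the Paddle Serving framework into a variety of Docker images; see [Docker images](DOCKER_IMAGES_CN.md) for the list and choose an image according to your usage scenario. To make Docker easier to use, we provide the help document [How to run PaddleServing in Docker](RUN_IN_DOCKER_CN.md). Currently, the Python webservice mode can be deployed and run natively on both Linux and Windows; see [Paddle Serving guide for Windows](WINDOWS_TUTORIAL_CN.md)
Docker is an open source application container engine that lets developers package their applications and dependencies into a portable container and then publish it to any popular Linux or Windows machine. We have packaged the Paddle Serving framework into a variety of Docker images; see [Docker images](Docker_Images_CN.md) for the list and choose an image according to your usage scenario. To make Docker easier to use, we provide the help document [How to run PaddleServing in Docker](Run_In_Docker_CN.md). Currently, the Python webservice mode can be deployed and run natively on both Linux and Windows; see [Paddle Serving guide for Windows](Windows_Tutorial_CN.md)
> SDKs in multiple development languages
Paddle Serving provides SDKs in 4 development languages: Python, C++, Java and Golang. The Golang SDK is under construction; interested open source developers are welcome to submit PRs.
Paddle Serving provides SDKs in 3 development languages: Python, C++ and Java. The Golang SDK is under construction; interested open source developers are welcome to submit PRs.
+ Python: see the client examples under python/examples or the web service example in section 4.2
+ C++: see [Writing a prediction service from scratch](CREATING.md)
+ Java: see [Paddle Serving Client Java SDK](JAVA_SDK_CN.md)
+ Golang: see [How to use the Go Client in Paddle Serving](deprecated/IMDB_GO_CLIENT_CN.md)
+ C++: see [Writing a prediction service from scratch](C++_Serving/Creat_C++Serving_CN.md)
+ Java: see [Paddle Serving Client Java SDK](Java_SDK_CN.md)
> Support for multiple hardware devices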
......@@ -76,7 +76,7 @@ Paddle Serving提供了4种开发语言SDK,包括Python、C++、Java、Golang
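Taking the IMDB review sentiment analysis task as an example, the full Paddle Serving workflow from model training to deploying a prediction service is shown in 9 steps in [AIStudio tutorial: Paddle Serving service deployment framework](https://www.paddlepaddle.org.cn/tutorials/projectdetail/1555945)
Because the feed and fetch parameter information cannot be viewed directly in the model file, it is inconvenient for users to assemble parameters. Paddle Serving therefore provides a tool that converts a Paddle model into the Serving format and generates a prototxt file containing the feed and fetch parameter information. Below is the prototxt file generated for the uci_housing example; for more conversion methods, see [How to save a model for Paddle Serving](SAVE_CN.md).
Because the feed and fetch parameter information cannot be viewed directly in the model file, it is inconvenient for users to assemble parameters. Paddle Serving therefore provides a tool that converts a Paddle model into the Serving format and generates a prototxt file containing the feed and fetch parameter information. Below is the prototxt file generated for the uci_housing example; for more conversion methods, see [How to save a model for Paddle Serving](Save_CN.md).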
```
feed_var {
name: "x"
......@@ -124,15 +124,15 @@ C++ Serving的核心执行引擎是一个有向无环图,图中的每个节点
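### 3.3 Model management and hot loading
The C++ engine of Paddle Serving supports model management, including management of multiple models and of multiple versions of a model. To keep the inference service available while a model is being replaced, the model must be hot loaded without interrupting the service. Paddle Serving supports this feature and provides a tool that monitors newly produced models and updates the local model. For a concrete example, see [Hot loading in Paddle Serving](HOT_LOADING_IN_SERVING_CN.md).
The C++ engine of Paddle Serving supports model management, including management of multiple models and of multiple versions of a model. To keep the inference service available while a model is being replaced, the model must be hot loaded without interrupting the service. Paddle Serving supports this feature and provides a tool that monitors newly produced models and updates the local model. For a concrete example, see [Hot loading in Paddle Serving](C++_Serving/Hot_Loading_CN.md).
### 3.4 Model encryption and decryption
Paddle Serving encrypts models with a symmetric encryption algorithm and decrypts them in memory while the service loads the model. This currently provides basic model security and does not guarantee absolute security; users can build on our design to achieve a higher level of security. See [Encrypted model inference](ENCRYPTION_CN.md)
Paddle Serving encrypts models with a symmetric encryption algorithm and decrypts them in memory while the service loads the model. This currently provides basic model security and does not guarantee absolute security; users can build on our design to achieve a higher level of security. See [Encrypted model inference](C++_Serving/Encryption_CN.md)
### 3.5 A/B Test
After a model has been thoroughly evaluated offline, an online A/B test is usually needed to decide whether to launch the service at scale. The figure below shows the basic structure of an A/B test with Paddle Serving: once the client is configured accordingly, traffic is automatically distributed to different servers to carry out the A/B test. For a concrete example, see [How to do an A/B test with Paddle Serving](ABTEST_IN_PADDLE_SERVING_CN.md).
After a model has been thoroughly evaluated offline, an online A/B test is usually needed to decide whether to launch the service at scale. The figure below shows the basic structure of an A/B test with Paddle Serving: once the client is configured accordingly, traffic is automatically distributed to different servers to carry out the A/B test. For a concrete example, see [How to do an A/B test with Paddle Serving](C++_Serving/ABTEST_CN.md).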
<p align="center">
<br>
......@@ -193,7 +193,7 @@ Pipeline Serving的网络框架采用gRPC和gPRC gateway。gRPC service接收RPC
</center>
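### 5.2 Core design and use cases
The core design of Pipeline Serving is a graph execution engine whose basic processing units are OP and Channel; combining them forms a directed acyclic graph. For design and usage documentation, see [Pipeline Serving design and implementation](PIPELINE_SERVING_CN.md)
The core design of Pipeline Serving is a graph execution engine whose basic processing units are OP and Channel; combining them forms a directed acyclic graph. For design and usage documentation, see [Pipeline Serving design and implementation](Python_Pipeline/Pipeline_Design_CN.md)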
<center>
<img src='images/pipeline_serving-image2.png' height = "300" align="middle"/>
</center>
......@@ -201,11 +201,8 @@ Pipeline Serving核心设计是图执行引擎,基本处理单元是OP和Chann
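## 6. Future plans
### 6.1 Automatic cloud deployment
To make it easier for users to deploy Paddle inference models online, upcoming versions of Paddle Serving will provide task orchestration tools for the Kubernetes ecosystem.
### 6.2 Vector retrieval and tree-structured retrieval
### 6.1 Vector retrieval and tree-structured retrieval
Recall systems in recommendation and advertising scenarios usually need fast vector-based retrieval or fast tree-structured retrieval. Paddle Serving will integrate or extend retrieval engines for this.
### 6.3 Service monitoring
### 6.2 Service monitoring
Integrate Prometheus monitoring, an open source combination of monitoring, alerting and a time series database that is well suited to monitoring k8s and docker.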
# Paddle Serving Design Doc
([简体中文](./DESIGN_DOC_CN.md)|English)
([简体中文](./Serving_Design_CN.md)|English)
## 1. Design Objectives
......@@ -53,16 +53,16 @@ Paddle Serving takes into account a series of issues such as different operating
Cross-platform is not dependent on the operating system, nor on the hardware environment. Applications developed under one operating system can still run under another operating system. Therefore, the design should consider not only the development language and the cross-platform components, but also the interpretation differences of the compilers on different systems.
Docker is an open source application container engine that allows developers to package their applications and dependencies into a portable container, and then publish it to any popular Linux machine or Windows machine. We have packaged a variety of Docker images for the Paddle Serving framework; refer to the image list in [Docker Images](DOCKER_IMAGES.md) and select an image according to your usage. Docker usage is documented in [How to run PaddleServing in Docker](RUN_IN_DOCKER.md). Currently, the Python webservice mode can be deployed and run natively on both Linux and Windows; see [Paddle Serving for Windows Users](WINDOWS_TUTORIAL.md)
Docker is an open source application container engine that allows developers to package their applications and dependencies into a portable container, and then publish it to any popular Linux machine or Windows machine. We have packaged a variety of Docker images for the Paddle Serving framework; refer to the image list in [Docker Images](Docker_Images_EN.md) and select an image according to your usage. Docker usage is documented in [How to run PaddleServing in Docker](Run_In_Docker_EN.md). Currently, the Python webservice mode can be deployed and run natively on both Linux and Windows; see [Paddle Serving for Windows Users](Windows_Tutorial_EN.md)
> Support client SDKs in multiple development languages
Paddle Serving provides client SDKs in 4 development languages: Python, C++, Java and Golang. The Golang SDK is under construction, and we hope that interested open source developers can help by submitting PRs.
Paddle Serving provides client SDKs in 3 development languages: Python, C++ and Java. We hope that interested open source developers can help by submitting PRs.
+ Python: refer to the client examples under python/examples or the web service example in section 4.2.
+ C++: refer to [Writing a prediction service from scratch](CREATING.md)
+ Java: refer to [Paddle Serving Client Java SDK](JAVA_SDK.md)
+ Golang: refer to [How to use Go Client of Paddle Serving](deprecated/IMDB_GO_CLIENT.md)
+ C++: refer to [Writing a prediction service from scratch](C++_Serving/Creat_C++Serving_CN.md)
+ Java: refer to [Paddle Serving Client Java SDK](Java_SDK_EN.md)
> Support multiple hardware devices
......@@ -72,7 +72,7 @@ The inference framework of the well-known deep learning platform only supports C
Models trained on other deep learning platforms can be converted with the [PaddlePaddle/X2Paddle tool](https://github.com/PaddlePaddle/X2Paddle). We have converted multiple mainstream CV models to Paddle models; conversion of TensorFlow, Caffe, ONNX and PyTorch models has been tested. See also [AIStudio tutorial: Paddle Serving service deployment framework](https://www.paddlepaddle.org.cn/tutorials/projectdetail/1555945)
Because it is impossible to directly view the feed and fetch parameter information in the model file, it is not convenient for users to assemble the parameters. Therefore, Paddle Serving developed a tool to convert the Paddle model into Serving format and generate a prototxt file containing feed and fetch parameter information. The following is the generated prototxt file of the uci_housing example. For more conversion methods, refer to [How to save a servable model of Paddle Serving?](SAVE.md).
Because it is impossible to directly view the feed and fetch parameter information in the model file, it is not convenient for users to assemble the parameters. Therefore, Paddle Serving developed a tool to convert the Paddle model into Serving format and generate a prototxt file containing feed and fetch parameter information. The following is the generated prototxt file of the uci_housing example. For more conversion methods, refer to [How to save a servable model of Paddle Serving?](Save_EN.md).
```
feed_var {
name: "x"
......@@ -121,14 +121,14 @@ The core execution engine of Paddle Serving is a Directed acyclic graph(DAG). In
<p>
### 3.3 Model Management and Hot Reloading
C++ Serving supports model management functions, including management of multiple models and multiple model versions. To ensure service availability, the model needs to be hot loaded without service interruption. Paddle Serving supports this feature and provides a tool that monitors newly produced models and updates local models. Please refer to [Hot loading in Paddle Serving](HOT_LOADING_IN_SERVING.md) for specific examples.
C++ Serving supports model management functions, including management of multiple models and multiple model versions. To ensure service availability, the model needs to be hot loaded without service interruption. Paddle Serving supports this feature and provides a tool that monitors newly produced models and updates local models. Please refer to [Hot loading in Paddle Serving](C++_Serving/Hot_Loading_EN.md) for specific examples.
### 3.4 Model Encryption Inference
Paddle Serving uses a symmetric encryption algorithm to encrypt the model and decrypts it in memory while the service loads the model. This currently provides basic model security and does not guarantee absolute model security. Users can build on our design to achieve a higher level of security. For documentation, refer to [Model Encryption Inference](ENCRYPTION.md)
Paddle Serving uses a symmetric encryption algorithm to encrypt the model and decrypts it in memory while the service loads the model. This currently provides basic model security and does not guarantee absolute model security. Users can build on our design to achieve a higher level of security. For documentation, refer to [Model Encryption Inference](C++_Serving/Encryption_EN.md)
### 3.5 A/B Test
After sufficient offline evaluation of the model, an online A/B test is usually needed to decide whether to enable the service on a large scale. The following figure shows the basic structure of an A/B test with Paddle Serving. Once the client is configured accordingly, traffic is automatically distributed to different servers to carry out the A/B test. Please refer to [ABTEST in Paddle Serving](ABTEST_IN_PADDLE_SERVING.md) for specific examples.
After sufficient offline evaluation of the model, an online A/B test is usually needed to decide whether to enable the service on a large scale. The following figure shows the basic structure of an A/B test with Paddle Serving. Once the client is configured accordingly, traffic is automatically distributed to different servers to carry out the A/B test. Please refer to [ABTEST in Paddle Serving](C++_Serving/ABTEST_EN.md) for specific examples.
<p align="center">
<br>
......@@ -193,7 +193,7 @@ The network framework of Pipeline Serving uses gRPC and gPRC gateway. The gRPC s
### 5.2 Core Design And Use Cases
The core design of Pipeline Serving is a graph execution engine, and the basic processing units are OP and Channel. A directed acyclic graph is built by combining them. For design and usage documentation, refer to [Pipeline Serving](PIPELINE_SERVING.md)
The core design of Pipeline Serving is a graph execution engine, and the basic processing units are OP and Channel. A directed acyclic graph is built by combining them. For design and usage documentation, refer to [Pipeline Serving](Python_Pipeline/Pipeline_Design_EN.md)
<center>
<img src='images/pipeline_serving-image2.png' height = "300" align="middle"/>
......
## Paddle Serving uses TensorRT
(English|[简体中文](./TENSOR_RT_CN.md))
### Background
Deploying models trained on mainstream frameworks with NVIDIA's TensorRT tool can greatly increase inference speed, often at least doubling it compared with the original framework, while also using less device memory. Mastering TensorRT deployment is therefore very useful for anyone who needs to deploy deep learning models. Paddle Serving provides comprehensive support for the TensorRT ecosystem.
### Environment
The Cuda10.1, Cuda10.2 and Cuda11 versions of Serving support TensorRT.
#### Install Paddle
[Development using Docker environment](./RUN_IN_DOCKER.md) and [Docker image list](./DOCKER_IMAGES.md) list the TensorRT development images. After starting a container from the image, you need to install a Paddle whl package that supports TensorRT; refer to the documentation on the home page.
```
# GPU Cuda10.2 environment please execute
pip install paddlepaddle-gpu==2.0.0
```
**Note**: If your Cuda version is not 10.2, do not execute the above command directly. Instead, refer to [Paddle official documentation-multi-version whl package list
](https://www.paddlepaddle.org.cn/documentation/docs/en/install/Tables_en.html#multi-version-whl-package-list-release),
select the URL for the corresponding GPU environment and install it. For example, Python2.7 users on Cuda 10.1 should select the url corresponding to `cp27-cp27mu` and
`cuda10.1-cudnn7.6-trt6.0.1.5`, copy it and execute
```
pip install https://paddle-wheel.bj.bcebos.com/with-trt/2.0.0-gpu-cuda10.1-cudnn7-mkl/paddlepaddle_gpu-2.0.0.post101-cp27-cp27mu-linux_x86_64.whl
```
Because the default `paddlepaddle-gpu==2.0.0` targets Cuda 10.2 and is not built with TensorRT, to use TensorRT with `paddlepaddle-gpu` you need to find `cuda10.2-cudnn8.0-trt7.1.3` in the multi-version whl package list above and download the package for your Python version.
#### Install Paddle Serving
```
# Cuda10.2
pip install paddle-serving-server-gpu==${VERSION}.post102
# Cuda 10.1
pip install paddle-serving-server-gpu==${VERSION}.post101
# Cuda 11
pip install paddle-serving-server-gpu==${VERSION}.post11
```
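The version suffix follows directly from the Cuda version. As an illustrative sketch, the mapping in the commands above can be written down once and reused; `serving_requirement` is a hypothetical helper, and the package name and version are supplied by the caller rather than assumed here.

```python
# Suffixes taken from the install commands above.
CUDA_SUFFIX = {"10.2": "post102", "10.1": "post101", "11": "post11"}


def serving_requirement(package, version, cuda):
    """Return a pip requirement string such as "<package>==<version>.post102"."""
    try:
        suffix = CUDA_SUFFIX[cuda]
    except KeyError:
        raise ValueError(f"no TensorRT-enabled build listed for Cuda {cuda}") from None
    return f"{package}=={version}.{suffix}"
```

An unsupported Cuda version raises `ValueError` instead of silently producing a requirement that pip cannot resolve.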
### Use TensorRT
#### RPC mode
In [Serving model example](../python/examples), we provide models that can be accelerated with TensorRT, such as the [Faster_RCNN model](../python/examples/detection/faster_rcnn_r50_fpn_1x_coco) under detection.
We only need to run:
```
wget --no-check-certificate https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/faster_rcnn_r50_fpn_1x_coco.tar
tar xf faster_rcnn_r50_fpn_1x_coco.tar
python -m paddle_serving_server.serve --model serving_server --port 9494 --gpu_ids 0 --use_trt
```
This starts the TensorRT version of the faster_rcnn model server.
#### Local Predictor mode
In [local_predictor](../python/paddle_serving_app/local_predict.py#L52), users can explicitly specify `use_trt=True` and pass it to `load_model_config`.
Otherwise, usage is the same as for any other Local Predictor; just make sure the model is compatible with TensorRT.
#### Pipeline Mode
In [Pipeline mode](./PIPELINE_SERVING.md), our [imagenet example](../python/examples/pipeline/imagenet/config.yml#L23) shows how to enable TensorRT.
## Using TensorRT with Paddle Serving
([English](./TENSOR_RT.md)|简体中文)
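### Background
Deploying models trained on mainstream frameworks with NVIDIA's TensorRT tool can greatly increase inference speed, often at least doubling it compared with the original framework, while also using less device memory. Mastering TensorRT deployment is therefore very useful for anyone who needs to deploy deep learning models. Paddle Serving provides comprehensive support for the TensorRT ecosystem.
### Environment
The Cuda10.1, Cuda10.2 and Cuda11 versions of Serving support TensorRT.
#### Install Paddle
[Development using the Docker environment](./RUN_IN_DOCKER_CN.md) and the [Docker image list](./DOCKER_IMAGES_CN.md) list the TensorRT development images. After starting a container from the image, you need to install a Paddle whl package that supports TensorRT; refer to the documentation on the home page.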
```
# For GPU Cuda10.2 environments, run
pip install paddlepaddle-gpu==2.0.0
```
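**Note**: If your Cuda version is not 10.2, do not execute the above command directly. Instead, refer to [Paddle official documentation: multi-version whl package list
](https://www.paddlepaddle.org.cn/documentation/docs/zh/install/Tables.html#whl-release),
select the url for the corresponding GPU environment and install it. For example, Python2.7 users on Cuda 10.1 should select the url in the table corresponding to `cp27-cp27mu` and
`cuda10.1-cudnn7.6-trt6.0.1.5`, copy it and execute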
```
pip install https://paddle-wheel.bj.bcebos.com/with-trt/2.0.0-gpu-cuda10.1-cudnn7-mkl/paddlepaddle_gpu-2.0.0.post101-cp27-cp27mu-linux_x86_64.whl
```
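Because the default `paddlepaddle-gpu==2.0.0` targets Cuda 10.2 and is not built with TensorRT, to use TensorRT with `paddlepaddle-gpu` you need to find `cuda10.2-cudnn8.0-trt7.1.3` in the multi-version whl package list above and download the package for your Python version.
#### Install Paddle Serving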
```
# Cuda10.2
pip install paddle-serving-server-gpu==${VERSION}.post102
# Cuda 10.1
pip install paddle-serving-server-gpu==${VERSION}.post101
# Cuda 11
pip install paddle-serving-server-gpu==${VERSION}.post11
```
### Use TensorRT
#### RPC mode
In [Serving model examples](../python/examples), we provide models that can be accelerated with TensorRT, such as the [Faster_RCNN model](../python/examples/detection/faster_rcnn_r50_fpn_1x_coco) under detection.
We only need to run:
```
wget --no-check-certificate https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/faster_rcnn_r50_fpn_1x_coco.tar
tar xf faster_rcnn_r50_fpn_1x_coco.tar
python -m paddle_serving_server.serve --model serving_server --port 9494 --gpu_ids 0 --use_trt
```
This starts the TensorRT version of the faster_rcnn model server.
#### Local Predictor mode
In [local_predictor](../python/paddle_serving_app/local_predict.py#L52), users can explicitly specify `use_trt=True` and pass it to `load_model_config`.
Otherwise, usage is the same as for any other Local Predictor; just make sure the model is compatible with TensorRT.
#### Pipeline mode
In [Pipeline mode](./PIPELINE_SERVING_CN.md), our [imagenet example](../python/examples/pipeline/imagenet/config.yml#L23) shows how to enable TensorRT.
## Paddle Serving guide for Windows
([English](./WINDOWS_TUTORIAL.md)|简体中文)
([English](./Windows_Tutorial_EN.md)|简体中文)
### Summary
......@@ -97,7 +97,7 @@ r = requests.post(url=url, headers=headers, data=json.dumps(data))
print(r.json())
```
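Users only need to follow the instructions above and implement the relevant logic in the corresponding functions. For more information, see [How to develop a new Web Service?](./NEW_WEB_SERVICE_CN.md)
Users only need to follow the instructions above and implement the relevant logic in the corresponding functions. For more information, see [How to develop a new Web Service?](./C++_Serving/Http_Service_CN.md)
After development is complete, run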
......
## Paddle Serving for Windows Users
(English|[简体中文](./WINDOWS_TUTORIAL_CN.md))
(English|[简体中文](./Windows_Tutorial_CN.md))
### Summary
......@@ -97,7 +97,7 @@ r = requests.post(url=url, headers=headers, data=json.dumps(data))
print(r.json())
```
Users only need to follow the instructions above and implement the relevant logic in the corresponding functions. For more information, refer to [How to develop a new Web Service?](./NEW_WEB_SERVICE.md)
Users only need to follow the instructions above and implement the relevant logic in the corresponding functions. For more information, refer to [How to develop a new Web Service?](./C++_Serving/Http_Service_EN.md)
After development is complete, run
......
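# Building a prediction service cluster
As described in [Client configuration](../CLIENT_CONFIGURE.md), a prediction cluster with multiple replicas and multiple Variants can be built by configuring predictors.prototxt, the client SDK's configuration file, appropriately. Taking an image classification task as an example, the following simulates on a single machine a single-Variant multi-replica cluster and a multi-Variant prediction cluster.
## 1. Single-Variant, multi-replica prediction cluster
### 1.1 Create a serving replica on the local machine
First, copy the serving directory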
```shell
$ cd /path/to/paddle-serving/build/output/demo
$ cp -r serving/ serving_new/
$ cd serving_new/
```
In the serving_new directory, add the following line to conf/gflags.conf to change the startup port to 8011, so that this replica listens on a different port
```shell
--port=8011
```
Then start the new replica
```shell
$ bin/serving&
```
### 1.2 Modify the client configuration to add the new replica address to the ip list:
```shell
$ cd /path/to/paddle-serving/build/output/demo/client/image_classification
```
Modify the ImageClassifyService section of conf/predictors.prototxt as follows
```JSON
predictors {
name: "ximage"
service_name: "baidu.paddle_serving.predictor.image_classification.ImageClassifyService"
endpoint_router: "WeightedRandomRender"
weighted_random_render_conf {
variant_weight_list: "50"
}
variants {
tag: "var1"
naming_conf {
cluster: "list://127.0.0.1:8010, 127.0.0.1:8011" # a new replica address is added here
}
}
}
```
Restart the client
```shell
$ bin/ximage&
```
Check whether both serving replica directories have received requests:
```shell
$ cd /path/to/paddle-serving/build/output/demo/serving
$ tail -f log/serving.INFO
$ cd /path/to/paddle-serving/build/output/demo/serving_new
$ tail -f log/serving.INFO
```
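## 2. Multiple Variants
### 2.1 Create a new serving replica on the local machine
The steps are the same as in section 1.1 and are omitted here
### 2.2 Modify the client configuration to add a Variant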
```shell
$ cd /path/to/paddle-serving/build/output/demo/client/image_classification
```
Modify the ImageClassifyService section of conf/predictors.prototxt as follows
```JSON
predictors {
name: "ximage"
service_name: "baidu.paddle_serving.predictor.image_classification.ImageClassifyService"
endpoint_router: "WeightedRandomRender"
weighted_random_render_conf {
variant_weight_list: "50 | 50" # 2 variants in total, representing 2 versions of the model; the weights give the proportion of traffic scheduled to each
}
variants {
tag: "var1"
naming_conf {
cluster: "list://127.0.0.1:8010"
}
}
variants { # 增加一个variant
tag: "var2"
naming_conf {
cluster: "list://127.0.0.1:8011"
}
}
}
```
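The effect of `variant_weight_list` can be illustrated with a small simulation. This is only a sketch, not the WeightedRandomRender implementation: it assumes the weights are separated by `|` and act as relative traffic shares. `parse_weights` and `pick_variant` are hypothetical helper names.

```python
import random
from collections import Counter


def parse_weights(weight_list):
    """Split a string such as "50 | 50" into integer weights."""
    return [int(w.strip()) for w in weight_list.split("|")]


def pick_variant(tags, weights, rng):
    """Pick one variant tag with probability proportional to its weight."""
    return rng.choices(tags, weights=weights, k=1)[0]


tags = ["var1", "var2"]
weights = parse_weights("50 | 50")
rng = random.Random(0)  # fixed seed so the run is repeatable
counts = Counter(pick_variant(tags, weights, rng) for _ in range(10000))
print(counts)
```

With equal weights, each tag should receive roughly half of the simulated requests.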
Restart the client:
```shell
$ bin/ximage&
```
Check whether both serving replicas have received requests:
```shell
$ cd /path/to/paddle-serving/build/output/demo/serving
$ tail -f log/serving.INFO
$ cd /path/to/paddle-serving/build/output/demo/serving_new
$ tail -f log/serving.INFO
```
Check whether the client has received responses from both Variant1 and Variant2:
```shell
$ cd /path/to/paddle-serving/build/output/demo/client/image_classification
$ tail -f log/ximage.INFO
```
Normal output looks like this:
```
I0307 17:54:22.862087 24719 ximage.cpp:172] Debug string:
I0307 17:54:22.862650 24719 ximage.cpp:110] sample-0's classify result: n02112018,博美犬, prop: 0.522815
I0307 17:54:22.862666 24719 ximage.cpp:114] Succ call predictor[ximage], the tag is: var1, elapse_ms: 333
I0307 17:54:23.194780 24719 ximage.cpp:172] Debug string:
I0307 17:54:23.195322 24719 ximage.cpp:110] sample-0's classify result: n02112018,博美犬, prop: 0.522815
I0307 17:54:23.195334 24719 ximage.cpp:114] Succ call predictor[ximage], the tag is: var2, elapse_ms: 332
```
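To check the traffic split over a longer run, the `the tag is:` field of the client log can be counted. The sketch below assumes the log line layout shown above; `count_variant_tags` is a hypothetical helper, and the two sample lines are copied from that output.

```python
import re
from collections import Counter

TAG_PATTERN = re.compile(r"the tag is: (\w+)")


def count_variant_tags(lines):
    """Count how many successful calls were answered by each variant tag."""
    counts = Counter()
    for line in lines:
        match = TAG_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample_log = [
    "I0307 17:54:22.862666 24719 ximage.cpp:114] Succ call predictor[ximage], the tag is: var1, elapse_ms: 333",
    "I0307 17:54:23.195334 24719 ximage.cpp:114] Succ call predictor[ximage], the tag is: var2, elapse_ms: 332",
]
print(dict(count_variant_tags(sample_log)))
```

In practice the lines would come from reading log/ximage.INFO instead of the inline list.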
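# CTR Prediction Model
## 1. Background
In business scenarios such as search, recommendation, and online advertising, embedding parameters are often extremely large, reaching hundreds of GB or even the TB level. Training a model of this size requires multi-machine distributed training, with the parameters sharded for update and storage. The trained model is likewise hard to load on a single machine when it is used for online serving. Paddle Serving provides a read/write service for large-scale sparse parameters: users can host very large sparse parameters in the parameter service in key-value form, and online prediction only needs to read the required subset of parameters back from the service before running the rest of the prediction flow.
We use the CTR prediction model as an example to demonstrate how to use the large-scale sparse parameter service in Paddle Serving. For model details, see the [original model](https://github.com/PaddlePaddle/models/tree/v1.5/PaddleRec/ctr)
According to the [dataset description](https://www.kaggle.com/c/criteo-display-ad-challenge/data), the raw input of the model consists of 13 integer features and 26 categorical features. In our model, the 13 integer features are fed together, as one dense feature, into a single data layer, while each of the 26 categorical features is fed as a separate feature into its own data layer. In addition, the label is fed in as a feature so that the AUC metric can be computed.
With the default training parameters, the embedding dim of this model is 1,000,000 and the size is 10, i.e. the parameter matrix is a 1000000 x 10 float matrix. It occupies 1000000 x 10 x sizeof(float) = 40,000,000 bytes of memory, about 38 MiB. **In real scenarios the embedding parameters are much larger, so this demo is for demonstration purposes only.**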
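The size estimate can be reproduced with a few lines of arithmetic. The sketch assumes a 32-bit (4-byte) float, which the text implies through `sizeof(float)` but does not state.

```python
rows = 1000000        # embedding dim from the text
cols = 10             # embedding size from the text
bytes_per_float = 4   # assumption: 32-bit float

size_bytes = rows * cols * bytes_per_float
size_mib = size_bytes / (1024 * 1024)
print(size_bytes, round(size_mib, 1))
```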
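## 2. Model Pruning
At the time of writing ([v1.5](https://github.com/PaddlePaddle/models/tree/v1.5)), the training script uses the PaddlePaddle py_reader to speed up sample reading, so the program contains py_reader-related OPs. Only the model parameters are saved during training, not the program, so the saved parameters cannot be loaded directly by the inference library. In addition, the final output tensors of the original network are auc and batch_auc, whereas inference only needs the predict value of each sample, so the output tensor of the model has to be changed to predict. Finally, to demonstrate the sparse parameter service, we deliberately remove the lookup_table OPs contained in the embedding layers from the inference program, use the output variables of the embedding layers as network inputs, and add the corresponding feed OPs. After fetching the embedding vectors from the sparse parameter service at inference time, we can then feed the data directly into the output variable of each embedding.
For these reasons, the original program has to be pruned. The rough procedure is:
1) Remove the py_reader-related code and use the reader and DataFeeder that ship with fluid instead
2) Modify the original network configuration so that the predict variable is the fetch target
3) Modify the original network configuration so that the outputs of the embedding layers of the 26 sparse parameters are the feed targets, for use with the sparse parameter service later
4) Train the modified network locally for 1 batch, then call `fluid.io.save_inference_model()` to obtain the pruned model program
5) Process the pruned program once more with Python to remove the lookup_table OPs of the embedding layers. This is needed because, in step 4, Paddle Fluid's `save_inference_model()` currently does not prune the program completely and keeps the lookup_table OPs of the embeddings. If these OPs are not removed, each embedding output variable has 2 input OPs: a feed OP (the one we add) and a lookup_table OP. Since the lookup_table OP has no input, its output and the output of the feed OP overwrite each other, which corrupts the result. The network also still contains the SparseFeatFactors variable (the variable for the globally shared embedding matrix). It must be removed as well; otherwise the network still tries to read the embedding parameters from disk when it is loaded, which defeats the purpose of this demo.
6) Save the program obtained in step 4 together with the model parameters saved by distributed training (everything except the embedding) to form the complete inference model
The network configuration of the model after pruning steps 1) to 5) is shown below: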
![Pruned CTR prediction network](../images/pruned-ctr-network.png)
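The pruning process is described in detail below:
### 2.1 Remove py_reader from the network configuration
When the inference program calls ctr_dnn_model(), add the `use_py_reader=False` argument. This disables the py_reader-related code in the definition of ctr_dnn_model.
Before: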
```python
def train():
    args = parse_args()
    if not os.path.isdir(args.model_output_dir):
        os.mkdir(args.model_output_dir)
    loss, auc_var, batch_auc_var, py_reader, _ = ctr_dnn_model(args.embedding_size, args.sparse_feature_dim)
    ...
```
After:
```python
def train():
    args = parse_args()
    if not os.path.isdir(args.model_output_dir):
        os.mkdir(args.model_output_dir)
    loss, auc_var, batch_auc_var, py_reader, _ = ctr_dnn_model(args.embedding_size, args.sparse_feature_dim, use_py_reader=False)
    ...
```
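### 2.2 Modify the feed targets and fetch targets in the network configuration
As described at the start of Section 2, to make the program suitable for demonstrating sparse parameters, we prune it by changing the feed variable list and the fetch variables in `ctr_dnn_model`:
1) In the inference program, the inputs for the 26 sparse features become the output variables of each feature's embedding layer
2) The fetch targets return predict instead of auc_var and batch_auc_var
At the time of writing, the original network configuration `ctr_dnn_model` (in network_conf.py) is defined as follows: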
```python
def ctr_dnn_model(embedding_size, sparse_feature_dim, use_py_reader=True):
    def embedding_layer(input):
        emb = fluid.layers.embedding(
            input=input,
            is_sparse=True,
            # you need to patch https://github.com/PaddlePaddle/Paddle/pull/14190
            # if you want to set is_distributed to True
            is_distributed=False,
            size=[sparse_feature_dim, embedding_size],
            param_attr=fluid.ParamAttr(name="SparseFeatFactors",
                                       initializer=fluid.initializer.Uniform()))
        return fluid.layers.sequence_pool(input=emb, pool_type='average')  # change 1 needed
    dense_input = fluid.layers.data(
        name="dense_input", shape=[dense_feature_dim], dtype='float32')
    sparse_input_ids = [
        fluid.layers.data(name="C" + str(i), shape=[1], lod_level=1, dtype='int64')
        for i in range(1, 27)]
    label = fluid.layers.data(name='label', shape=[1], dtype='int64')
    words = [dense_input] + sparse_input_ids + [label]
    py_reader = None
    if use_py_reader:
        py_reader = fluid.layers.create_py_reader_by_data(capacity=64,
                                                          feed_list=words,
                                                          name='py_reader',
                                                          use_double_buffer=True)
        words = fluid.layers.read_file(py_reader)
    sparse_embed_seq = list(map(embedding_layer, words[1:-1]))  # change 2 needed
    concated = fluid.layers.concat(sparse_embed_seq + words[0:1], axis=1)
    fc1 = fluid.layers.fc(input=concated, size=400, act='relu',
                          param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
                              scale=1 / math.sqrt(concated.shape[1]))))
    fc2 = fluid.layers.fc(input=fc1, size=400, act='relu',
                          param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
                              scale=1 / math.sqrt(fc1.shape[1]))))
    fc3 = fluid.layers.fc(input=fc2, size=400, act='relu',
                          param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
                              scale=1 / math.sqrt(fc2.shape[1]))))
    predict = fluid.layers.fc(input=fc3, size=2, act='softmax',
                              param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
                                  scale=1 / math.sqrt(fc3.shape[1]))))
    cost = fluid.layers.cross_entropy(input=predict, label=words[-1])
    avg_cost = fluid.layers.reduce_sum(cost)
    accuracy = fluid.layers.accuracy(input=predict, label=words[-1])
    auc_var, batch_auc_var, auc_states = \
        fluid.layers.auc(input=predict, label=words[-1], num_thresholds=2 ** 12, slide_steps=20)
    return avg_cost, auc_var, batch_auc_var, py_reader, words  # change 3 needed
```
After:
```python
def ctr_dnn_model(embedding_size, sparse_feature_dim, use_py_reader=True):
    def embedding_layer(input):
        emb = fluid.layers.embedding(
            input=input,
            is_sparse=True,
            # you need to patch https://github.com/PaddlePaddle/Paddle/pull/14190
            # if you want to set is_distributed to True
            is_distributed=False,
            size=[sparse_feature_dim, embedding_size],
            param_attr=fluid.ParamAttr(name="SparseFeatFactors",
                                       initializer=fluid.initializer.Uniform()))
        seq = fluid.layers.sequence_pool(input=emb, pool_type='average')
        return emb, seq  # corresponds to change 1 above
    dense_input = fluid.layers.data(
        name="dense_input", shape=[dense_feature_dim], dtype='float32')
    sparse_input_ids = [
        fluid.layers.data(name="C" + str(i), shape=[1], lod_level=1, dtype='int64')
        for i in range(1, 27)]
    label = fluid.layers.data(name='label', shape=[1], dtype='int64')
    words = [dense_input] + sparse_input_ids + [label]
    sparse_embed_and_seq = list(map(embedding_layer, words[1:-1]))
    emb_list = [x[0] for x in sparse_embed_and_seq]  # corresponds to change 2 above
    sparse_embed_seq = [x[1] for x in sparse_embed_and_seq]
    concated = fluid.layers.concat(sparse_embed_seq + words[0:1], axis=1)
    train_feed_vars = words  # corresponds to change 2 above
    inference_feed_vars = emb_list + words[0:1]
    fc1 = fluid.layers.fc(input=concated, size=400, act='relu',
                          param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
                              scale=1 / math.sqrt(concated.shape[1]))))
    fc2 = fluid.layers.fc(input=fc1, size=400, act='relu',
                          param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
                              scale=1 / math.sqrt(fc1.shape[1]))))
    fc3 = fluid.layers.fc(input=fc2, size=400, act='relu',
                          param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
                              scale=1 / math.sqrt(fc2.shape[1]))))
    predict = fluid.layers.fc(input=fc3, size=2, act='softmax',
                              param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
                                  scale=1 / math.sqrt(fc3.shape[1]))))
    cost = fluid.layers.cross_entropy(input=predict, label=words[-1])
    avg_cost = fluid.layers.reduce_sum(cost)
    accuracy = fluid.layers.accuracy(input=predict, label=words[-1])
    auc_var, batch_auc_var, auc_states = \
        fluid.layers.auc(input=predict, label=words[-1], num_thresholds=2 ** 12, slide_steps=20)
    fetch_vars = [predict]
    # corresponds to change 3 above
    return avg_cost, auc_var, batch_auc_var, train_feed_vars, inference_feed_vars, fetch_vars
```
Notes:
1) At change 1, we return the output variable of the embedding layer
2) At change 2, we save the output variables of the embedding layers in `emb_list`, which is then stored in `inference_feed_vars` and later used to specify the feed variable list in `save_inference_model()`.
3) At change 3, we return the `words` variable as the feed variable list for training (`train_feed_vars`), the output variables of the embedding layers as the feed variable list for inference (`inference_feed_vars`), and `predict` as the fetch target (`fetch_vars`). `inference_feed_vars` and `fetch_vars` are used to specify the feed variable list and the fetch target list in `fluid.io.save_inference_model()`
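### 2.3 Save the pruned program with fluid.io.save_inference_model()
`fluid.io.save_inference_model()` not only saves the model parameters but also prunes the program according to the feed variable list and fetch target list arguments, producing a program suited to inference. Roughly, based on the forward network configuration, it starts from the fetch target list, searches backward for the OPs these targets depend on, adds the inputs of each such OP to the target variable list, and repeats this recursively until all the OPs and variables depended on have been found.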
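The backward search can be pictured on a toy graph. The sketch below is not Paddle's implementation: it models OPs as plain tuples, and `prune_ops` is a hypothetical helper. It also assumes the search stops at feed variables, which is the ideal behavior rather than the behavior Section 2 reports for lookup_table.

```python
def prune_ops(ops, feed_names, fetch_names):
    """Keep only the OPs needed to compute fetch_names from feed_names.

    ops is a list of (op_name, input_names, output_names) in forward order.
    """
    needed = set(fetch_names)
    kept = []
    # Walk the forward OP list backwards, starting from the fetch targets.
    for name, inputs, outputs in reversed(ops):
        # An OP is kept if it produces a needed variable that is not fed in.
        if any(out in needed and out not in feed_names for out in outputs):
            kept.append(name)
            needed.update(inputs)
    kept.reverse()
    return kept


toy_ops = [
    ("lookup_table", ["C1"], ["emb"]),
    ("sequence_pool", ["emb"], ["seq"]),
    ("fc", ["seq", "dense_input"], ["predict"]),
    ("auc", ["predict", "label"], ["auc_var"]),
]
print(prune_ops(toy_ops, {"emb", "dense_input"}, ["predict"]))
```

In this toy graph the auc OP is dropped because nothing fetched depends on it, and lookup_table is dropped because its output is supplied by a feed.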
In Section 2.2 we obtained the required `inference_feed_vars` and `fetch_vars`. Next, wherever the training process saves the model parameters, we call `fluid.io.save_inference_model()` instead.
Before:
```python
def train_loop(args, train_program, py_reader, loss, auc_var, batch_auc_var,
               trainer_num, trainer_id):
    ...  # omitted
    for pass_id in range(args.num_passes):
        pass_start = time.time()
        batch_id = 0
        py_reader.start()
        try:
            while True:
                loss_val, auc_val, batch_auc_val = pe.run(fetch_list=[loss.name, auc_var.name, batch_auc_var.name])
                loss_val = np.mean(loss_val)
                auc_val = np.mean(auc_val)
                batch_auc_val = np.mean(batch_auc_val)
                logger.info("TRAIN --> pass: {} batch: {} loss: {} auc: {}, batch_auc: {}"
                            .format(pass_id, batch_id, loss_val/args.batch_size, auc_val, batch_auc_val))
                if batch_id % 1000 == 0 and batch_id != 0:
                    model_dir = args.model_output_dir + '/batch-' + str(batch_id)
                    if args.trainer_id == 0:
                        fluid.io.save_persistables(executor=exe, dirname=model_dir,
                                                   main_program=fluid.default_main_program())
                batch_id += 1
        except fluid.core.EOFException:
            py_reader.reset()
        print("pass_id: %d, pass_time_cost: %f" % (pass_id, time.time() - pass_start))
    ...  # omitted
```
After:
```python
def train_loop(args,
               train_program,
               train_feed_vars,
               inference_feed_vars,  # feed variable list used to prune the program
               fetch_vars,  # fetch variable list used to prune the program
               loss,
               auc_var,
               batch_auc_var,
               trainer_num,
               trainer_id):
    # py_reader has been removed, so use the DataFeeder that ships with fluid
    dataset = reader.CriteoDataset(args.sparse_feature_dim)
    train_reader = paddle.batch(
        paddle.reader.shuffle(
            dataset.train([args.train_data_path], trainer_num, trainer_id),
            buf_size=args.batch_size * 100),
        batch_size=args.batch_size)
    inference_feed_var_names = [var.name for var in inference_feed_vars]
    place = fluid.CPUPlace()
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())
    total_time = 0
    pass_id = 0
    batch_id = 0
    # Output directory; the same path that prune_program() in Section 2.4 reads from
    model_dir = args.model_output_dir + "/inference_only"
    feed_var_names = [var.name for var in train_feed_vars]
    feeder = fluid.DataFeeder(feed_var_names, place)
    for data in train_reader():
        loss_val, auc_val, batch_auc_val = exe.run(fluid.default_main_program(),
                                                   feed=feeder.feed(data),
                                                   fetch_list=[loss.name, auc_var.name, batch_auc_var.name])
        fluid.io.save_inference_model(model_dir,
                                      inference_feed_var_names,
                                      fetch_vars,
                                      exe,
                                      fluid.default_main_program())
        break  # we only need the pruned program, not the model parameters, so stop after training one batch
    loss_val = np.mean(loss_val)
    auc_val = np.mean(auc_val)
    batch_auc_val = np.mean(batch_auc_val)
    logger.info("TRAIN --> pass: {} batch: {} loss: {} auc: {}, batch_auc: {}"
                .format(pass_id, batch_id, loss_val/args.batch_size, auc_val, batch_auc_val))
```
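### 2.4 Process the inference program again with Python to remove the lookup_table OPs and the SparseFeatFactors variable
This step is needed because the program pruned by `fluid.io.save_inference_model()` still contains the lookup_table OPs. If the `save_inference_model` interface is improved in the future, this section can be skipped.
Main code: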
```python
def prune_program():
    args = parse_args()
    # Open the network configuration file on disk and deserialize it into a protobuf message
    model_dir = args.model_output_dir + "/inference_only"
    model_file = model_dir + "/__model__"
    with open(model_file, "rb") as f:
        protostr = f.read()
    proto = framework_pb2.ProgramDesc.FromString(six.binary_type(protostr))
    # Remove the lookup_table OPs
    block = proto.blocks[0]
    kept_ops = [op for op in block.ops if op.type != "lookup_table"]
    del block.ops[:]
    block.ops.extend(kept_ops)
    # Remove the SparseFeatFactors var
    kept_vars = [var for var in block.vars if var.name != "SparseFeatFactors"]
    del block.vars[:]
    block.vars.extend(kept_vars)
    # Write the result back to disk (the with statement closes each file)
    with open(model_file + ".pruned", "wb") as f:
        f.write(proto.SerializePartialToString())
    with open(model_file + ".prototxt.pruned", "w") as f:
        f.write(text_format.MessageToString(proto))
```
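### 2.5 Putting the pruning steps together
We provide save_program.py, a complete script for pruning the CTR prediction model. It is released together with [CTR distributed training and Serving pipeline deployment](https://github.com/PaddlePaddle/Serving/blob/master/doc/DEPLOY.md) and can be found in the training script directory of the trainer and pserver containers, or downloaded [here](https://github.com/PaddlePaddle/Serving/tree/master/doc/resource).
## 3. The Complete Prediction Flow
Client side:
1) Dense feature: read the 13 integer features of each sample from the dataset to form 1 dense feature
2) Sparse feature: read the 26 categorical features of each sample from the dataset and sign each one with hash(str(feature_index) + feature_string) to obtain the id of each feature, forming 26 sparse features
Serving side:
1) Dense feature: the dense feature consists of 13 float values, which are fed together into the LodTensor of the network variable dense_input
2) Sparse feature: for each of the 26 sparse feature ids, query the kv service for the corresponding embedding vector and feed it into the output variable of the corresponding embedding layer (26 in total). In our pruned network these variables are named embedding_0.tmp_0, embedding_1.tmp_0, ... embedding_25.tmp_0
3) Run inference and obtain the prediction result.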
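The client-side signing step can be sketched as follows. Two things here are assumptions rather than the demo's actual code: MD5 stands in for the `hash()` named above, because Python 3 randomizes string hashes per process, and the result is reduced modulo a `hash_dim` so that ids fall inside the embedding table. `feature_id` is a hypothetical helper and the sample values are made up.

```python
import hashlib


def feature_id(feature_index, feature_string, hash_dim):
    """Map one categorical feature to an integer id in [0, hash_dim)."""
    key = (str(feature_index) + feature_string).encode("utf-8")
    digest = hashlib.md5(key).hexdigest()
    return int(digest, 16) % hash_dim


sample = ["68fd1e64", "80e26c9b", "fb936136"]  # made-up categorical values
ids = [feature_id(i, value, 1000000) for i, value in enumerate(sample, start=1)]
print(ids)
```

Prefixing the feature index keeps the same string in two different feature slots from being mapped to the same key.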
# FAQ
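## 1. How do I change the port configuration?
A service built with this framework needs to bind a port. The port number is resolved in the following order:
- If port:xxx is specified in inferservice_file, that port is used;
- Otherwise, if --port=xxx is specified in gflags.conf, that port is used;
- Otherwise, the default port set in the program, 8010, is used.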
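The precedence above amounts to a simple fallback chain. The sketch below only illustrates the order; `resolve_port` and its parameter names are hypothetical and not part of the framework.

```python
DEFAULT_PORT = 8010


def resolve_port(inferservice_port=None, gflags_port=None):
    """Return the port following the precedence described above."""
    if inferservice_port is not None:
        return inferservice_port
    if gflags_port is not None:
        return gflags_port
    return DEFAULT_PORT


print(resolve_port(gflags_port=8011))
```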
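## 2. Why does the response time fluctuate so much in GPU prediction?
PaddleServing relies on the PaddlePaddle inference library to run prediction. On a GPU device, a process currently shares a single GPU stream, so the prediction computations of multiple requests within the process are strictly serialized. If 2 requests arrive at a Serving instance at the same time, then no matter how many worker threads the instance created at startup, there is no speedup: the later request is queued until the earlier one finishes.
## 3. How can I make full use of the computing power of a GPU card?
As mentioned in question 2, because of limitations of the inference library, a single Serving process can be bound to only one GPU card, and the process shares a single GPU stream, so all requests must be computed serially.
To improve GPU utilization, the approach we can think of at present is to start multiple Serving processes on a single GPU card, each bound to its own GPU stream, so that multiple streams compute in parallel. Whether this yields a speedup depends on several factors, mainly:
1. GPU compute used by a single stream: if a single stream already uses more than 50% of the GPU's compute, adding a stream will probably make the jobs of the 2 streams queue behind each other, slowing down the response time of both
2. GPU memory: a Serving process has to load the model parameters into GPU memory and allocate temporary variables from the GPU memory pool during computation. If a single Serving process already uses more than 50% of GPU memory, adding another Serving process exhausts GPU memory and the process exits with an error
The following steps can be used to test this:
1. When loading the model, choose FLUID_GPU_ANALYSIS or FLUID_GPU_ANALYSIS_DIR as the model type in model_toolkit.prototxt. The model is then analyzed statically and GPU memory usage is optimized to some extent
2. After step 1, start a single Serving process with the arguments `--gpuid=N --bthread_concurrency=4 --bthread_min_concurrency=4`. Start one client and run a stress test with a concurrency of 1, increasing the batch size from small to large and recording the average response time. Because compute is limited, once the batch size grows past a certain point the response time should increase noticeably, or, even if it does not, it may no longer meet the system requirements
3. Start 1 more Serving process with arguments that differ slightly from those in step 2: `--gpuid=N --bthread_concurrency=4 --bthread_min_concurrency=4 --port=8011`, where --port=8011 makes the new process use a new service port. Then stress-test both Serving processes at the same time and keep observing how the average response time changes as the batch size grows, until a trade-off between batch size and response time is reached
4. Repeat steps 2-3
5. Use the tests in steps 2-4 to decide how many Serving processes can share a single GPU card. In the actual deployment, start that many Serving processes on one GPU card to serve requests at the same time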
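# HTTP Interface
All Paddle Serving services can be accessed through an HTTP interface. The client only needs to build a JSON string following the Request message format defined by the Service. The client builds an HTTP request and sends the JSON data to the serving side in a POST request; the serving side **automatically** converts the JSON data into a protobuf message according to the Protobuf message format defined by the Service.
This document describes how to access the HTTP interface of Serving from Python and PHP.
## 1. Access URLs
The HTTP service of a Serving node uses the same port as the C++ service (for example, 8010). The URL pattern is: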
```
http://127.0.0.1:8010/ServiceName/inference
http://127.0.0.1:8010/ServiceName/debug
```
ServiceName must match the name configured in the Serving configuration file `conf/services.prototxt`. Suppose the following 2 services are configured:
```protobuf
services {
  name: "BuiltinTestEchoService"
  workflows: "workflow3"
}
services {
  name: "TextClassificationService"
  workflows: "workflow6"
}
```
The HTTP URLs for these 2 Serving services are then:
```
http://127.0.0.1:8010/BuiltinTestEchoService/inference
http://127.0.0.1:8010/BuiltinTestEchoService/debug
http://127.0.0.1:8010/TextClassificationService/inference
http://127.0.0.1:8010/TextClassificationService/debug
```
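The URL pattern is mechanical enough to generate from the service name. A minimal sketch, where `service_urls` is a hypothetical helper and the host and port are the example values used above:

```python
def service_urls(host, port, service_name):
    """Build the inference and debug URLs for one service."""
    base = "http://{}:{}/{}".format(host, port, service_name)
    return {"inference": base + "/inference", "debug": base + "/debug"}


for name in ["BuiltinTestEchoService", "TextClassificationService"]:
    print(service_urls("127.0.0.1", 8010, name))
```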
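## 2. Accessing HTTP Serving from Python
To access HTTP Serving from Python, the key is to build the request data in JSON format, which takes the following steps:
1) Build a Python object following the Request message format defined by the Service
2) Convert the Python object into a JSON string with functions such as `json.dump()` / `json.dumps()`
Taking TextClassificationService as an example, the key code is: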
```python
import json
import httplib  # Python 2 module; in Python 3 it is named http.client

# Connect to server
conn = httplib.HTTPConnection("127.0.0.1", 8010)
# samples is a list in which each element is a list of ids:
# samples[0] = [190, 1, 70, 382, 914, 5146, 190...]
for i in range(0, len(samples) - BATCH_SIZE, BATCH_SIZE):
    # Build the batch of prediction data
    batch = samples[i: i + BATCH_SIZE]
    ids = []
    for x in batch:
        ids.append({"ids" : x})
    ids = {"instances": ids}
    # Convert the Python object to JSON
    request_json = json.dumps(ids)
    # Call the HTTP service and print the response
    try:
        conn.request('POST', "/TextClassificationService/inference", request_json, {"Content-Type": "application/json"})
        response = conn.getresponse()
        print(response.read())
    except httplib.HTTPException as e:
        print(e)
```
For the complete example, see [text_classification.py](https://github.com/PaddlePaddle/Serving/blob/develop/tools/cpp_examples/demo-client/python/text_classification.py)
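The request body can be built and inspected without a running server. The sketch below isolates that step; `build_request_json` and `iter_batches` are hypothetical helpers and the sample ids are made up. Unlike the loop in the snippet above, `iter_batches` also yields the final partial batch.

```python
import json


def build_request_json(batch):
    """Wrap a batch of id lists in the {"instances": [{"ids": ...}]} layout."""
    instances = [{"ids": ids} for ids in batch]
    return json.dumps({"instances": instances})


def iter_batches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


samples = [[190, 1, 70], [382, 914], [5146]]  # made-up id lists
for batch in iter_batches(samples, 2):
    print(build_request_json(batch))
```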
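## 3. Accessing HTTP Serving from PHP
Building the JSON string in PHP takes the following steps:
1) Build a PHP array following the Request message format defined by the Service
2) Convert the PHP array into a JSON string with the `json_encode()` function
Taking TextClassificationService as an example, the key code is: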
```PHP
function http_post(&$ch, $data) {
    // array to json string
    $data_string = json_encode($data);
    // attach the POST data
    curl_setopt($ch, CURLOPT_POSTFIELDS, $data_string);
    // set header
    curl_setopt($ch,
                CURLOPT_HTTPHEADER,
                array(
                    'Content-Length: ' . strlen($data_string)
                )
    );
    // execute the request
    $result = curl_exec($ch);
    return $result;
}
$ch = &http_connect('http://127.0.0.1:8010/TextClassificationService/inference');
$count = 0;
# $samples is a 2-level array in which each element is an array like this:
# $samples[0] = array(
#     "ids" => array(
#         [0] => int(190),
#         [1] => int(1),
#         [2] => int(70),
#         [3] => int(382),
#         [4] => int(914),
#         [5] => int(5146),
#         [6] => int(190)...)
# )
for ($i = 0; $i < count($samples) - BATCH_SIZE; $i += BATCH_SIZE) {
    $instances = array_slice($samples, $i, BATCH_SIZE);
    echo http_post($ch, array("instances" => $instances)) . "\n";
}
curl_close($ch);
```
For the complete code, see [text_classification.php](https://github.com/PaddlePaddle/Serving/blob/develop/tools/cpp_examples/demo-client/php/text_classification.php)
# Model Ensemble in Paddle Serving
([简体中文](MODEL_ENSEMBLE_IN_PADDLE_SERVING_CN.md)|English)
In some scenarios, multiple models that take the same input are run in parallel and their predictions are combined to get a better result. Paddle Serving supports this feature.
Next, we take the text classification task as an example to show model ensemble in Paddle Serving (for the time being the models are still run serially; parallel prediction will be supported as soon as possible).
## Simple example
In this example (see the figure below), the server side runs the BOW and CNN models on the same input within one service, and the client side fetches the prediction results of the two models and post-processes them to obtain the final prediction result.
![simple example](../images/model_ensemble_example.png)
Note that, at present, a service only supports multiple models whose input and output formats are the same. In this example, the input and output formats of the CNN and BOW models are the same.
The code used in the example is saved in the `python/examples/imdb` path:
```shell
.
├── get_data.sh
├── imdb_reader.py
├── test_ensemble_client.py
└── test_ensemble_server.py
```
### Prepare data
Get the pre-trained CNN and BOW models by the following command (you can also run the `get_data.sh` script):
```shell
wget --no-check-certificate https://fleet.bj.bcebos.com/text_classification_data.tar.gz
wget --no-check-certificate https://paddle-serving.bj.bcebos.com/imdb-demo/imdb_model.tar.gz
tar -zxvf text_classification_data.tar.gz
tar -zxvf imdb_model.tar.gz
```
### Start server
Start the server with the following Python code (you can also run the `test_ensemble_server.py` script):
```python
from paddle_serving_server import OpMaker
from paddle_serving_server import OpGraphMaker
from paddle_serving_server import Server

op_maker = OpMaker()
read_op = op_maker.create('general_reader')
cnn_infer_op = op_maker.create(
    'general_infer', engine_name='cnn', inputs=[read_op])
bow_infer_op = op_maker.create(
    'general_infer', engine_name='bow', inputs=[read_op])
response_op = op_maker.create(
    'general_response', inputs=[cnn_infer_op, bow_infer_op])

op_graph_maker = OpGraphMaker()
op_graph_maker.add_op(read_op)
op_graph_maker.add_op(cnn_infer_op)
op_graph_maker.add_op(bow_infer_op)
op_graph_maker.add_op(response_op)

server = Server()
server.set_op_graph(op_graph_maker.get_op_graph())
model_config = {cnn_infer_op: 'imdb_cnn_model', bow_infer_op: 'imdb_bow_model'}
server.load_model_config(model_config)
server.prepare_server(workdir="work_dir1", port=9393, device="cpu")
server.run_server()
```
Unlike the normal prediction service, here we need to use a DAG to describe the logic of the server side.
When creating an Op, you need to specify the predecessors of the current Op (in this example, the predecessor of both `cnn_infer_op` and `bow_infer_op` is `read_op`, and the predecessors of `response_op` are `cnn_infer_op` and `bow_infer_op`). For an infer Op `infer_op`, you also need to define the prediction engine name `engine_name` (you can use the default value, but setting it is recommended so that the client side can tell the prediction results apart).
In addition, when configuring the model paths, you need to create a model configuration dictionary with each infer Op as the key and the corresponding model path as the value, to tell Serving which model each infer Op uses.
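The predecessor relations of this example form a small DAG. The sketch below is independent of the Paddle Serving API and is not how Serving schedules Ops; it only derives one valid execution order from the predecessor lists. `topological_order` is a hypothetical helper.

```python
def topological_order(predecessors):
    """Return node names so that every node comes after its predecessors."""
    remaining = {node: set(pres) for node, pres in predecessors.items()}
    order = []
    while remaining:
        # Nodes whose predecessors have all been emitted are ready to run.
        ready = sorted(n for n, pres in remaining.items() if not pres)
        if not ready:
            raise ValueError("the graph contains a cycle")
        for node in ready:
            order.append(node)
            del remaining[node]
        for pres in remaining.values():
            pres.difference_update(ready)
    return order


dag = {
    "read_op": [],
    "cnn_infer_op": ["read_op"],
    "bow_infer_op": ["read_op"],
    "response_op": ["cnn_infer_op", "bow_infer_op"],
}
print(topological_order(dag))
```

The two infer Ops become ready in the same round, which is the point at which they could in principle run side by side.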
### Start client
Start the client with the following Python code (you can also run the `test_ensemble_client.py` script):
```python
from paddle_serving_client import Client
from imdb_reader import IMDBDataset

client = Client()
# If you have more than one model, make sure that the input
# and output of more than one model are the same.
client.load_client_config('imdb_bow_client_conf/serving_client_conf.prototxt')
client.connect(["127.0.0.1:9393"])

# you can define any english sentence or dataset here
# This example reuses imdb reader in training, you
# can define your own data preprocessing easily.
imdb_dataset = IMDBDataset()
imdb_dataset.load_resource('imdb.vocab')

for i in range(3):
    line = 'i am very sad | 0'
    word_ids, label = imdb_dataset.get_words_and_label(line)
    feed = {"words": word_ids}
    fetch = ["acc", "cost", "prediction"]
    fetch_maps = client.predict(feed=feed, fetch=fetch)
    if len(fetch_maps) == 1:
        print("step: {}, res: {}".format(i, fetch_maps['prediction'][0][1]))
    else:
        for model, fetch_map in fetch_maps.items():
            print("step: {}, model: {}, res: {}".format(i, model, fetch_map[
                'prediction'][0][1]))
```
Compared with the normal prediction service, the client side changes very little. When multiple models are used, the prediction service returns a dictionary whose keys are the engine names `engine_name` (defined on the server side) and whose values are the corresponding model prediction results.
### Expected result
```shell
step: 0, model: cnn, res: 0.560272455215
step: 0, model: bow, res: 0.633530199528
step: 1, model: cnn, res: 0.560272455215
step: 1, model: bow, res: 0.633530199528
step: 2, model: cnn, res: 0.560272455215
step: 2, model: bow, res: 0.633530199528
```
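The document leaves the client-side post-processing open. One simple option, shown here only as an assumption, is to average the per-model scores. `average_predictions` is a hypothetical helper; the index-1 values are copied from the expected output above, and the index-0 values are filled in as 1 - p purely for illustration.

```python
def average_predictions(fetch_maps):
    """Average the index-1 prediction score over all models."""
    scores = [fetch_map["prediction"][0][1] for fetch_map in fetch_maps.values()]
    return sum(scores) / len(scores)


# Same {engine_name: fetch_map} layout that the client code iterates over.
fetch_maps = {
    "cnn": {"prediction": [[0.439727544785, 0.560272455215]]},
    "bow": {"prediction": [[0.366469800472, 0.633530199528]]},
}
print(average_predictions(fetch_maps))
```

Other combination rules, such as weighted averaging or voting, fit the same dictionary layout.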
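# Model Ensemble in Paddle Serving
(Simplified Chinese|[English](MODEL_ENSEMBLE_IN_PADDLE_SERVING.md))
In some scenarios, multiple models that take the same input are run in parallel as an ensemble to get a better prediction result. Paddle Serving provides this feature.
The following uses the text classification task as an example to show the model ensemble feature of Paddle Serving (for now the models are still run serially; parallel prediction will be supported as soon as possible).
## Model ensemble example
In this example (see the figure below), the server side runs the BOW and CNN models on the same input within one service, and the client side fetches the prediction results of the two models and post-processes them to obtain the final prediction result.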
![simple example](../images/model_ensemble_example.png)
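Note that, at present, a service only supports multiple models whose input and output formats are the same. In this example, the input and output formats of the CNN and BOW models are the same.
The code used in the example is stored under the `python/examples/imdb` path: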
```shell
.
├── get_data.sh
├── imdb_reader.py
├── test_ensemble_client.py
└── test_ensemble_server.py
```
### Prepare data
Get the pre-trained CNN and BOW models with the following commands (you can also run the `get_data.sh` script directly):
```shell
wget --no-check-certificate https://fleet.bj.bcebos.com/text_classification_data.tar.gz
wget --no-check-certificate https://paddle-serving.bj.bcebos.com/imdb-demo/imdb_model.tar.gz
tar -zxvf text_classification_data.tar.gz
tar -zxvf imdb_model.tar.gz
```
### Start server
Start the server with the following Python code (you can also run the `test_ensemble_server.py` script directly):
```python
from paddle_serving_server import OpMaker
from paddle_serving_server import OpGraphMaker
from paddle_serving_server import Server

op_maker = OpMaker()
read_op = op_maker.create('general_reader')
cnn_infer_op = op_maker.create(
    'general_infer', engine_name='cnn', inputs=[read_op])
bow_infer_op = op_maker.create(
    'general_infer', engine_name='bow', inputs=[read_op])
response_op = op_maker.create(
    'general_response', inputs=[cnn_infer_op, bow_infer_op])

op_graph_maker = OpGraphMaker()
op_graph_maker.add_op(read_op)
op_graph_maker.add_op(cnn_infer_op)
op_graph_maker.add_op(bow_infer_op)
op_graph_maker.add_op(response_op)

server = Server()
server.set_op_graph(op_graph_maker.get_op_graph())
model_config = {cnn_infer_op: 'imdb_cnn_model', bow_infer_op: 'imdb_bow_model'}
server.load_model_config(model_config)
server.prepare_server(workdir="work_dir1", port=9393, device="cpu")
server.run_server()
```
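Unlike the normal prediction service, here we need to use a DAG to describe the logic of the server side.
When creating an Op, you need to specify the predecessors of the current Op (in this example, the predecessor of both `cnn_infer_op` and `bow_infer_op` is `read_op`, and the predecessors of `response_op` are `cnn_infer_op` and `bow_infer_op`). For a prediction Op `infer_op`, you also need to define the prediction engine name `engine_name` (the default value can be used, but setting it is recommended so that the client side can retrieve the prediction results more easily).
In addition, when configuring the model paths, you need to create a model configuration dictionary with each prediction Op as the key and the corresponding model path as the value, to tell Serving which model each prediction Op uses.
### Start client
Run the client with the following Python code (you can also run the `test_ensemble_client.py` script directly):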
```python
from paddle_serving_client import Client
from imdb_reader import IMDBDataset

client = Client()
# If you have more than one model, make sure that the input
# and output of more than one model are the same.
client.load_client_config('imdb_bow_client_conf/serving_client_conf.prototxt')
client.connect(["127.0.0.1:9393"])

# you can define any english sentence or dataset here
# This example reuses imdb reader in training, you
# can define your own data preprocessing easily.
imdb_dataset = IMDBDataset()
imdb_dataset.load_resource('imdb.vocab')

for i in range(3):
    line = 'i am very sad | 0'
    word_ids, label = imdb_dataset.get_words_and_label(line)
    feed = {"words": word_ids}
    fetch = ["acc", "cost", "prediction"]
    fetch_maps = client.predict(feed=feed, fetch=fetch)
    if len(fetch_maps) == 1:
        print("step: {}, res: {}".format(i, fetch_maps['prediction'][0][1]))
    else:
        for model, fetch_map in fetch_maps.items():
            print("step: {}, model: {}, res: {}".format(i, model, fetch_map[
                'prediction'][0][1]))
```
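Compared with the normal prediction service, the client side changes very little. When multiple models are used, the prediction service returns a dictionary whose keys are the engine names `engine_name` defined on the server side and whose values are the corresponding model prediction results.
### Expected result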
```txt
step: 0, model: cnn, res: 0.560272455215
step: 0, model: bow, res: 0.633530199528
step: 1, model: cnn, res: 0.560272455215
step: 1, model: bow, res: 0.633530199528
step: 2, model: cnn, res: 0.560272455215
step: 2, model: bow, res: 0.633530199528
```