未验证 提交 e5615b32 编写于 作者: T TeslaZhao 提交者: GitHub

Merge pull request #1791 from TeslaZhao/v0.9.0

V0.9.0 cherry-pick
([简体中文](./README_CN.md)|English)
(简体中文|[English](./README.md))
<p align="center">
<br>
......@@ -24,33 +24,36 @@
***
The goal of Paddle Serving is to provide high-performance, flexible and easy-to-use industrial-grade online inference services for machine learning developers and enterprises.Paddle Serving supports multiple protocols such as RESTful, gRPC, bRPC, and provides inference solutions under a variety of hardware and multiple operating system environments, and many famous pre-trained model examples. The core features are as follows:
Paddle Serving 依托深度学习框架 PaddlePaddle 旨在帮助深度学习开发者和企业提供高性能、灵活易用的工业级在线推理服务。Paddle Serving 支持 RESTful、gRPC、bRPC 等多种协议,提供多种异构硬件和多种操作系统环境下推理解决方案,和多种经典预训练模型示例。核心特性如下:
- 集成高性能服务端推理引擎 [Paddle Inference](https://paddleinference.paddlepaddle.org.cn/product_introduction/inference_intro.html) 和端侧引擎 [Paddle Lite](https://paddlelite.paddlepaddle.org.cn/introduction/tech_highlights.html),其他机器学习平台(Caffe/TensorFlow/ONNX/PyTorch)可通过 [x2paddle](https://github.com/PaddlePaddle/X2Paddle) 工具迁移模型
- 具有高性能 C++ Serving 和高易用 Python Pipeline 2套框架。C++ Serving 基于高性能 bRPC 网络框架打造高吞吐、低延迟的推理服务,性能领先竞品。Python Pipeline 基于 gRPC/gRPC-Gateway 网络框架和 Python 语言构建高易用、高吞吐推理服务框架。技术选型参考[技术选型](doc/Serving_Design_CN.md#21-设计选型)
- 支持 HTTP、gRPC、bRPC 等多种[协议](doc/C++_Serving/Inference_Protocols_CN.md);提供 C++、Python、Java 语言 SDK
- 设计并实现基于有向无环图(DAG) 的异步流水线高性能推理框架,具有多模型组合、异步调度、并发推理、动态批量、多卡多流推理、请求缓存等特性
- 适配 x86(Intel) CPU、ARM CPU、Nvidia GPU、昆仑 XPU、华为昇腾310/910、海光 DCU、Nvidia Jetson 等多种硬件
- 集成 Intel MKLDNN、Nvidia TensorRT 加速库,以及低精度量化推理
- 提供一套模型安全部署解决方案,包括加密模型部署、鉴权校验、HTTPs 安全网关,并在实际项目中应用
- 支持云端部署,提供百度云智能云 kubernetes 集群部署 Paddle Serving 案例
- 提供丰富的经典模型部署示例,如 PaddleOCR、PaddleClas、PaddleDetection、PaddleSeg、PaddleNLP、PaddleRec 等套件,共计40+个预训练精品模型
- 支持大规模稀疏参数索引模型分布式部署,具有多表、多分片、多副本、本地高频 cache 等特性、可单机或云端部署
- 支持服务监控,提供基于普罗米修斯的性能数据统计及端口访问
- Integrate high-performance server-side inference engine paddle Inference and mobile-side engine paddle Lite. Models of other machine learning platforms (Caffe/TensorFlow/ONNX/PyTorch) can be migrated to paddle through [x2paddle](https://github.com/PaddlePaddle/X2Paddle).
- There are two frameworks, namely high-performance C++ Serving and high-easy-to-use Python pipeline. The C++ Serving is based on the bRPC network framework to create a high-throughput, low-latency inference service, and its performance indicators are ahead of competing products. The Python pipeline is based on the gRPC/gRPC-Gateway network framework and the Python language to build a highly easy-to-use and high-throughput inference service. How to choose which one please see [Techinical Selection](doc/Serving_Design_EN.md#21-design-selection).
- Support multiple [protocols](doc/C++_Serving/Inference_Protocols_CN.md) such as HTTP, gRPC, bRPC, and provide C++, Python, Java language SDK.
- Design and implement a high-performance inference service framework for asynchronous pipelines based on directed acyclic graph (DAG), with features such as multi-model combination, asynchronous scheduling, concurrent inference, dynamic batch, multi-card multi-stream inference, request cache, etc.
- Adapt to a variety of commonly used computing hardwares, such as x86 (Intel) CPU, ARM CPU, Nvidia GPU, Kunlun XPU, HUAWEI Ascend 310/910, HYGON DCU、Nvidia Jetson etc.
- Integrate acceleration libraries of Intel MKLDNN and Nvidia TensorRT, and low-precision and quantitative inference.
- Provide a model security deployment solution, including encryption model deployment, and authentication mechanism, HTTPs security gateway, which is used in practice.
- Support cloud deployment, provide a deployment case of Baidu Cloud Intelligent Cloud kubernetes cluster.
- Provide more than 40 classic pre-model deployment examples, such as PaddleOCR, PaddleClas, PaddleDetection, PaddleSeg, PaddleNLP, PaddleRec and other suites, and more models continue to expand.
- Supports distributed deployment of large-scale sparse parameter index models, with features such as multiple tables, multiple shards, multiple copies, local high-frequency cache, etc., and can be deployed on a single machine or clouds.
- Support service monitoring, provide prometheus-based performance statistics and port access
<h2 align="center">教程与案例</h2>
<h2 align="center">Tutorial and Papers</h2>
- AIStudio 使用教程 : [Paddle Serving服务化部署框架](https://www.paddlepaddle.org.cn/tutorials/projectdetail/3946013)
- AIStudio OCR 实战 : [基于Paddle Serving的OCR服务化部署实战](https://aistudio.baidu.com/aistudio/projectdetail/3630726)
- 视频教程 : [深度学习服务化部署-以互联网应用为例](https://aistudio.baidu.com/aistudio/course/introduce/19084)
- 边缘 AI 解决方案 : [基于Paddle Serving&百度智能边缘BIE的边缘AI解决方案](https://mp.weixin.qq.com/s/j0EVlQXaZ7qmoz9Fv96Yrw)
- 政务问答解决方案 : [政务问答检索式 FAQ System](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/question_answering/faq_system)
- 智能问答解决方案 : [保险智能问答](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/question_answering/faq_finance)
- 语义索引解决方案 : [In-batch Negatives](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/neural_search/recall/in_batch_negative)
<h2 align="center">论文</h2>
- AIStudio tutorial(Chinese) : [Paddle Serving服务化部署框架](https://www.paddlepaddle.org.cn/tutorials/projectdetail/3946013)
- AIStudio OCR practice(Chinese) : [基于PaddleServing的OCR服务化部署实战](https://aistudio.baidu.com/aistudio/projectdetail/3630726)
- Video tutorial(Chinese) : [深度学习服务化部署-以互联网应用为例](https://aistudio.baidu.com/aistudio/course/introduce/19084)
- Edge AI solution(Chinese) : [基于Paddle Serving&百度智能边缘BIE的边缘AI解决方案](https://mp.weixin.qq.com/s/j0EVlQXaZ7qmoz9Fv96Yrw)
- Paper : [JiZhi: A Fast and Cost-Effective Model-As-A-Service System for
- 论文 : [JiZhi: A Fast and Cost-Effective Model-As-A-Service System for
Web-Scale Online Inference at Baidu](https://arxiv.org/pdf/2106.01674.pdf)
- Paper : [ERNIE 3.0 TITAN: EXPLORING LARGER-SCALE KNOWLEDGE
- 论文 : [ERNIE 3.0 TITAN: EXPLORING LARGER-SCALE KNOWLEDGE
ENHANCED PRE-TRAINING FOR LANGUAGE UNDERSTANDING
AND GENERATION](https://arxiv.org/pdf/2112.12731.pdf)
......@@ -58,116 +61,115 @@ AND GENERATION](https://arxiv.org/pdf/2112.12731.pdf)
<img src="doc/images/demo.gif" width="700">
</p>
<h2 align="center">Documentation</h2>
> Set up
This chapter guides you through the installation and deployment steps. It is strongly recommended to use Docker to deploy Paddle Serving. If you do not use docker, ignore the docker-related steps. Paddle Serving can be deployed on cloud servers using Kubernetes, running on many commonly hardwares such as ARM CPU, Intel CPU, Nvidia GPU, Kunlun XPU. The latest development kit of the develop branch is compiled and generated every day for developers to use.
- [Install Paddle Serving using docker](doc/Install_EN.md)
- [Build Paddle Serving from Source with Docker](doc/Compile_EN.md)
- [Deploy Paddle Serving on Kubernetes(Chinese)](doc/Run_On_Kubernetes_CN.md)
- [Deploy Paddle Serving with Security gateway(Chinese)](doc/Serving_Auth_Docker_CN.md)
- Deploy on more hardwares[[ARM CPU、百度昆仑](doc/Run_On_XPU_EN.md)[华为昇腾](doc/Run_On_NPU_CN.md)[海光DCU](doc/Run_On_DCU_CN.md)[Jetson](doc/Run_On_JETSON_CN.md)]
- [Docker Images](doc/Docker_Images_EN.md)
- [Download Wheel packages](doc/Latest_Packages_EN.md)
> Use
The first step is to call the model save interface to generate a model parameter configuration file (.prototxt), which will be used on the client and server. The second step, read the configuration and startup parameters and start the service. According to API documents and your case, the third step is to write client requests based on the SDK, and test the inference service.
- [Quick Start](doc/Quick_Start_EN.md)
- [Save a servable model](doc/Save_EN.md)
- [Description of configuration and startup parameters](doc/Serving_Configure_EN.md)
- [Guide for RESTful/gRPC/bRPC APIs(Chinese)](doc/C++_Serving/Introduction_CN.md#42-多语言多协议Client)
- [Infer on quantizative models](doc/Low_Precision_EN.md)
- [Data format of classic models(Chinese)](doc/Process_data_CN.md)
- [Prometheus(Chinese)](doc/Prometheus_CN.md)
- [C++ Serving(Chinese)](doc/C++_Serving/Introduction_CN.md)
- [Protocols(Chinese)](doc/C++_Serving/Inference_Protocols_CN.md)
- [Hot loading models](doc/C++_Serving/Hot_Loading_EN.md)
- [A/B Test](doc/C++_Serving/ABTest_EN.md)
- [Encryption](doc/C++_Serving/Encryption_EN.md)
- [Analyze and optimize performance(Chinese)](doc/C++_Serving/Performance_Tuning_CN.md)
- [Benchmark(Chinese)](doc/C++_Serving/Benchmark_CN.md)
- [Multiple models in series(Chinese)](doc/C++_Serving/2+_model.md)
- [Request Cache(Chinese)](doc/C++_Serving/Request_Cache_CN.md)
- [Python Pipeline](doc/Python_Pipeline/Pipeline_Design_EN.md)
- [Analyze and optimize performance](doc/Python_Pipeline/Performance_Tuning_EN.md)
- [TensorRT dynamic Shape](doc/TensorRT_Dynamic_Shape_EN.md)
- [Benchmark(Chinese)](doc/Python_Pipeline/Benchmark_CN.md)
- Client SDK
- [Python SDK(Chinese)](doc/C++_Serving/Introduction_CN.md#42-多语言多协议Client)
- [JAVA SDK](doc/Java_SDK_EN.md)
- [C++ SDK(Chinese)](doc/C++_Serving/Introduction_CN.md#42-多语言多协议Client)
- [Large-scale sparse parameter server](doc/Cube_Local_EN.md)
<br>
> Developers
For Paddle Serving developers, we provide extended documents such as custom OP, level of detail(LOD) processing.
- [Custom Operators](doc/C++_Serving/OP_EN.md)
- [Processing LoD Data](doc/LOD_EN.md)
- [FAQ(Chinese)](doc/FAQ_CN.md)
<h2 align="center">Model Zoo</h2>
Paddle Serving works closely with the Paddle model suite, and implements a large number of service deployment examples, including image classification, object detection, language and text recognition, Chinese part of speech, sentiment analysis, content recommendation and other types of examples, for a total of 46 models.
<h2 align="center">文档</h2>
> 部署
此章节引导您完成安装和部署步骤,强烈推荐使用Docker部署Paddle Serving,如您不使用docker,省略docker相关步骤。在云服务器上可以使用Kubernetes部署Paddle Serving。在异构硬件如ARM CPU、昆仑XPU上编译或使用Paddle Serving可阅读以下文档。每天编译生成develop分支的最新开发包供开发者使用。
- [使用 Docker 安装 Paddle Serving](doc/Install_CN.md)
- [Linux 原生系统安装 Paddle Serving](doc/Install_Linux_Env_CN.md)
- [源码编译安装 Paddle Serving](doc/Compile_CN.md)
- [Kuberntes集群部署 Paddle Serving](doc/Run_On_Kubernetes_CN.md)
- [部署 Paddle Serving 安全网关](doc/Serving_Auth_Docker_CN.md)
- 异构硬件部署[[ARM CPU、百度昆仑](doc/Run_On_XPU_CN.md)[华为昇腾](doc/Run_On_NPU_CN.md)[海光DCU](doc/Run_On_DCU_CN.md)[Jetson](doc/Run_On_JETSON_CN.md)]
- [Docker 镜像列表](doc/Docker_Images_CN.md)
- [下载 Python Wheels](doc/Latest_Packages_CN.md)
> 使用
安装Paddle Serving后,使用快速开始将引导您运行Serving。第一步,调用模型保存接口,生成模型参数配置文件(.prototxt)用以在客户端和服务端使用;第二步,阅读配置和启动参数并启动服务;第三步,根据API和您的使用场景,基于SDK编写客户端请求,并测试推理服务。您想了解跟多特性的使用场景和方法,请详细阅读以下文档。
- [快速开始](doc/Quick_Start_CN.md)
- [保存用于Paddle Serving的模型和配置](doc/Save_CN.md)
- [配置和启动参数的说明](doc/Serving_Configure_CN.md)
- [RESTful/gRPC/bRPC API指南](doc/C++_Serving/Introduction_CN.md#42-多语言多协议Client)
- [低精度推理](doc/Low_Precision_CN.md)
- [常见模型数据处理](doc/Process_data_CN.md)
- [普罗米修斯](doc/Prometheus_CN.md)
- [设置 TensorRT 动态shape](doc/TensorRT_Dynamic_Shape_CN.md)
- [C++ Serving 概述](doc/C++_Serving/Introduction_CN.md)
- [异步框架](doc/C++_Serving/Asynchronous_Framwork_CN.md)
- [协议](doc/C++_Serving/Inference_Protocols_CN.md)
- [模型热加载](doc/C++_Serving/Hot_Loading_CN.md)
- [A/B Test](doc/C++_Serving/ABTest_CN.md)
- [加密模型推理服务](doc/C++_Serving/Encryption_CN.md)
- [性能优化指南](doc/C++_Serving/Performance_Tuning_CN.md)
- [性能指标](doc/C++_Serving/Benchmark_CN.md)
- [多模型串联](doc/C++_Serving/2+_model.md)
- [请求缓存](doc/C++_Serving/Request_Cache_CN.md)
- [Python Pipeline 概述](doc/Python_Pipeline/Pipeline_Int_CN.md)
- [框架设计](doc/Python_Pipeline/Pipeline_Design_CN.md)
- [核心功能](doc/Python_Pipeline/Pipeline_Features_CN.md)
- [性能优化](doc/Python_Pipeline/Pipeline_Optimize_CN.md)
- [性能指标](doc/Python_Pipeline/Pipeline_Benchmark_CN.md)
- 客户端SDK
- [Python SDK](doc/C++_Serving/Introduction_CN.md#42-多语言多协议Client)
- [JAVA SDK](doc/Java_SDK_CN.md)
- [C++ SDK](doc/C++_Serving/Introduction_CN.md#42-多语言多协议Client)
- [大规模稀疏参数索引服务](doc/Cube_Local_CN.md)
> 开发者
为Paddle Serving开发者,提供自定义OP,变长数据处理。
- [自定义OP](doc/C++_Serving/OP_CN.md)
- [变长数据(LoD)处理](doc/LOD_CN.md)
- [常见问答](doc/FAQ_CN.md)
<h2 align="center">模型库</h2>
Paddle Serving与Paddle模型套件紧密配合,实现大量服务化部署,包括图像分类、物体检测、语言文本识别、中文词性、情感分析、内容推荐等多种类型示例,以及Paddle全链条项目,共计47个模型。
<p align="center">
| PaddleOCR | PaddleDetection | PaddleClas | PaddleSeg | PaddleRec | Paddle NLP | Paddle Video |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| 8 | 12 | 14 | 2 | 3 | 6 | 1|
| 8 | 12 | 14 | 2 | 3 | 7 | 1 |
</p>
For more model examples, read [Model zoo](doc/Model_Zoo_EN.md)
更多模型示例进入[模型库](doc/Model_Zoo_CN.md)
<p align="center">
<img src="https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/doc/imgs_results/PP-OCRv2/PP-OCRv2-pic003.jpg?raw=true" width="345"/>
<img src="https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/doc/imgs_results/PP-OCRv2/PP-OCRv2-pic003.jpg?raw=true" width="345"/>
<img src="doc/images/detection.png" width="350">
</p>
<h2 align="center">社区</h2>
<h2 align="center">Community</h2>
If you want to communicate with developers and other users? Welcome to join us, join the community through the following methods below.
您想要同开发者和其他用户沟通吗?欢迎加入我们,通过如下方式加入社群
### Wechat
- WeChat scavenging
### 微信
- 微信用户请扫码
<p align="center">
<img src="doc/images/wechat_group_1.jpeg" width="250">
</p>
### QQ
- QQ Group(Group No.:697765514)
- 飞桨推理部署交流群(Group No.:697765514)
<p align="center">
<img src="doc/images/qq_group_1.png" width="200">
</p>
> Contribution
> 贡献代码
If you want to contribute code to Paddle Serving, please reference [Contribution Guidelines](doc/Contribute_EN.md)
- Thanks to [@loveululu](https://github.com/loveululu) for providing python API of Cube.
- Thanks to [@EtachGu](https://github.com/EtachGu) in updating run docker codes.
- Thanks to [@BeyondYourself](https://github.com/BeyondYourself) in complementing the gRPC tutorial, updating the FAQ doc and modifying the mdkir command
- Thanks to [@mcl-stone](https://github.com/mcl-stone) in updating faster_rcnn benchmark
- Thanks to [@cg82616424](https://github.com/cg82616424) in updating the unet benchmark modifying resize comment error
- Thanks to [@cuicheng01](https://github.com/cuicheng01) for providing 11 PaddleClas models
- Thanks to [@Jiaqi Liu](https://github.com/LiuChiachi) for supporting prediction for string list input
- Thanks to [@Bin Lu](https://github.com/Intsigstephon) for adding pp-shitu example
如果您想为Paddle Serving贡献代码,请参考 [Contribution Guidelines(English)](doc/Contribute_EN.md)
- 感谢 [@w5688414](https://github.com/w5688414) 提供 NLP Ernie Indexing 案例
- 感谢 [@loveululu](https://github.com/loveululu) 提供 Cube python API
- 感谢 [@EtachGu](https://github.com/EtachGu) 更新 docker 使用命令
- 感谢 [@BeyondYourself](https://github.com/BeyondYourself) 提供grpc教程,更新FAQ教程,整理文件目录。
- 感谢 [@mcl-stone](https://github.com/mcl-stone) 提供faster rcnn benchmark脚本
- 感谢 [@cg82616424](https://github.com/cg82616424) 提供unet benchmark脚本和修改部分注释错误
- 感谢 [@cuicheng01](https://github.com/cuicheng01) 提供PaddleClas的11个模型
- 感谢 [@Jiaqi Liu](https://github.com/LiuChiachi) 新增list[str]类型输入的预测支持
- 感谢 [@Bin Lu](https://github.com/Intsigstephon) 提供PP-Shitu C++模型示例
> Feedback
> 反馈
For any feedback or to report a bug, please propose a [GitHub Issue](https://github.com/PaddlePaddle/Serving/issues).
如有任何反馈或是bug,请在 [GitHub Issue](https://github.com/PaddlePaddle/Serving/issues)提交
> License
[Apache 2.0 License](https://github.com/PaddlePaddle/Serving/blob/develop/LICENSE)
[Apache 2.0 License](https://github.com/PaddlePaddle/Serving/blob/develop/LICENSE)
\ No newline at end of file
(简体中文|[English](./README.md))
([简体中文](./README_CN.md)|English)
<p align="center">
<br>
......@@ -24,31 +24,37 @@
***
Paddle Serving依托深度学习框架PaddlePaddle旨在帮助深度学习开发者和企业提供高性能、灵活易用的工业级在线推理服务。Paddle Serving支持RESTful、gRPC、bRPC等多种协议,提供多种异构硬件和多种操作系统环境下推理解决方案,和多种经典预训练模型示例。核心特性如下:
The goal of Paddle Serving is to provide high-performance, flexible and easy-to-use industrial-grade online inference services for machine learning developers and enterprises.Paddle Serving supports multiple protocols such as RESTful, gRPC, bRPC, and provides inference solutions under a variety of hardware and multiple operating system environments, and many famous pre-trained model examples. The core features are as follows:
- 集成高性能服务端推理引擎paddle Inference和移动端引擎paddle Lite,其他机器学习平台(Caffe/TensorFlow/ONNX/PyTorch)可通过[x2paddle](https://github.com/PaddlePaddle/X2Paddle)工具迁移模型
- 具有高性能C++和高易用Python 2套框架。C++框架基于高性能bRPC网络框架打造高吞吐、低延迟的推理服务,性能领先竞品。Python框架基于gRPC/gRPC-Gateway网络框架和Python语言构建高易用、高吞吐推理服务框架。技术选型参考[技术选型](doc/Serving_Design_CN.md#21-设计选型)
- 支持HTTP、gRPC、bRPC等多种[协议](doc/C++_Serving/Inference_Protocols_CN.md);提供C++、Python、Java语言SDK
- 设计并实现基于有向无环图(DAG)的异步流水线高性能推理框架,具有多模型组合、异步调度、并发推理、动态批量、多卡多流推理、请求缓存等特性
- 适配x86(Intel) CPU、ARM CPU、Nvidia GPU、昆仑XPU、华为昇腾310/910、海光DCU、Nvidia Jetson等多种硬件
- 集成Intel MKLDNN、Nvidia TensorRT加速库,以及低精度和量化推理
- 提供一套模型安全部署解决方案,包括加密模型部署、鉴权校验、HTTPs安全网关,并在实际项目中应用
- 支持云端部署,提供百度云智能云kubernetes集群部署Paddle Serving案例
- 提供丰富的经典模型部署示例,如PaddleOCR、PaddleClas、PaddleDetection、PaddleSeg、PaddleNLP、PaddleRec等套件,共计40+个预训练精品模型
- 支持大规模稀疏参数索引模型分布式部署,具有多表、多分片、多副本、本地高频cache等特性、可单机或云端部署
- 支持服务监控,提供基于普罗米修斯的性能数据统计及端口访问
- Integrate high-performance server-side inference engine [Paddle Inference](https://paddleinference.paddlepaddle.org.cn/product_introduction/inference_intro.html) and mobile-side engine [Paddle Lite](https://paddlelite.paddlepaddle.org.cn/introduction/tech_highlights.html). Models of other machine learning platforms (Caffe/TensorFlow/ONNX/PyTorch) can be migrated to paddle through [x2paddle](https://github.com/PaddlePaddle/X2Paddle).
- There are two frameworks, namely high-performance C++ Serving and high-easy-to-use Python pipeline. The C++ Serving is based on the bRPC network framework to create a high-throughput, low-latency inference service, and its performance indicators are ahead of competing products. The Python pipeline is based on the gRPC/gRPC-Gateway network framework and the Python language to build a highly easy-to-use and high-throughput inference service. How to choose which one please see [Techinical Selection](doc/Serving_Design_EN.md#21-design-selection).
- Support multiple [protocols](doc/C++_Serving/Inference_Protocols_CN.md) such as HTTP, gRPC, bRPC, and provide C++, Python, Java language SDK.
- Design and implement a high-performance inference service framework for asynchronous pipelines based on directed acyclic graph (DAG), with features such as multi-model combination, asynchronous scheduling, concurrent inference, dynamic batch, multi-card multi-stream inference, request cache, etc.
- Adapt to a variety of commonly used computing hardwares, such as x86 (Intel) CPU, ARM CPU, Nvidia GPU, Kunlun XPU, HUAWEI Ascend 310/910, HYGON DCU、Nvidia Jetson etc.
- Integrate acceleration libraries of Intel MKLDNN and Nvidia TensorRT, and low-precision and quantitative inference.
- Provide a model security deployment solution, including encryption model deployment, and authentication mechanism, HTTPs security gateway, which is used in practice.
- Support cloud deployment, provide a deployment case of Baidu Cloud Intelligent Cloud kubernetes cluster.
- Provide more than 40 classic pre-model deployment examples, such as PaddleOCR, PaddleClas, PaddleDetection, PaddleSeg, PaddleNLP, PaddleRec and other suites, and more models continue to expand.
- Supports distributed deployment of large-scale sparse parameter index models, with features such as multiple tables, multiple shards, multiple copies, local high-frequency cache, etc., and can be deployed on a single machine or clouds.
- Support service monitoring, provide prometheus-based performance statistics and port access
<h2 align="center">教程与论文</h2>
- AIStudio 使用教程 : [Paddle Serving服务化部署框架](https://www.paddlepaddle.org.cn/tutorials/projectdetail/3946013)
- AIStudio OCR实战 : [基于PaddleServing的OCR服务化部署实战](https://aistudio.baidu.com/aistudio/projectdetail/3630726)
- 视频教程 : [深度学习服务化部署-以互联网应用为例](https://aistudio.baidu.com/aistudio/course/introduce/19084)
- 边缘AI 解决方案 : [基于Paddle Serving&百度智能边缘BIE的边缘AI解决方案](https://mp.weixin.qq.com/s/j0EVlQXaZ7qmoz9Fv96Yrw)
<h2 align="center">Tutorial and Solutions</h2>
- 论文 : [JiZhi: A Fast and Cost-Effective Model-As-A-Service System for
- AIStudio tutorial(Chinese) : [Paddle Serving服务化部署框架](https://www.paddlepaddle.org.cn/tutorials/projectdetail/3946013)
- AIStudio OCR practice(Chinese) : [基于PaddleServing的OCR服务化部署实战](https://aistudio.baidu.com/aistudio/projectdetail/3630726)
- Video tutorial(Chinese) : [深度学习服务化部署-以互联网应用为例](https://aistudio.baidu.com/aistudio/course/introduce/19084)
- Edge AI solution(Chinese) : [基于Paddle Serving&百度智能边缘BIE的边缘AI解决方案](https://mp.weixin.qq.com/s/j0EVlQXaZ7qmoz9Fv96Yrw)
- GOVT Q&A Solution(Chinese) : [政务问答检索式 FAQ System](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/question_answering/faq_system)
- Smart Q&A Solution(Chinese) : [保险智能问答](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/question_answering/faq_finance)
- Semantic Indexing Solution(Chinese) : [In-batch Negatives](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/neural_search/recall/in_batch_negative)
<h2 align="center">Papers</h2>
- Paper : [JiZhi: A Fast and Cost-Effective Model-As-A-Service System for
Web-Scale Online Inference at Baidu](https://arxiv.org/pdf/2106.01674.pdf)
- 论文 : [ERNIE 3.0 TITAN: EXPLORING LARGER-SCALE KNOWLEDGE
- Paper : [ERNIE 3.0 TITAN: EXPLORING LARGER-SCALE KNOWLEDGE
ENHANCED PRE-TRAINING FOR LANGUAGE UNDERSTANDING
AND GENERATION](https://arxiv.org/pdf/2112.12731.pdf)
......@@ -56,109 +62,117 @@ AND GENERATION](https://arxiv.org/pdf/2112.12731.pdf)
<img src="doc/images/demo.gif" width="700">
</p>
<h2 align="center">文档</h2>
> 部署
此章节引导您完成安装和部署步骤,强烈推荐使用Docker部署Paddle Serving,如您不使用docker,省略docker相关步骤。在云服务器上可以使用Kubernetes部署Paddle Serving。在异构硬件如ARM CPU、昆仑XPU上编译或使用Paddle Serving可阅读以下文档。每天编译生成develop分支的最新开发包供开发者使用。
- [使用docker安装Paddle Serving](doc/Install_CN.md)
- [源码编译安装Paddle Serving](doc/Compile_CN.md)
- [在Kuberntes集群上部署Paddle Serving](doc/Run_On_Kubernetes_CN.md)
- [部署Paddle Serving安全网关](doc/Serving_Auth_Docker_CN.md)
- 异构硬件部署[[ARM CPU、百度昆仑](doc/Run_On_XPU_CN.md)[华为昇腾](doc/Run_On_NPU_CN.md)[海光DCU](doc/Run_On_DCU_CN.md)[Jetson](doc/Run_On_JETSON_CN.md)]
- [Docker镜像](doc/Docker_Images_CN.md)
- [下载Wheel包](doc/Latest_Packages_CN.md)
> 使用
安装Paddle Serving后,使用快速开始将引导您运行Serving。第一步,调用模型保存接口,生成模型参数配置文件(.prototxt)用以在客户端和服务端使用;第二步,阅读配置和启动参数并启动服务;第三步,根据API和您的使用场景,基于SDK编写客户端请求,并测试推理服务。您想了解跟多特性的使用场景和方法,请详细阅读以下文档。
- [快速开始](doc/Quick_Start_CN.md)
- [保存用于Paddle Serving的模型和配置](doc/Save_CN.md)
- [配置和启动参数的说明](doc/Serving_Configure_CN.md)
- [RESTful/gRPC/bRPC API指南](doc/C++_Serving/Introduction_CN.md#42-多语言多协议Client)
- [低精度推理](doc/Low_Precision_CN.md)
- [常见模型数据处理](doc/Process_data_CN.md)
- [普罗米修斯](doc/Prometheus_CN.md)
- [C++ Serving简介](doc/C++_Serving/Introduction_CN.md)
- [协议](doc/C++_Serving/Inference_Protocols_CN.md)
- [模型热加载](doc/C++_Serving/Hot_Loading_CN.md)
- [A/B Test](doc/C++_Serving/ABTest_CN.md)
- [加密模型推理服务](doc/C++_Serving/Encryption_CN.md)
- [性能优化指南](doc/C++_Serving/Performance_Tuning_CN.md)
- [性能指标](doc/C++_Serving/Benchmark_CN.md)
- [多模型串联](doc/C++_Serving/2+_model.md)
- [请求缓存](doc/C++_Serving/Request_Cache_CN.md)
- [Python Pipeline设计](doc/Python_Pipeline/Pipeline_Design_CN.md)
- [性能优化指南](doc/Python_Pipeline/Performance_Tuning_CN.md)
- [TensorRT动态shape](doc/TensorRT_Dynamic_Shape_CN.md)
- [性能指标](doc/Python_Pipeline/Benchmark_CN.md)
- 客户端SDK
- [Python SDK](doc/C++_Serving/Introduction_CN.md#42-多语言多协议Client)
- [JAVA SDK](doc/Java_SDK_CN.md)
- [C++ SDK](doc/C++_Serving/Introduction_CN.md#42-多语言多协议Client)
- [大规模稀疏参数索引服务](doc/Cube_Local_CN.md)
> 开发者
为Paddle Serving开发者,提供自定义OP,变长数据处理。
- [自定义OP](doc/C++_Serving/OP_CN.md)
- [变长数据(LoD)处理](doc/LOD_CN.md)
- [常见问答](doc/FAQ_CN.md)
<h2 align="center">模型库</h2>
Paddle Serving与Paddle模型套件紧密配合,实现大量服务化部署,包括图像分类、物体检测、语言文本识别、中文词性、情感分析、内容推荐等多种类型示例,以及Paddle全链条项目,共计45个模型。
<h2 align="center">Documentation</h2>
> Set up
This chapter guides you through the installation and deployment steps. It is strongly recommended to use Docker to deploy Paddle Serving. If you do not use docker, ignore the docker-related steps. Paddle Serving can be deployed on cloud servers using Kubernetes, running on many commonly hardwares such as ARM CPU, Intel CPU, Nvidia GPU, Kunlun XPU. The latest development kit of the develop branch is compiled and generated every day for developers to use.
- [Install Paddle Serving using docker](doc/Install_EN.md)
- [Build Paddle Serving from Source with Docker](doc/Compile_EN.md)
- [Install Paddle Serving on linux system](doc/Install_Linux_Env_CN.md)
- [Deploy Paddle Serving on Kubernetes(Chinese)](doc/Run_On_Kubernetes_CN.md)
- [Deploy Paddle Serving with Security gateway(Chinese)](doc/Serving_Auth_Docker_CN.md)
- Deploy on more hardwares[[ARM CPU、百度昆仑](doc/Run_On_XPU_EN.md)[华为昇腾](doc/Run_On_NPU_CN.md)[海光DCU](doc/Run_On_DCU_CN.md)[Jetson](doc/Run_On_JETSON_CN.md)]
- [Docker Images](doc/Docker_Images_EN.md)
- [Download Wheel packages](doc/Latest_Packages_EN.md)
> Use
The first step is to call the model save interface to generate a model parameter configuration file (.prototxt), which will be used on the client and server. The second step, read the configuration and startup parameters and start the service. According to API documents and your case, the third step is to write client requests based on the SDK, and test the inference service.
- [Quick Start](doc/Quick_Start_EN.md)
- [Save a servable model](doc/Save_EN.md)
- [Description of configuration and startup parameters](doc/Serving_Configure_EN.md)
- [Guide for RESTful/gRPC/bRPC APIs(Chinese)](doc/C++_Serving/Introduction_CN.md#42-多语言多协议Client)
- [Infer on quantizative models](doc/Low_Precision_EN.md)
- [Data format of classic models(Chinese)](doc/Process_data_CN.md)
- [Prometheus(Chinese)](doc/Prometheus_CN.md)
- [C++ Serving(Chinese)](doc/C++_Serving/Introduction_CN.md)
- [Protocols(Chinese)](doc/C++_Serving/Inference_Protocols_CN.md)
- [Hot loading models](doc/C++_Serving/Hot_Loading_EN.md)
- [A/B Test](doc/C++_Serving/ABTest_EN.md)
- [Encryption](doc/C++_Serving/Encryption_EN.md)
- [Analyze and optimize performance(Chinese)](doc/C++_Serving/Performance_Tuning_CN.md)
- [Benchmark(Chinese)](doc/C++_Serving/Benchmark_CN.md)
- [Multiple models in series(Chinese)](doc/C++_Serving/2+_model.md)
- [Request Cache(Chinese)](doc/C++_Serving/Request_Cache_CN.md)
- [Python Pipeline Overview(Chinese)](doc/Python_Pipeline/Pipeline_Int_CN.md)
- [Architecture Design(Chinese)](doc/Python_Pipeline/Pipeline_Design_CN.md)
- [Core Features(Chinese)](doc/Python_Pipeline/Pipeline_Features_CN.md)
- [Performance Optimization(Chinese)](doc/Python_Pipeline/Pipeline_Optimize_CN.md)
- [Benchmark(Chinese)](doc/Python_Pipeline/Pipeline_Benchmark_CN.md)
- Client SDK
- [Python SDK(Chinese)](doc/C++_Serving/Introduction_CN.md#42-多语言多协议Client)
- [JAVA SDK](doc/Java_SDK_EN.md)
- [C++ SDK(Chinese)](doc/C++_Serving/Introduction_CN.md#42-多语言多协议Client)
- [Large-scale sparse parameter server](doc/Cube_Local_EN.md)
<br>
> Developers
For Paddle Serving developers, we provide extended documents such as custom OP, level of detail(LOD) processing.
- [Custom Operators](doc/C++_Serving/OP_EN.md)
- [Processing LoD Data](doc/LOD_EN.md)
- [FAQ(Chinese)](doc/FAQ_CN.md)
<h2 align="center">Model Zoo</h2>
Paddle Serving works closely with the Paddle model suite, and implements a large number of service deployment examples, including image classification, object detection, language and text recognition, Chinese part of speech, sentiment analysis, content recommendation and other types of examples, for a total of 46 models.
<p align="center">
| PaddleOCR | PaddleDetection | PaddleClas | PaddleSeg | PaddleRec | Paddle NLP | Paddle Video |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| 8 | 12 | 14 | 2 | 3 | 6 | 1 |
| 8 | 12 | 14 | 2 | 3 | 6 | 1|
</p>
更多模型示例进入[模型库](doc/Model_Zoo_CN.md)
For more model examples, read [Model zoo](doc/Model_Zoo_EN.md)
<p align="center">
<img src="https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/doc/imgs_results/PP-OCRv2/PP-OCRv2-pic003.jpg?raw=true" width="345"/>
<img src="https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/doc/imgs_results/PP-OCRv2/PP-OCRv2-pic003.jpg?raw=true" width="345"/>
<img src="doc/images/detection.png" width="350">
</p>
<h2 align="center">社区</h2>
<h2 align="center">Community</h2>
您想要同开发者和其他用户沟通吗?欢迎加入我们,通过如下方式加入社群
If you want to communicate with developers and other users? Welcome to join us, join the community through the following methods below.
### 微信
- 微信用户请扫码
### Wechat
- WeChat scavenging
<p align="center">
<img src="doc/images/wechat_group_1.jpeg" width="250">
</p>
### QQ
- 飞桨推理部署交流群(Group No.:697765514)
- QQ Group(Group No.:697765514)
<p align="center">
<img src="doc/images/qq_group_1.png" width="200">
</p>
> 贡献代码
> Contribution
如果您想为Paddle Serving贡献代码,请参考 [Contribution Guidelines(English)](doc/Contribute_EN.md)
- 感谢 [@loveululu](https://github.com/loveululu) 提供 Cube python API
- 感谢 [@EtachGu](https://github.com/EtachGu) 更新 docker 使用命令
- 感谢 [@BeyondYourself](https://github.com/BeyondYourself) 提供grpc教程,更新FAQ教程,整理文件目录。
- 感谢 [@mcl-stone](https://github.com/mcl-stone) 提供faster rcnn benchmark脚本
- 感谢 [@cg82616424](https://github.com/cg82616424) 提供unet benchmark脚本和修改部分注释错误
- 感谢 [@cuicheng01](https://github.com/cuicheng01) 提供PaddleClas的11个模型
- 感谢 [@Jiaqi Liu](https://github.com/LiuChiachi) 新增list[str]类型输入的预测支持
- 感谢 [@Bin Lu](https://github.com/Intsigstephon) 提供PP-Shitu C++模型示例
If you want to contribute code to Paddle Serving, please reference [Contribution Guidelines](doc/Contribute_EN.md)
- Thanks to [@loveululu](https://github.com/loveululu) for providing python API of Cube.
- Thanks to [@EtachGu](https://github.com/EtachGu) in updating run docker codes.
- Thanks to [@BeyondYourself](https://github.com/BeyondYourself) in complementing the gRPC tutorial, updating the FAQ doc and modifying the mdkir command
- Thanks to [@mcl-stone](https://github.com/mcl-stone) in updating faster_rcnn benchmark
- Thanks to [@cg82616424](https://github.com/cg82616424) in updating the unet benchmark modifying resize comment error
- Thanks to [@cuicheng01](https://github.com/cuicheng01) for providing 11 PaddleClas models
- Thanks to [@Jiaqi Liu](https://github.com/LiuChiachi) for supporting prediction for string list input
- Thanks to [@Bin Lu](https://github.com/Intsigstephon) for adding pp-shitu example
> 反馈
> Feedback
如有任何反馈或是bug,请在 [GitHub Issue](https://github.com/PaddlePaddle/Serving/issues)提交
For any feedback or to report a bug, please propose a [GitHub Issue](https://github.com/PaddlePaddle/Serving/issues).
> License
......
......@@ -39,7 +39,7 @@ if (WITH_GPU)
set(WITH_TRT ON)
elseif(CUDA_VERSION EQUAL 10.2)
if(CUDNN_MAJOR_VERSION EQUAL 7)
set(CUDA_SUFFIX "x86-64_gcc5.4_avx_mkl_cuda10.2_cudnn7.6.5_trt6.0.1.5")
set(CUDA_SUFFIX "x86-64_gcc8.2_avx_mkl_cuda10.2_cudnn7.6.5_trt6.0.1.5")
set(WITH_TRT ON)
elseif(CUDNN_MAJOR_VERSION EQUAL 8)
set(CUDA_SUFFIX "x86-64_gcc8.2_avx_mkl_cuda10.2_cudnn8.1.1_trt7.2.3.4")
......
# 如何使用Paddle Serving做ABTEST
# C++ Serving ABTest
(简体中文|[English](./ABTest_EN.md))
- [功能设计](#1)
- [使用案例](#2)
- [1.1 安装 Paddle Serving Wheels](#2.1)
- [1.2 下载多个模型并保存模型参数](#2.2)
- [1.3 启动 A,B,C 3个服务](#2.3)
- [1.4 客户端注册 A,B,C 服务端地址](#2.4)
- [1.5 启动客户端并验证结果](#2.5)
该文档将会用一个基于IMDB数据集的文本分类任务的例子,介绍如何使用Paddle Serving搭建A/B Test框架,例中的Client端、Server端结构如下图所示
ABTest 是一种功能测试方案,一般是为同一个产品目标制定多种方案,让一部分用户使用 A 方案,另一部分用户使用 B 或 C 方案,根据测试效果,如点击率、转化率等来评价方案的优劣
<img src="../images/abtest.png" style="zoom:33%;" />
模型服务化部署框架中,ABTest 属于一个重要的基础功能,为模型迭代升级提供实验环境。Paddle Serving 的 PYTHON SDK 中实现 ABTest 功能,为用户提供简单易用功能测试环境。
需要注意的是:A/B Test只适用于RPC模式,不适用于WEB模式。
<a name="1"></a>
### 下载数据以及模型
## 功能设计
``` shell
cd Serving/examples/C++/imdb
sh get_data.sh
```
Paddle Serving 的 ABTest 功能是基于 PYTHON SDK 和 多个服务端构成。每个服务端加载不同模型,在客户端上注册多个服务端地址和访问比例,最终确定访问。
<div align=center>
<img src='../images/6-5_Cpp_ABTest_CN_1.png' height = "400" align="middle"/>
</div
<a name="2"></a>
### 处理数据
由于处理数据需要用到相关库,请使用pip进行安装
``` shell
pip install paddlepaddle
pip install paddle-serving-app
pip install Shapely
````
您可以直接运行下面的命令来处理数据。
## 使用案例
[python abtest_get_data.py](../../examples/C++/imdb/abtest_get_data.py)
[imdb](https://github.com/PaddlePaddle/Serving/tree/develop/examples/C%2B%2B/imdb) 示例为例,介绍 ABTest 的使用,部署有5个步骤:
文件中的Python代码将处理`test_data/part-0`的数据,并将处理后的数据生成并写入`processed.data`文件中。
1. 安装 Paddle Serving Wheels
2. 下载多个模型并保存模型参数
3. 启动 A,B,C 3个服务
4. 客户端注册 A,B,C 服务端地址
5. 启动客户端并验证结果
### 启动Server端
<a name="2.1"></a>
这里采用[Docker方式](../Install_CN.md)启动Server端服务。
**一.安装 Paddle Serving Wheels**
首先启动BOW Server,该服务启用`8000`端口
使用 ABTest 功能的前提是使用 PYTHON SDK,因此需要安装 `paddle_serving_client` 的 wheel 包。[安装方法](../Docker_Images_CN.md) 如下
```bash
docker run -dit -v $PWD/imdb_bow_model:/model -p 8000:8000 --name bow-server registry.baidubce.com/paddlepaddle/serving:latest /bin/bash
docker exec -it bow-server /bin/bash
pip install paddle-serving-server -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install paddle-serving-client -i https://pypi.tuna.tsinghua.edu.cn/simple
python -m paddle_serving_server.serve --model model --port 8000 >std.log 2>err.log &
exit
```
pip3 install paddle-serving-client==0.8.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
```
<a name="2.2"></a>
**二.下载多个模型并保存模型参数**
同理启动LSTM Server,该服务启用`9000`端口:
本示例已提供了一键下载脚本 `sh get_data.sh`,下载自训练的模型 `bow``cnn``lstm` 3种不同方式训练的模型。
```bash
docker run -dit -v $PWD/imdb_lstm_model:/model -p 9000:9000 --name lstm-server registry.baidubce.com/paddlepaddle/serving:latest /bin/bash
docker exec -it lstm-server /bin/bash
pip install paddle-serving-server -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install paddle-serving-client -i https://pypi.tuna.tsinghua.edu.cn/simple
python -m paddle_serving_server.serve --model model --port 9000 >std.log 2>err.log &
exit
```
sh get_data.sh
```
### 启动Client端
为了模拟ABTEST工况,您可以在宿主机运行下面Python代码启动Client端,但需确保宿主机具备相关环境,您也可以在docker环境下运行.
3种模型的所有文件如下所示,已为用户提前保存模型参数,无需执行保存操作。
```
├── imdb_bow_client_conf
│   ├── serving_client_conf.prototxt
│   └── serving_client_conf.stream.prototxt
├── imdb_bow_model
│   ├── embedding_0.w_0
│   ├── fc_0.b_0
│   ├── fc_0.w_0
│   ├── fc_1.b_0
│   ├── fc_1.w_0
│   ├── fc_2.b_0
│   ├── fc_2.w_0
│   ├── fluid_time_file
│   ├── __model__
│   ├── serving_server_conf.prototxt
│   └── serving_server_conf.stream.prototxt
├── imdb_cnn_client_conf
│   ├── serving_client_conf.prototxt
│   └── serving_client_conf.stream.prototxt
├── imdb_cnn_model
│   ├── embedding_0.w_0
│   ├── fc_0.b_0
│   ├── fc_0.w_0
│   ├── fc_1.b_0
│   ├── fc_1.w_0
│   ├── fluid_time_file
│   ├── __model__
│   ├── sequence_conv_0.b_0
│   ├── sequence_conv_0.w_0
│   ├── serving_server_conf.prototxt
│   └── serving_server_conf.stream.prototxt
├── imdb_lstm_client_conf
│   ├── serving_client_conf.prototxt
│   └── serving_client_conf.stream.prototxt
├── imdb_lstm_model
│   ├── embedding_0.w_0
│   ├── fc_0.b_0
│   ├── fc_0.w_0
│   ├── fc_1.b_0
│   ├── fc_1.w_0
│   ├── fc_2.b_0
│   ├── fc_2.w_0
│   ├── lstm_0.b_0
│   ├── lstm_0.w_0
│   ├── __model__
│   ├── serving_server_conf.prototxt
│   └── serving_server_conf.stream.prototxt
```
运行前使用`pip install paddle-serving-client`安装paddle-serving-client包。
虽然3个模型的网络结构不同,但是 `feed var``fetch_var` 都是相同的便于做 ABTest。
```
feed_var {
name: "words"
alias_name: "words"
is_lod_tensor: true
feed_type: 0
shape: -1
}
fetch_var {
name: "fc_2.tmp_2"
alias_name: "prediction"
is_lod_tensor: false
fetch_type: 1
shape: 2
}
```
<a name="2.3"></a>
您可以直接使用下面的命令,进行ABTEST预测。
**三.启动 A,B,C 3个服务**
[python abtest_client.py](../../examples/C++/imdb/abtest_client.py)
后台启动 `bow``cnn``lstm` 模型服务:
```python
## 启动 bow 模型服务
python3 -m paddle_serving_server.serve --model imdb_bow_model/ --port 9297 >/dev/null 2>&1 &
## 启动 cnn 模型服务
python3 -m paddle_serving_server.serve --model imdb_cnn_model/ --port 9298 >/dev/null 2>&1 &
## 启动 lstm 模型服务
python3 -m paddle_serving_server.serve --model imdb_lstm_model/ --port 9299 >/dev/null 2>&1 &
```
<a name="2.4"></a>
**四.客户端注册 A,B,C 服务端地址**
使用 `paddle_serving_client``Client::add_variant(self, tag, cluster, variant_weight)` 接口注册服务标签、服务地址和权重。框架会将所有权重求和后计算每个服务的比例。本示例中,bow 服务的权重是10,cnn 服务的权重是30, lstm的权重是60,每次请求分别请求到3个服务的比例是10%、30%和60%。
```
from paddle_serving_client import Client
from paddle_serving_app.reader.imdb_reader import IMDBDataset
import sys
import numpy as np
client = Client()
client.load_client_config('imdb_bow_client_conf/serving_client_conf.prototxt')
client.add_variant("bow", ["127.0.0.1:8000"], 10)
client.add_variant("lstm", ["127.0.0.1:9000"], 90)
client.load_client_config(sys.argv[1])
client.add_variant("bow", ["127.0.0.1:9297"], 10)
client.add_variant("cnn", ["127.0.0.1:9298"], 30)
client.add_variant("lstm", ["127.0.0.1:9299"], 60)
client.connect()
```
如要在结果中打印请求到了哪个服务,在 `client.predict(feed, fetch, batch, need_variant_tag, logid)` 中设置 `need_variant_tag=True`
<a name="2.5"></a>
print('please wait for about 10s')
with open('processed.data') as f:
cnt = {"bow": {'acc': 0, 'total': 0}, "lstm": {'acc': 0, 'total': 0}}
for line in f:
word_ids, label = line.split(';')
word_ids = [int(x) for x in word_ids.split(',')]
word_len = len(word_ids)
feed = {
"words": np.array(word_ids).reshape(word_len, 1),
"words.lod": [0, word_len]
}
fetch = ["acc", "cost", "prediction"]
[fetch_map, tag] = client.predict(feed=feed, fetch=fetch, need_variant_tag=True,batch=True)
if (float(fetch_map["prediction"][0][1]) - 0.5) * (float(label[0]) - 0.5) > 0:
cnt[tag]['acc'] += 1
cnt[tag]['total'] += 1
for tag, data in cnt.items():
print('[{}]<total: {}> acc: {}'.format(tag, data['total'], float(data['acc'])/float(data['total']) ))
**五.启动客户端并验证结果**
运行命令:
```
head test_data/part-0 | python3.7 abtest_client.py imdb_cnn_client_conf/serving_client_conf.prototxt imdb.vocab
```
代码中,`client.add_variant(tag, clusters, variant_weight)`是为了添加一个标签为`tag`、流量权重为`variant_weight`的variant。在这个样例中,添加了一个标签为`bow`、流量权重为`10`的BOW variant,以及一个标签为`lstm`、流量权重为`90`的LSTM variant。Client端的流量会根据`10:90`的比例分发到两个variant。
Client端做预测时,若指定参数`need_variant_tag=True`,返回值则包含分发流量对应的variant标签。
运行结果如下,10次请求中,bow 服务2次,cnn 服务3次,lstm 服务5次,与设置的比例基本相近。
```
I0506 04:02:46.720135 44567 naming_service_thread.cpp:202] brpc::policy::ListNamingService("127.0.0.1:9297"): added 1
I0506 04:02:46.722630 44567 naming_service_thread.cpp:202] brpc::policy::ListNamingService("127.0.0.1:9298"): added 1
I0506 04:02:46.723577 44567 naming_service_thread.cpp:202] brpc::policy::ListNamingService("127.0.0.1:9299"): added 1
I0506 04:02:46.814075 44567 general_model.cpp:490] [client]logid=0,client_cost=9.889ms,server_cost=6.283ms.
server_tag=lstm prediction=[0.500398 0.49960205]
I0506 04:02:46.826339 44567 general_model.cpp:490] [client]logid=0,client_cost=10.261ms,server_cost=9.503ms.
server_tag=lstm prediction=[0.5007235 0.49927652]
I0506 04:02:46.828992 44567 general_model.cpp:490] [client]logid=0,client_cost=1.667ms,server_cost=0.741ms.
server_tag=bow prediction=[0.25859657 0.74140346]
I0506 04:02:46.843299 44567 general_model.cpp:490] [client]logid=0,client_cost=13.402ms,server_cost=12.827ms.
server_tag=lstm prediction=[0.50039905 0.4996009 ]
I0506 04:02:46.850219 44567 general_model.cpp:490] [client]logid=0,client_cost=5.129ms,server_cost=4.332ms.
server_tag=cnn prediction=[0.6369219 0.36307803]
I0506 04:02:46.854203 44567 general_model.cpp:490] [client]logid=0,client_cost=2.804ms,server_cost=0.782ms.
server_tag=bow prediction=[0.15088597 0.849114 ]
I0506 04:02:46.858268 44567 general_model.cpp:490] [client]logid=0,client_cost=3.292ms,server_cost=2.677ms.
server_tag=cnn prediction=[0.4608788 0.5391212]
I0506 04:02:46.869217 44567 general_model.cpp:490] [client]logid=0,client_cost=10.13ms,server_cost=9.556ms.
server_tag=lstm prediction=[0.5000269 0.49997318]
I0506 04:02:46.883790 44567 general_model.cpp:490] [client]logid=0,client_cost=13.312ms,server_cost=12.822ms.
server_tag=lstm prediction=[0.50083774 0.49916226]
I0506 04:02:46.887256 44567 general_model.cpp:490] [client]logid=0,client_cost=2.432ms,server_cost=1.812ms.
server_tag=cnn prediction=[0.47895813 0.52104187]
### 预期结果
由于网络情况的不同,可能每次预测的结果略有差异。
``` bash
[lstm]<total: 1867> acc: 0.490091055169
[bow]<total: 217> acc: 0.73732718894
```
# C++ Serving 异步模式
- [设计方案](#1)
- [网络同步线程](#1.1)
- [异步调度线程](#1.2)
- [动态批量](#1.3)
- [使用案例](#2)
- [开启同步模式](#2.1)
- [开启异步模式](#2.2)
- [性能测试](#3)
- [测试结果](#3.1)
- [测试数据](#3.2)
<a name="1"></a>
## 设计方案
<a name="1.1"></a>
**一.同步网络线程**
Paddle Serving 的网络框架层面是同步处理模式,即 bRPC 网络处理线程从系统内核拿到完整请求数据后( epoll 模式),在同一线程内完成业务处理,C++ Serving 默认使用同步模式。同步模式比较简单直接,适用于模型预测时间短,或单个 Request 请求批量较大的情况。
<p align="center">
<img src='../images/syn_mode.png' width = "350" height = "300">
<p>
Server 端线程数 N = 模型预测引擎数 N = 同时处理 Request 请求数 N,超发的 Request 请求需要等待当前线程处理结束后才能得到响应和处理。
<a name="1.2"></a>
**二.异步调度线程**
为了提高计算芯片吞吐和计算资源利用率,C++ Serving 在调度层实现异步多线程并发合并请求,实现动态批量推理。异步模型主要适用于模型支持批量,单个 Request 请求的无批量或较小,单次预测时间较长的情况。
<p align="center">
<img src='../images/asyn_mode.png'>
<p>
异步模式下,Server 端 N 个线程只负责接收 Request 请求,实际调用预测引擎是在异步框架的线程池中,异步框架的线程数可以由配置选项来指定。为了方便理解,我们假设每个 Request 请求批量均为1,此时异步框架会尽可能多得从请求池中取 n(n≤M)个 Request 并将其拼装为1个 Request(batch=n),调用1次预测引擎,得到1个 Response(batch = n),再将其对应拆分为 n 个 Response 作为返回结果。
<a name="1.3"></a>
**三.动态批量**
通常,异步框架合并多个请求的前提是所有请求的 `feed var` 的维度除 batch 维度外必须是相同的。例如,以 OCR 文字识别案例中检测模型为例,A 请求的 `x` 变量的 shape 是 [1, 3, 960, 960],B 请求的 `x` 变量的 shape 是 [2, 3, 960, 960],虽然第一个维度值不相同,但第一个维度属于 `batch` 维度,因此,请求 A 和 请求 B 可以合并。C 请求的 `x` 变量的 shape 是 [1, 3, 640, 480],由于除了 `batch` 维度外还有2个维度值不同,A 和 C 不能直接合并。
从经验来看,当2个请求的同一个变量 shape 维度的数量相等时,通过 `padding` 补0的方式按最大 shape 值对齐即可。即 C 请求的 shape 补齐到 [1, 3, 960, 960],那么就可以与 A 和 B 请求合并了。Paddle Serving 框架实现了动态 Padding 功能补齐 shape。
当多个将要合并的请求中有一个 shape 值很大时,所有请求的 shape 都要按最大补齐,导致计算量成倍增长。Paddle Serving 设计了一套合并策略,满足任何一个条件均可合并:
- 条件 1:绝对值差的字节数小于 **1024** 字节,评估补齐绝对长度
- 条件 2:相似度的乘积大于 **50%**,评估相似度,评估补齐绝对值整体数据量比例
场景1:`Shape-1 = [batch, 500, 500], Shape-2 = [batch, 400, 400]`。此时,`绝对值差 = 500*500 - 400*400 = 90000` 字节,`相对误差= (400/500) * (400/500) = 0.8*0.8 = 0.64`,满足条件1,不满足条件2,触发动态 Padding。
场景2:`Shape-1 = [batch, 1, 1], Shape-2 = [batch, 2, 2]`。此时,`绝对值差 = 2*2 - 1*1 = 3`字节,`相对误差 = (1/2) * (1/2) = 0.5*0.5 = 0.25`,满足条件2,不满足条件1,触发动态 Padding。
场景3:`Shape-1 = [batch, 3, 320, 320], Shape-2 = [batch, 3, 960, 960]`。此时,`绝对值差 = 3*960*960 - 3*320*320 = 2457600`字节,`相对误差 = (3/3) * (320/960) * (320/960) = 0.3*0.3 = 0.09`,条件1和条件2均不满足,未触发动态 Padding。
<a name="2"></a>
## 使用案例
<a name="2.1"></a>
**一.开启同步模式**
启动命令不使用 `--runtime_thread_num``--batch_infer_size` 时,属于同步处理模式,未开启异步模式。`--thread 16` 表示启动16个同步网络处理线程。
```
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 16 --port 9292
```
<a name="2.2"></a>
**二.开启异步模式**
启动命令使用 `--runtime_thread_num 2``--batch_infer_size 32` 开启异步模式,Serving 框架会启动2个异步线程,单次合并最大批量为32,自动开启动态 Padding。
```
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 16 --port 9292 --runtime_thread_num 4 --batch_infer_size 32 --ir_optim --gpu_multi_stream --gpu_ids 0
```
<a name="3"></a>
## 性能测试
- GPU:Tesla P4 7611 MiB
- CUDA:cuda11.2-cudnn8-trt8
- Python 版本:python3.7
- 模型:ResNet_v2_50
- 测试数据:构造全1输入,单client请求100次,shape 范围(1, 224 ± 50, 224 ± 50)
同步模式启动命令:
```
python3 -m paddle_serving_server.serve --model resnet_v2_50_imagenet_model --port 9393 --thread 8 --ir_optim --gpu_multi_stream --gpu_ids 1 --enable_prometheus --prometheus_port 1939
```
异步模式启动命令:
```
python3 -m paddle_serving_server.serve --model resnet_v2_50_imagenet_model --port 9393 --thread 64 --runtime_thread_num 8 --ir_optim --gpu_multi_stream --gpu_ids 1 --enable_prometheus --prometheus_port 19393
```
<a name="3.1"></a>
**一.测试结果**
使用异步模式,并开启动态批量后,并发测试不同 shape 数据时,吞吐性能大幅提升。
<div align=center>
<img src='../images/6-1_Cpp_Asynchronous_Framwork_CN_1.png' height = "600" align="middle"/>
</div
由于动态批量导致响应时长增长,经过测试,大多数场景下吞吐增量高于响应时长增长,尤其在高并发场景(client=70时),在响应时长增长 33% 情况下,吞吐增加 105%。
|Client |1 |5 |10 | 20 |30 |40 |50 |70 |
|---|---|---|---|---|---|---|---|---|
|QPS |-2.08% |-7.23% |-1.89% |20.55% |23.02% |23.34% |46.41% |105.27% |
|响应时长 | 2.70% |7.09% |5.24% |13.34% |10.80% |43.60% |8.72% |33.89% |
异步模式可有效提升服务吞吐性能。
<a name="3.2"></a>
**二.测试数据**
1. 同步模式
| client_num | batch_size |CPU_util_pre(%) |CPU_util(%) |GPU_memory(mb) |GPU_util(%) |qps(samples/s) |total count |mean(ms) |median(ms) |80 percent(ms) |90 percent(ms) |99 percent(ms) |total cost(s) |each cost(s)|infer_count_total|infer_cost_total(ms)|infer_cost_avg(ms)|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 |1 |1.30 |18.90 |2066 |71.56 |22.938 |100 |43.594 |23.516 |78.118 |78.323 |133.544 |4.4262 |4.3596 |7100.0000 |1666392.70 | 41.1081 |
| 5 |1 |2.00 |28.20 |3668 |92.57 |33.630 |500 |148.673 |39.531 |373.231 |396.306 |419.088 |15.0606 |14.8676 |7600.0000 |1739372.7480| 145.9601 |
|10 |1 |1.90 |29.80 |4202 |91.98 |34.303 |1000 |291.512 |76.728 |613.963 |632.736 |1217.863 |29.8004 |29.1516 |8600.0000 |1974147.7420| 234.7750 |
|20 |1 |4.70 |49.60 |4736 |92.63 |34.359 |2000 |582.089 |154.952 |1239.115 |1813.371 |1858.128 |59.7303 |58.2093 |12100.0000 |2798459.6330 |235.6248 |
|30 |1 |5.70 |65.70 |4736 |92.60 |34.162 |3000 |878.164 |231.121 |2391.687 |2442.744 |2499.963 |89.6546 |87.8168 |17600.0000 |4100408.9560 |236.6877 |
|40 |1 |5.40 |74.40 |5270 |92.44 |34.090 |4000 |1173.373 |306.244 |3037.038 |3070.198 |3134.894 |119.4162 |117.3377 |21600.0000 |5048139.2170 |236.9326|
|50 |1 |1.40 |64.70 |5270 |92.37 |34.031 |5000 |1469.250 |384.327 |3676.812 |3784.330 |4366.862 |149.7041 |146.9254 |26600.0000 |6236269.4230 |237.6260|
|70 |1 |3.70 |79.70 |5270 |91.89 |33.976 |7000 |2060.246 |533.439 |5429.255 |5552.704 |5661.492 |210.1008 |206.0250 |33600.0000 |7905005.9940 |238.3909|
2. 异步模式 - 未开启动态批量
| client_num | batch_size |CPU_util_pre(%) |CPU_util(%) |GPU_memory(mb) |GPU_util(%) |qps(samples/s) |total count |mean(ms) |median(ms) |80 percent(ms) |90 percent(ms) |99 percent(ms) |total cost(s) |each cost(s)|infer_count_total|infer_cost_total(ms)|infer_cost_avg(ms)|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 |1 |6.20 |13.60 |5170 |71.11 |22.894 |100 |43.677 |23.992 |78.285 |78.788 |123.542 |4.4253 |4.3679 |3695.0000 |745061.9120 |40.6655 |
| 5 |1 |6.10 |32.20 |7306 |89.54 |33.532 |500 |149.109 |43.906 |376.889 |401.999 |422.753 |15.1623 |14.9113 |4184.0000 |816834.2250 |146.7736|
|10 |1 |4.90 |43.60 |7306 |91.55 |38.136 |1000 |262.216 |75.393 |575.788 |632.016 |1247.775 |27.1019 |26.2220 |5107.0000 |1026490.3950 |227.1464|
|20 |1 |5.70 |39.60 |7306 |91.36 |58.601 |2000 |341.287 |145.774 |646.824 |994.748 |1132.979 |38.3915 |34.1291 |7461.0000 |1555234.6260 |229.9113|
|30 |1 |1.30 |45.40 |7484 |91.10 |69.008 |3000 |434.728 |204.347 |959.184 |1092.181 |1661.289 |46.3822 |43.4732 |10289.0000 |2269499.9730 |249.4257|
|40 |1 |3.10 |73.00 |7562 |91.83 |80.956 |4000 |494.091 |272.889 |966.072 |1310.011 |1851.887 |52.0609 |49.4095 |12102.0000 |2678878.2010 |225.8016|
|50 |1 |0.80 |68.00 |7522 |91.10 |83.018 |5000 |602.276 |364.064 |1058.261 |1473.051 |1671.025 |72.9869 |60.2280 |14225.0000 |3256628.2820 |272.1385|
|70 |1 |6.10 |78.40 |7584 |92.02 |65.069 |7000 |1075.777 |474.014 |2411.296 |2705.863 |3409.085 |111.6653 |107.5781 |17974.0000 |4139377.4050 |235.4626
3. 异步模式 - 开启动态批量
| client_num | batch_size |CPU_util_pre(%) |CPU_util(%) |GPU_memory(mb) |GPU_util(%) |qps(samples/s) |total count |mean(ms) |median(ms) |80 percent(ms) |90 percent(ms) |99 percent(ms) |total cost(s) |each cost(s)|infer_count_total|infer_cost_total(ms)|infer_cost_avg(ms)|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 |1 |1.20 |13.30 |6048 |70.07 |22.417 |100 |44.606 |24.486 |78.365 |78.707 |139.349 |4.5201 |4.4608 |1569.0000 |462418.6390 |41.7646 |
| 5 |1 |1.20 |50.80 |7116 |87.37 |31.106 |500 |160.740 |42.506 |414.903 |458.841 |481.112 |16.3525 |16.0743 |2059.0000 |539439.3300 |157.1851
|10 |1 |0.80 |26.20 |7264 |88.74 |37.417 |1000 |267.254 |79.452 |604.451 |686.477 |1345.528 |27.9848 |26.7258 |2950.0000 |752428.0570 |239.0446|
|20 |1 |1.50 |32.80 |7264 |89.52 |70.641 |2000 |283.117 |133.441 |516.066 |652.089 |1274.957 |33.0280 |28.3121 |4805.0000 |1210814.5610 |260.5873|
|30 |1 |0.90 |59.10 |7348 |89.57 |84.894 |3000 |353.380 |217.385 |613.587 |757.829 |1277.283 |40.7093 |35.3384 |6924.0000 |1817515.1710 |276.3695|
|40 |1 |1.30 |57.30 |7356 |89.30 |99.853 |4000 |400.584 |204.425 |666.015 |1031.186 |1380.650 |49.4807 |40.0588 |8104.0000 |2200137.0060 |324.2558|
|50 |1 |1.50 |50.60 |7578 |89.04 |121.545 |5000 |411.364 |331.118 |605.809 |874.543 |1285.650 |48.2343 |41.1369 |9350.0000 |2568777.6400 |295.8593|
|70 |1 |3.80 |83.20 |7602 |89.59 |133.568 |7000 |524.073 |382.653 |799.463 |1202.179 |1576.809 |57.2885 |52.4077 |10761.0000 |3013600.9670 |315.2540|
# 加密模型预测
(简体中文|[English](./Encryption_EN.md))
Padle Serving提供了模型加密预测功能,本文档显示了详细信息。
Padle Serving 提供了模型加密预测功能,本文档显示了详细信息。
## 原理
采用对称加密算法对模型进行加密。对称加密算法采用同一密钥进行加解密,它计算量小,速度快,是最常用的加密方法。
### 获得加密模型
**一. 获得加密模型:**
普通的模型和参数可以理解为一个字符串,通过对其使用加密算法(参数是您的密钥),普通模型和参数就变成了一个加密的模型和参数。
我们提供了一个简单的演示来加密模型。请参阅[examples/C++/encryption/encrypt.py](../../examples/C++/encryption/encrypt.py)
### 启动加密服务
**二. 启动加密服务:**
假设您已经有一个已经加密的模型(在`encrypt_server/`路径下),您可以通过添加一个额外的命令行参数 `--use_encryption_model`来启动加密模型服务。
......@@ -30,7 +28,7 @@ python -m paddle_serving_server.serve --model encrypt_server/ --port 9300 --use_
此时,服务器不会真正启动,而是等待密钥。
### Client Encryption Inference
**三. Client Encryption Inference:**
首先,您必须拥有模型加密过程中使用的密钥。
......@@ -39,5 +37,6 @@ python -m paddle_serving_server.serve --model encrypt_server/ --port 9300 --use_
一旦服务器获得密钥,它就使用该密钥解析模型并启动模型预测服务。
### 模型加密推理示例
**四. 模型加密推理示例:**
模型加密推理示例, 请参见[examples/C++/encryption/](../../examples/C++/encryption/)
# Paddle Serving中的模型热加载
(简体中文|[English](./Hot_Loading_EN.md))
# Paddle Serving 中的模型热加载
## 背景
......@@ -8,35 +6,35 @@
## Server Monitor
Paddle Serving提供了一个自动监控脚本,远端地址更新模型后会拉取新模型更新本地模型,同时更新本地模型文件夹中的时间戳文件`fluid_time_stamp`实现热加载。
Paddle Serving 提供了一个自动监控脚本,远端地址更新模型后会拉取新模型更新本地模型,同时更新本地模型文件夹中的时间戳文件 `fluid_time_stamp` 实现热加载。
目前支持下面几种类型的远端监控Monitor:
目前支持下面几种类型的远端监控 Monitor:
| Monitor类型 | 描述 | 特殊选项 |
| :---------: | :----------------------------------------------------------: | :----------------------------------------------------------: |
| general | 远端无认证,可以通过`wget`直接访问下载文件(如无需认证的FTP,BOS等) | `general_host` 通用远端host |
| hdfs/afs(HadoopMonitor) | 远端为HDFS或AFS,通过Hadoop-Client执行相关命令 | `hadoop_bin` Hadoop二进制的路径<br/>`fs_name` Hadoop fs_name,默认为空<br/>`fs_ugi` Hadoop fs_ugi,默认为空 |
| ftp | 远端为FTP,通过`ftplib`进行相关访问(使用该Monitor,您可能需要执行`pip install ftplib`下载`ftplib`) | `ftp_host` FTP host<br>`ftp_port` FTP port<br>`ftp_username` FTP username,默认为空<br>`ftp_password` FTP password,默认为空 |
| general | 远端无认证,可以通过 `wget` 直接访问下载文件(如无需认证的FTP,BOS等) | `general_host` 通用远端host |
| hdfs/afs(HadoopMonitor) | 远端为 HDFS 或 AFS,通过 Hadoop-Client 执行相关命令 | `hadoop_bin` Hadoop 二进制的路径 <br/>`fs_name` Hadoop fs_name,默认为空<br/>`fs_ugi` Hadoop fs_ugi,默认为空 |
| ftp | 远端为 FTP,通过 `ftplib` 进行相关访问(使用该 Monitor,您可能需要执行 `pip install ftplib` 下载 `ftplib`) | `ftp_host` FTP host<br>`ftp_port` FTP port<br>`ftp_username` FTP username,默认为空<br>`ftp_password` FTP password,默认为空 |
| Monitor通用选项 | 描述 | 默认值 |
| :--------------------: | :----------------------------------------------------------: | :--------------------: |
| `type` | 指定Monitor类型 | 无 |
| `type` | 指定 Monitor 类型 | 无 |
| `remote_path` | 指定远端的基础路径 | 无 |
| `remote_model_name` | 指定远端需要拉取的模型名 | 无 |
| `remote_donefile_name` | 指定远端标志模型更新完毕的donefile文件名 | 无 |
| `remote_donefile_name` | 指定远端标志模型更新完毕的 donefile 文件名 | 无 |
| `local_path` | 指定本地工作路径 | 无 |
| `local_model_name` | 指定本地模型名 | 无 |
| `local_timestamp_file` | 指定本地用于热加载的时间戳文件,该文件被认为在`local_path/local_model_name`下。 | `fluid_time_file` |
| `local_timestamp_file` | 指定本地用于热加载的时间戳文件,该文件被认为在 `local_path/local_model_name` 下。 | `fluid_time_file` |
| `local_tmp_path` | 指定本地存放临时文件的文件夹路径,若不存在则自动创建。 | `_serving_monitor_tmp` |
| `interval` | 指定轮询间隔时间,单位为秒。 | `10` |
| `unpacked_filename` | Monitor支持tarfile打包的远程模型。如果远程模型是打包格式,则需要设置该选项来告知Monitor解压后的文件名。 | `None` |
| `debug` | 如果添加`--debug`选项,则输出更详细的中间信息。 | 默认不添加该选项 |
| `unpacked_filename` | Monitor 支持 tarfile 打包的远程模型。如果远程模型是打包格式,则需要设置该选项来告知 Monitor 解压后的文件名。 | `None` |
| `debug` | 如果添加 `--debug` 选项,则输出更详细的中间信息。 | 默认不添加该选项 |
下面通过HadoopMonitor示例来展示Paddle Serving的模型热加载功能。
下面通过 HadoopMonitor 示例来展示 Paddle Serving 的模型热加载功能。
## HadoopMonitor示例
## HadoopMonitor 示例
示例中在`product_path`中生产模型上传至hdfs,在`server_path`中模拟服务端模型热加载:
示例中在 `product_path` 中生产模型上传至 hdfs,在 `server_path` 中模拟服务端模型热加载:
```shell
.
......@@ -44,9 +42,9 @@ Paddle Serving提供了一个自动监控脚本,远端地址更新模型后会
└── server_path
```
### 生产模型
**一.生产模型**
`product_path`下运行下面的Python代码生产模型(运行前需要修改hadoop相关的参数),每隔 60 秒会产出 Boston 房价预测模型的打包文件`uci_housing.tar.gz`并上传至hdfs的`/`路径下,上传完毕后更新时间戳文件`donefile`并上传至hdfs`/`路径下。
`product_path` 下运行下面的 Python 代码生产模型(运行前需要修改 hadoop 相关的参数),每隔 60 秒会产出 Boston 房价预测模型的打包文件 `uci_housing.tar.gz` 并上传至 hdfs 的`/`路径下,上传完毕后更新时间戳文件 `donefile` 并上传至 hdfs `/`路径下。
```python
import os
......@@ -121,7 +119,7 @@ for pass_id in range(30):
push_to_hdfs(donefile_name, '/')
```
hdfs上的文件如下列所示:
hdfs 上的文件如下列所示:
```bash
# hadoop fs -ls /
......@@ -130,11 +128,11 @@ Found 2 items
-rw-r--r-- 1 root supergroup 2101 2020-04-02 02:54 /uci_housing.tar.gz
```
### 服务端加载模型
**二.服务端加载模型**
进入`server_path`文件夹。
进入 `server_path` 文件夹。
#### 用初始模型启动Server
1. 用初始模型启动 Server
这里使用预训练的 Boston 房价预测模型作为初始模型:
......@@ -143,15 +141,15 @@ wget --no-check-certificate https://paddle-serving.bj.bcebos.com/uci_housing.tar
tar -xzf uci_housing.tar.gz
```
启动Server端:
启动 Server 端:
```shell
python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292
```
#### 执行监控程序
2. 执行监控程序
用下面的命令来执行HDFS监控程序:
用下面的命令来执行 HDFS 监控程序:
```shell
python -m paddle_serving_server.monitor \
......@@ -162,7 +160,7 @@ python -m paddle_serving_server.monitor \
--local_tmp_path='_tmp' --unpacked_filename='uci_housing_model' --debug
```
上面代码通过轮询方式监控远程HDFS地址`/`的时间戳文件`/donefile`,当时间戳变更则认为远程模型已经更新,将远程打包模型`/uci_housing.tar.gz`拉取到本地临时路径`./_tmp/uci_housing.tar.gz`下,解包出模型文件`./_tmp/uci_housing_model`后,更新本地模型`./uci_housing_model`以及Paddle Serving的时间戳文件`./uci_housing_model/fluid_time_file`
上面代码通过轮询方式监控远程 HDFS 地址`/`的时间戳文件`/donefile`,当时间戳变更则认为远程模型已经更新,将远程打包模型`/uci_housing.tar.gz`拉取到本地临时路径`./_tmp/uci_housing.tar.gz`下,解包出模型文件`./_tmp/uci_housing_model`后,更新本地模型`./uci_housing_model`以及Paddle Serving的时间戳文件`./uci_housing_model/fluid_time_file`
预计输出如下:
......@@ -197,9 +195,9 @@ python -m paddle_serving_server.monitor \
2020-04-02 10:12 INFO [monitor.py:161] sleep 10s.
```
#### 查看Server日志
3. 查看 Server 日志
通过下面命令查看Server的运行日志:
通过下面命令查看 Server 的运行日志:
```shell
tail -f log/serving.INFO
......
# Inference Protocols
C++ Serving基于BRPC进行服务构建,支持BRPC、GRPC、RESTful请求。请求数据为protobuf格式,详见`core/general-server/proto/general_model_service.proto`。本文介绍构建请求以及解析结果的方法。
C++ Serving 基于 BRPC 进行服务构建,支持 BRPC、GRPC、RESTful 请求。请求数据为 protobuf 格式,详见 `core/general-server/proto/general_model_service.proto`。本文介绍构建请求以及解析结果的方法。
## Tensor
Tensor可以装载多种类型的数据,是Request和Response的基础单元。Tensor的具体定义如下:
**一.Tensor 定义**
Tensor 可以装载多种类型的数据,是 Request 和 Response 的基础单元。Tensor 的具体定义如下:
```protobuf
message Tensor {
......@@ -71,7 +73,7 @@ message Tensor {
};
```
- elem_type:数据类型,当前支持FLOAT32, INT64, INT32, UINT8, INT8, FLOAT16
- elem_type:数据类型,当前支持 FLOAT32, INT64, INT32, UINT8, INT8, FLOAT16
|elem_type|类型|
|---------|----|
......@@ -86,10 +88,12 @@ message Tensor {
|8|INT8|
- shape:数据维度
- lod:lod信息,LoD(Level-of-Detail) Tensor是Paddle的高级特性,是对Tensor的一种扩充,用于支持更自由的数据输入。详见[LOD](../LOD_CN.md)
- lod:lod 信息,LoD(Level-of-Detail) Tensor 是 Paddle 的高级特性,是对 Tensor 的一种扩充,用于支持更自由的数据输入。Lod 相关原理介绍,请参考[相关文档](../LOD_CN.md)
- name/alias_name: 名称及别名,与模型配置对应
### 构建FLOAT32数据Tensor
**二.构建 Tensor 数据**
1. FLOAT32 类型 Tensor
```C
// 原始数据
......@@ -99,7 +103,7 @@ Tensor *tensor = new Tensor;
for (uint32_t j = 0; j < float_shape.size(); ++j) {
tensor->add_shape(float_shape[j]);
}
// 设置LOD信息
// 设置 LOD 信息
for (uint32_t j = 0; j < float_lod.size(); ++j) {
tensor->add_lod(float_lod[j]);
}
......@@ -113,7 +117,7 @@ tensor->mutable_float_data()->Resize(total_number, 0);
memcpy(tensor->mutable_float_data()->mutable_data(), float_datadata(), total_number * sizeof(float));
```
### 构建INT8数据Tensor
2. INT8 类型 Tensor
```C
// 原始数据
......@@ -133,7 +137,9 @@ tensor->set_tensor_content(string_data);
## Request
Request为客户端需要发送的请求数据,其以Tensor为基础数据单元,并包含了额外的请求信息。定义如下:
**一.Request 定义**
Request 为客户端需要发送的请求数据,其以 Tensor 为基础数据单元,并包含了额外的请求信息。定义如下:
```protobuf
message Request {
......@@ -148,9 +154,11 @@ message Request {
- profile_server: 调试参数,打开时会输出性能信息
- log_id: 请求ID
### 构建Request
**二.构建 Request**
当使用BRPC或GRPC进行请求时,使用protobuf形式数据,构建方式如下:
1. Protobuf 形式
当使用 BRPC 或 GRPC 进行请求时,使用 protobuf 形式数据,构建方式如下:
```C
Request req;
......@@ -162,16 +170,19 @@ for (auto &name : fetch_name) {
Tensor *tensor = req.add_tensor();
...
```
2. Json 形式
当使用RESTful请求时,可以使用JSON形式数据,具体格式如下:
当使用 RESTful 请求时,可以使用 Json 形式数据,具体格式如下:
```JSON
```Json
{"tensor":[{"float_data":[0.0137,-0.1136,0.2553,-0.0692,0.0582,-0.0727,-0.1583,-0.0584,0.6283,0.4919,0.1856,0.0795,-0.0332],"elem_type":1,"name":"x","alias_name":"x","shape":[1,13]}],"fetch_var_names":["price"],"log_id":0}
```
## Response
Response为服务端返回给客户端的结果,包含了Tensor数据、错误码、错误信息等。定义如下:
**一.Response 定义**
Response 为服务端返回给客户端的结果,包含了 Tensor 数据、错误码、错误信息等。定义如下:
```protobuf
message Response {
......@@ -190,7 +201,7 @@ message ModelOutput {
}
```
- profile_time:当设置request->set_profile_server(true)时,会返回性能信息
- profile_time:当设置 request->set_profile_server(true) 时,会返回性能信息
- err_no:错误码,详见`core/predictor/common/constant.h`
- err_msg:错误信息,详见`core/predictor/common/constant.h`
- engine_name:输出节点名称
......@@ -203,19 +214,19 @@ message ModelOutput {
|-5002|"Paddle Serving Array Overflow Error."|
|-5100|"Paddle Serving Op Inference Error."|
### 读取Response数据
**二.读取 Response 数据**
```C
uint32_t model_num = res.outputs_size();
for (uint32_t m_idx = 0; m_idx < model_num; ++m_idx) {
std::string engine_name = output.engine_name();
int idx = 0;
// 读取tensor维度
// 读取 tensor 维度
int shape_size = output.tensor(idx).shape_size();
for (int i = 0; i < shape_size; ++i) {
shape[i] = output.tensor(idx).shape(i);
}
// 读取LOD信息
// 读取 LOD 信息
int lod_size = output.tensor(idx).lod_size();
if (lod_size > 0) {
lod.resize(lod_size);
......@@ -223,12 +234,12 @@ for (uint32_t m_idx = 0; m_idx < model_num; ++m_idx) {
lod[i] = output.tensor(idx).lod(i);
}
}
// 读取float数据
// 读取 float 数据
int size = output.tensor(idx).float_data_size();
float_data = std::vector<float>(
output.tensor(idx).float_data().begin(),
output.tensor(idx).float_data().begin() + size);
// 读取int8数据
// 读取 int8 数据
string_data = output.tensor(idx).tensor_content();
}
```
\ No newline at end of file
```
# Paddle Serving中的集成预测
(简体中文|[English](./Model_Ensemble_EN.md))
在一些场景中,可能使用多个相同输入的模型并行集成预测以获得更好的预测效果,Paddle Serving提供了这项功能。
下面将以文本分类任务为例,来展示Paddle Serving的集成预测功能(暂时还是串行预测,我们会尽快支持并行化)。
## 集成预测样例
该样例中(见下图),Server端在一项服务中并行预测相同输入的BOW和CNN模型,Client端获取两个模型的预测结果并进行后处理,得到最终的预测结果。
![simple example](../images/model_ensemble_example.png)
需要注意的是,目前只支持在同一个服务中使用多个相同格式输入输出的模型。在该例子中,CNN模型和BOW模型的输入输出格式是相同的。
样例中用到的代码保存在`examples/C++/imdb`路径下:
```shell
.
├── get_data.sh
├── imdb_reader.py
├── test_ensemble_client.py
└── test_ensemble_server.py
# 如何使用 C++ 定义模型组合
如果您的模型处理过程包含一个以上的模型推理环节(例如 OCR 一般需要 det+rec 两个环节),此时有两种做法可以满足您的需求。
1. 启动两个 Serving 服务(例如 Serving-det, Serving-rec),在您的 Client 中,读入数据——>det 前处理——>调用 Serving-det 预测——>det 后处理——>rec 前处理——>调用 Serving-rec 预测——>rec 后处理——>输出结果。
- 优点:无须改动 Paddle Serving 代码
- 缺点:需要两次请求服务,请求数据量越大,效率稍差。
2. 通过修改代码,自定义模型预测行为(自定义 OP),自定义服务处理的流程(自定义 DAG),将多个模型的组合处理过程(上述的 det 前处理——>调用 Serving-det 预测——>det 后处理——>rec 前处理——>调用 Serving-rec 预测——>rec 后处理)集成在一个 Serving 服务中。此时,在您的 Client 中,读入数据——>调用集成后的 Serving——>输出结果。
- 优点:只需要一次请求服务,效率高。
- 缺点:需要改动代码,且需要重新编译。
本文主要介绍自定义服务处理流程的方法,该方法的基本步骤如下:
1. 自定义 OP(即定义单个模型的前处理-模型预测-模型后处理)
2. 编译
3. 服务启动与调用
## 自定义 OP
一个 OP 定义了单个模型的前处理-模型预测-模型后处理,定义 OP 需要以下 2 步:
1. 定义 C++.h 头文件
2. 定义 C++.cpp 源文件
**一. 定义 C++.h 头文件**
复制下方的代码,将其中`/*自定义 Class 名称*/`更换为自定义的类名即可,如 `GeneralDetectionOp`
放置于 `core/general-server/op/` 路径下,文件名自定义即可,如 `general_detection_op.h`
``` C++
#pragma once
#include <string>
#include <vector>
#include "core/general-server/general_model_service.pb.h"
#include "core/general-server/op/general_infer_helper.h"
#include "paddle_inference_api.h" // NOLINT
namespace baidu {
namespace paddle_serving {
namespace serving {
class /*自定义Class名称*/
: public baidu::paddle_serving::predictor::OpWithChannel<GeneralBlob> {
public:
typedef std::vector<paddle::PaddleTensor> TensorVector;
DECLARE_OP(/*自定义Class名称*/);
int inference();
};
} // namespace serving
} // namespace paddle_serving
} // namespace baidu
```
### 数据准备
通过下面命令获取预训练的CNN和BOW模型(您也可以直接运行`get_data.sh`脚本):
```shell
wget --no-check-certificate https://fleet.bj.bcebos.com/text_classification_data.tar.gz
wget --no-check-certificate https://paddle-serving.bj.bcebos.com/imdb-demo/imdb_model.tar.gz
tar -zxvf text_classification_data.tar.gz
tar -zxvf imdb_model.tar.gz
**二. 定义 C++.cpp 源文件**
复制下方的代码,将其中`/*自定义 Class 名称*/`更换为自定义的类名,如 `GeneralDetectionOp`
将前处理和后处理的代码添加在下方的代码中注释的前处理和后处理的位置。
放置于 `core/general-server/op/` 路径下,文件名自定义即可,如 `general_detection_op.cpp`
``` C++
#include "core/general-server/op/自定义的头文件名"
#include <algorithm>
#include <iostream>
#include <memory>
#include <sstream>
#include "core/predictor/framework/infer.h"
#include "core/predictor/framework/memory.h"
#include "core/predictor/framework/resource.h"
#include "core/util/include/timer.h"
namespace baidu {
namespace paddle_serving {
namespace serving {
using baidu::paddle_serving::Timer;
using baidu::paddle_serving::predictor::MempoolWrapper;
using baidu::paddle_serving::predictor::general_model::Tensor;
using baidu::paddle_serving::predictor::general_model::Response;
using baidu::paddle_serving::predictor::general_model::Request;
using baidu::paddle_serving::predictor::InferManager;
using baidu::paddle_serving::predictor::PaddleGeneralModelConfig;
int /*自定义Class名称*/::inference() {
//获取前置OP节点
const std::vector<std::string> pre_node_names = pre_names();
if (pre_node_names.size() != 1) {
LOG(ERROR) << "This op(" << op_name()
<< ") can only have one predecessor op, but received "
<< pre_node_names.size();
return -1;
}
const std::string pre_name = pre_node_names[0];
//将前置OP的输出,作为本OP的输入。
GeneralBlob *input_blob = mutable_depend_argument<GeneralBlob>(pre_name);
if (!input_blob) {
LOG(ERROR) << "input_blob is nullptr,error";
return -1;
}
TensorVector *in = &input_blob->tensor_vector;
uint64_t log_id = input_blob->GetLogId();
int batch_size = input_blob->_batch_size;
//初始化本OP的输出。
GeneralBlob *output_blob = mutable_data<GeneralBlob>();
output_blob->SetLogId(log_id);
output_blob->_batch_size = batch_size;
VLOG(2) << "(logid=" << log_id << ") infer batch size: " << batch_size;
TensorVector *out = &output_blob->tensor_vector;
//前处理的代码添加在此处,前处理直接修改上文的TensorVector* in
//注意in里面的数据是前置节点的输出经过后处理后的out中的数据
Timer timeline;
int64_t start = timeline.TimeStampUS();
timeline.Start();
// 将前处理后的in,初始化的out传入,进行模型预测,模型预测的输出会直接修改out指向的内存中的数据
// 如果您想定义一个不需要模型调用,只进行数据处理的OP,删除下面这一部分的代码即可。
if (InferManager::instance().infer(
engine_name().c_str(), in, out, batch_size)) {
LOG(ERROR) << "(logid=" << log_id
<< ") Failed do infer in fluid model: " << engine_name().c_str();
return -1;
}
//后处理的代码添加在此处,后处理直接修改上文的TensorVector* out
//后处理后的out会被传递给后续的节点
int64_t end = timeline.TimeStampUS();
CopyBlobInfo(input_blob, output_blob);
AddBlobInfo(output_blob, start);
AddBlobInfo(output_blob, end);
return 0;
}
DEFINE_OP(/*自定义Class名称*/);
} // namespace serving
} // namespace paddle_serving
} // namespace baidu
```
### 启动Server
通过下面的Python代码启动Server端(您也可以直接运行`test_ensemble_server.py`脚本):
```python
from paddle_serving_server import OpMaker
from paddle_serving_server import OpGraphMaker
from paddle_serving_server import Server
op_maker = OpMaker()
read_op = op_maker.create('GeneralReaderOp')
cnn_infer_op = op_maker.create(
'GeneralInferOp', engine_name='cnn', inputs=[read_op])
bow_infer_op = op_maker.create(
'GeneralInferOp', engine_name='bow', inputs=[read_op])
response_op = op_maker.create(
'GeneralResponseOp', inputs=[cnn_infer_op, bow_infer_op])
op_graph_maker = OpGraphMaker()
op_graph_maker.add_op(read_op)
op_graph_maker.add_op(cnn_infer_op)
op_graph_maker.add_op(bow_infer_op)
op_graph_maker.add_op(response_op)
server = Server()
server.set_op_graph(op_graph_maker.get_op_graph())
model_config = {cnn_infer_op: 'imdb_cnn_model', bow_infer_op: 'imdb_bow_model'}
server.load_model_config(model_config)
server.prepare_server(workdir="work_dir1", port=9393, device="cpu")
server.run_server()
1. TensorVector数据结构
TensorVector* in 和 out 都是一个 TensorVector 类型的指针,其使用方法跟 Paddle C++ API 中的 Tensor 几乎一样,相关的数据结构如下所示
``` C++
//TensorVector
typedef std::vector<paddle::PaddleTensor> TensorVector;
//paddle::PaddleTensor
struct PD_INFER_DECL PaddleTensor {
PaddleTensor() = default;
std::string name; ///< variable name.
std::vector<int> shape;
PaddleBuf data; ///< blob of data.
PaddleDType dtype;
std::vector<std::vector<size_t>> lod; ///< Tensor+LoD equals LoDTensor
};
//PaddleBuf
class PD_INFER_DECL PaddleBuf {
public:
explicit PaddleBuf(size_t length)
: data_(new char[length]), length_(length), memory_owned_(true) {}
PaddleBuf(void* data, size_t length)
: data_(data), length_(length), memory_owned_{false} {}
explicit PaddleBuf(const PaddleBuf& other);
void Resize(size_t length);
void Reset(void* data, size_t length);
bool empty() const { return length_ == 0; }
void* data() const { return data_; }
size_t length() const { return length_; }
~PaddleBuf() { Free(); }
PaddleBuf& operator=(const PaddleBuf&);
PaddleBuf& operator=(PaddleBuf&&);
PaddleBuf() = default;
PaddleBuf(PaddleBuf&& other);
private:
void Free();
void* data_{nullptr}; ///< pointer to the data memory.
size_t length_{0}; ///< number of memory bytes.
bool memory_owned_{true};
};
```
与普通预测服务不同的是,这里我们需要用DAG来描述Server端的运行逻辑。
2. TensorVector 代码示例
```C++
/*例如,你想访问输入数据中的第1个Tensor*/
paddle::PaddleTensor& tensor_1 = in->at(0);
/*例如,你想修改输入数据中的第1个Tensor的名称*/
tensor_1.name = "new name";
/*例如,你想获取输入数据中的第1个Tensor的shape信息*/
std::vector<int> tensor_1_shape = tensor_1.shape;
/*例如,你想修改输入数据中的第1个Tensor中的数据*/
void* data_1 = tensor_1.data.data();
//后续直接修改data_1指向的内存即可
//比如,当您的数据是int类型,将void*转换为int*进行处理即可
```
在创建Op的时候需要指定当前Op的前继(在该例子中,`cnn_infer_op``bow_infer_op`的前继均是`read_op``response_op`的前继是`cnn_infer_op``bow_infer_op`),对于预测Op`infer_op`还需要定义预测引擎名称`engine_name`(也可以使用默认值,建议设置该值方便Client端获取预测结果)。
同时在配置模型路径时,需要以预测Op为key,对应的模型路径为value,创建模型配置字典,来告知Serving每个预测Op使用哪个模型。
## 修改后编译
此时,需要您重新编译生成 serving,并通过 `export SERVING_BIN` 设置环境变量来指定使用您编译生成的 serving 二进制文件,并通过 `pip3 install` 的方式安装相关 python 包,细节请参考[如何编译Serving](2-3_Compile_CN.md)
### 启动Client
## 服务启动与调用
通过下面的Python代码运行Client端(您也可以直接运行`test_ensemble_client.py`脚本):
**一. Server 端启动**
在前面两个小节工作做好的基础上,一个服务启动两个模型串联,只需要在`--model 后依次按顺序传入模型文件夹的相对路径`,且需要在`--op 后依次传入自定义 C++OP 类名称`,其中--model 后面的模型与--op 后面的类名称的顺序需要对应,`这里假设我们已经定义好了两个 OP 分别为 GeneralDetectionOp 和 GeneralRecOp`,则脚本代码如下:
```python
from paddle_serving_client import Client
from imdb_reader import IMDBDataset
client = Client()
# If you have more than one model, make sure that the input
# and output of more than one model are the same.
client.load_client_config('imdb_bow_client_conf/serving_client_conf.prototxt')
client.connect(["127.0.0.1:9393"])
# you can define any english sentence or dataset here
# This example reuses imdb reader in training, you
# can define your own data preprocessing easily.
imdb_dataset = IMDBDataset()
imdb_dataset.load_resource('imdb.vocab')
for i in range(3):
line = 'i am very sad | 0'
word_ids, label = imdb_dataset.get_words_and_label(line)
feed = {"words": word_ids}
fetch = ["acc", "cost", "prediction"]
fetch_maps = client.predict(feed=feed, fetch=fetch)
if len(fetch_maps) == 1:
print("step: {}, res: {}".format(i, fetch_maps['prediction'][0][1]))
else:
for model, fetch_map in fetch_maps.items():
print("step: {}, model: {}, res: {}".format(i, model, fetch_map[
'prediction'][0][1]))
#一个服务启动多模型串联
python3 -m paddle_serving_server.serve --model ocr_det_model ocr_rec_model --op GeneralDetectionOp GeneralRecOp --port 9292
#多模型串联 ocr_det_model 对应 GeneralDetectionOp ocr_rec_model 对应 GeneralRecOp
```
Client端与普通预测服务没有发生太大的变化。当使用多个模型预测时,预测服务将返回一个key为Server端定义的引擎名称`engine_name`,value为对应的模型预测结果的字典。
**二. Client 端调用**
### 预期结果
```txt
step: 0, model: cnn, res: 0.560272455215
step: 0, model: bow, res: 0.633530199528
step: 1, model: cnn, res: 0.560272455215
step: 1, model: bow, res: 0.633530199528
step: 2, model: cnn, res: 0.560272455215
step: 2, model: bow, res: 0.633530199528
此时,Client 端的调用,也需要传入两个 Client 端的 proto 文件或文件夹的路径,以 OCR 为例,可以参考[ocr_cpp_client.py](../../examples/C++/PaddleOCR/ocr/ocr_cpp_client.py)来自行编写您的脚本,此时 Client 调用如下:
```python
#一个服务启动多模型串联
python3 自定义.py ocr_det_client ocr_rec_client
#ocr_det_client为第一个模型的Client端proto文件夹的相对路径
#ocr_rec_client为第二个模型的Client端proto文件夹的相对路径
```
此时,对于 Server 端而言,输入的数据的格式与`第一个模型的 Client 端 proto 格式`定义的一致,输出的数据格式与`最后一个模型的 Client 端 proto`文件一致。一般情况下您无须关注此事,当您需要了解详细的proto的定义,请参考[Serving 配置](5-3_Serving_Configure_CN.md)
# 如何开发一个新的General Op?
(简体中文|[English](./OP_EN.md))
- [定义一个Op](#1)
- [在Op之间使用 `GeneralBlob`](#2)
- [2.1 实现 `int Inference()`](#2.1)
- [定义 Python API](#3)
在本文档中,我们主要集中于如何为Paddle Serving开发新的服务器端运算符。 在开始编写新运算符之前,让我们看一些示例代码以获得为服务器编写新运算符的基本思想。 我们假设您已经知道Paddle Serving服务器端的基本计算逻辑。 下面的代码您可以在 Serving代码库下的 `core/general-server/op` 目录查阅。
在本文档中,我们主要集中于如何为 Paddle Serving 开发新的服务器端运算符。在开始编写新运算符之前,让我们看一些示例代码以获得为服务器编写新运算符的基本思想。我们假设您已经知道 Paddle Serving 服务器端的基本计算逻辑。 下面的代码您可以在 Serving代码库下的 `core/general-server/op` 目录查阅。
``` c++
// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#pragma once
#include <string>
#include <vector>
#ifdef BCLOUD
#ifdef WITH_GPU
#include "paddle/paddle_inference_api.h"
#else
#include "paddle/fluid/inference/api/paddle_inference_api.h"
#endif
#else
#include "paddle_inference_api.h" // NOLINT
#endif
#include "core/general-server/general_model_service.pb.h"
#include "core/general-server/op/general_infer_helper.h"
......@@ -54,14 +36,17 @@ class GeneralInferOp
} // namespace paddle_serving
} // namespace baidu
```
<a name="1"></a>
## 定义一个Op
上面的头文件声明了一个名为`GeneralInferOp`的PaddleServing运算符。 在运行时,将调用函数 `int inference()`。 通常,我们将服务器端运算符定义为baidu::paddle_serving::predictor::OpWithChannel的子类,并使用 `GeneralBlob` 数据结构。
上面的头文件声明了一个名为 `GeneralInferOp` 的 Paddle Serving 运算符。 在运行时,将调用函数 `int inference()`。 通常,我们将服务器端运算符定义为baidu::paddle_serving::predictor::OpWithChannel 的子类,并使用 `GeneralBlob` 数据结构。
<a name="2"></a>
## 在Op之间使用 `GeneralBlob`
`GeneralBlob` 是一种可以在服务器端运算符之间使用的数据结构。 `tensor_vector``GeneralBlob`中最重要的数据结构。 服务器端的操作员可以将多个`paddle::PaddleTensor`作为输入,并可以将多个`paddle::PaddleTensor`作为输出。 特别是,`tensor_vector`可以在没有内存拷贝的操作下输入到Paddle推理引擎中。
`GeneralBlob` 是一种可以在服务器端运算符之间使用的数据结构。 `tensor_vector``GeneralBlob` 中最重要的数据结构。 服务器端的操作员可以将多个 `paddle::PaddleTensor` 作为输入,并可以将多个 `paddle::PaddleTensor `作为输出。 特别是,`tensor_vector` 可以在没有内存拷贝的操作下输入到 Paddle 推理引擎中。
``` c++
struct GeneralBlob {
......@@ -86,7 +71,9 @@ struct GeneralBlob {
};
```
### 实现 `int Inference()`
<a name="2.1"></a>
**一. 实现 `int Inference()`**
``` c++
int GeneralInferOp::inference() {
......@@ -127,14 +114,13 @@ int GeneralInferOp::inference() {
DEFINE_OP(GeneralInferOp);
```
`input_blob``output_blob` 都有很多的 `paddle::PaddleTensor`, 且Paddle预测库会被 `InferManager::instance().infer(engine_name().c_str(), in, out, batch_size)`调用。此函数中的其他大多数代码都与性能分析有关,将来我们也可能会删除多余的代码。
`input_blob``output_blob` 都有很多的 `paddle::PaddleTensor`, 且 Paddle 预测库会被 `InferManager::instance().infer(engine_name().c_str(), in, out, batch_size)` 调用。此函数中的其他大多数代码都与性能分析有关,将来我们也可能会删除多余的代码。
基本上,以上代码可以实现一个新的运算符。如果您想访问字典资源,可以参考`core/predictor/framework/resource.cpp`来添加全局可见资源。资源的初始化在启动服务器的运行时执行。
<a name="3"></a>
## 定义 Python API
在服务器端为Paddle Serving定义C++运算符后,最后一步是在Python API中为Paddle Serving服务器API添加注册, `python/paddle_serving_server/dag.py`文件里有关于API注册的代码如下
在服务器端为 Paddle Serving 定义 C++ 运算符后,最后一步是在 Python API 中为 Paddle Serving 服务器 API 添加注册, `python/paddle_serving_server/dag.py` 文件里有关于 API 注册的代码如下
``` python
......@@ -152,7 +138,7 @@ self.op_list = [
]
```
`python/paddle_serving_server/server.py`文件中仅添加`需要加载模型,执行推理预测的自定义的C++OP类的类名`。例如`GeneralReaderOp`由于只是做一些简单的数据处理而不加载模型调用预测,故在👆的代码中需要添加,而不添加在👇的代码中。
`python/paddle_serving_server/server.py` 文件中仅添加`需要加载模型,执行推理预测的自定义的 C++ OP 类的类名`。例如 `GeneralReaderOp` 由于只是做一些简单的数据处理而不加载模型调用预测,故在上述的代码中需要添加,而不添加在下方的代码中。
``` python
default_engine_types = [
'GeneralInferOp',
......
# C++ Serving性能分析与优化
# 1.背景知识介绍
## 背景知识介绍
1) 首先,应确保您知道C++ Serving常用的一些[功能特点](./Introduction_CN.md)[C++ Serving 参数配置和启动的详细说明](../Serving_Configure_CN.md)
2) 关于C++ Serving框架本身的性能分析和介绍,请参考[C++ Serving框架性能测试](./Frame_Performance_CN.md)
3) 您需要对您使用的模型、机器环境、需要部署上线的业务有一些了解,例如,您使用CPU还是GPU进行预测;是否可以开启TRT进行加速;你的机器CPU是多少core的;您的业务包含几个模型;每个模型的输入和输出需要做些什么处理;您业务的最大线上流量是多少;您的模型支持的最大输入batch是多少等等.
......
# Request Cache
# 请求缓存
本文主要介绍请求缓存功能及实现原理。
服务中请求由张量tensor、结果名称fetch_var_names、调试开关profile_server、标识码log_id组成,预测结果包含输出张量等。这里缓存会保存请求与结果的键值对。当请求命中缓存时,服务不会执行模型预测,而是会直接从缓存中提取结果。对于某些特定场景而言,这能显著降低请求耗时。
## 基本原理
缓存可以通过设置`--request_cache_size`来开启。该标志默认为0,即不开启缓存。当设置非零值时,服务会以设置大小为存储上限开启缓存。这里设置的内存单位为字节。注意,如果设置`--request_cache_size`为0是不能开启缓存的
服务中请求由张量 tensor、结果名称 fetch_var_names、调试开关 profile_server、标识码 log_id 组成,预测结果包含输出张量 tensor 等。这里缓存会保存请求与结果的键值对。当请求命中缓存时,服务不会执行模型预测,而是会直接从缓存中提取结果。对于某些特定场景而言,这能显著降低请求耗时
缓存中的键为64位整形数,是由请求中的tensor和fetch_var_names数据生成的128位哈希值。如果请求命中,那么对应的处理结果会提取出来用于构建响应数据。如果请求没有命中,服务则会执行模型预测,在返回结果的同时将处理结果放入缓存中。由于缓存设置了存储上限,因此需要淘汰机制来限制缓存容量。当前,服务采用了最近最少使用(LRU)机制用于淘汰缓存数据。
缓存可以通过设置`--request_cache_size`来开启。该标志默认为 0,即不开启缓存。当设置非零值时,服务会以设置大小为存储上限开启缓存。这里设置的内存单位为字节。注意,如果设置`--request_cache_size`为 0 是不能开启缓存的。
缓存中的键为 64 位整形数,是由请求中的 tensor 和 fetch_var_names 数据生成的 64 位哈希值。如果请求命中,那么对应的处理结果会提取出来用于构建响应数据。如果请求没有命中,服务则会执行模型预测,在返回结果的同时将处理结果放入缓存中。由于缓存设置了存储上限,因此需要淘汰机制来限制缓存容量。当前,服务采用了最近最少使用(LRU)机制用于淘汰缓存数据。
## 注意事项
- 只有预测成功的请求会进行缓存。如果请求失败或者在预测过程中返回错误,则处理结果不会缓存。
- 缓存是基于请求数据的哈希值实现。因此,可能会出现两个不同的请求生成了相同的哈希值即哈希碰撞,这时服务可能会返回错误的响应数据。哈希值为64位数据,发生哈希碰撞的可能性较小。
- 缓存是基于请求数据的哈希值实现。因此,可能会出现两个不同的请求生成了相同的哈希值即哈希碰撞,这时服务可能会返回错误的响应数据。哈希值为 64 位数据,发生哈希碰撞的可能性较小。
- 不论使用同步模式还是异步模式,均可以正常使用缓存功能。
......@@ -38,18 +38,17 @@
推荐使用Docker编译,我们已经为您准备好了Paddle Serving编译环境并配置好了上述编译依赖,详见[该文档](Docker_Images_CN.md)
我们提供了五个环境的开发镜像,分别是CPU, CUDA10.1+CUDNN7, CUDA10.2+CUDNN7,CUDA10.2+CUDNN8, CUDA11.2+CUDNN8。我们提供了Serving开发镜像涵盖以上环境。与此同时,我们也支持Paddle开发镜像。
我们提供了五个环境的开发镜像,分别是CPU、 CUDA10.1+CUDNN7、CUDA10.2+CUDNN8、 CUDA11.2+CUDNN8。我们提供了Serving开发镜像涵盖以上环境。与此同时,我们也支持Paddle开发镜像。
Serving开发镜像是Serving套件为了支持各个预测环境提供的用于编译、调试预测服务的镜像,Paddle开发镜像是Paddle在官网发布的用于编译、开发、训练模型使用镜像。为了让Paddle开发者能够在同一个容器内直接使用Serving。对于上个版本就已经使用Serving用户的开发者来说,Serving开发镜像应该不会感到陌生。但对于熟悉Paddle训练框架生态的开发者,目前应该更熟悉已有的Paddle开发镜像。为了适应所有用户的不同习惯,我们对这两套镜像都做了充分的支持。
| 环境 | Serving开发镜像Tag | 操作系统 | Paddle开发镜像Tag | 操作系统 |
| :--------------------------: | :-------------------------------: | :-------------: | :-------------------: | :----------------: |
| CPU | 0.8.0-devel | Ubuntu 16.04 | 2.2.2 | Ubuntu 18.04. |
| CUDA10.1 + CUDNN7 | 0.8.0-cuda10.1-cudnn7-devel | Ubuntu 16.04 | 无 | 无 |
| CUDA10.2 + CUDNN7 | 0.8.0-cuda10.2-cudnn7-devel | Ubuntu 16.04 | 2.2.2-gpu-cuda10.2-cudnn7 | Ubuntu 16.04 |
| CUDA10.2 + CUDNN8 | 0.8.0-cuda10.2-cudnn8-devel | Ubuntu 16.04 | 无 | 无 |
| CUDA11.2 + CUDNN8 | 0.8.0-cuda11.2-cudnn8-devel | Ubuntu 16.04 | 2.2.2-gpu-cuda11.2-cudnn8 | Ubuntu 18.04 |
| CPU | 0.9.0-devel | Ubuntu 16.04 | 2.3.0 | Ubuntu 18.04. |
| CUDA10.1 + CUDNN7 | 0.9.0-cuda10.1-cudnn7-devel | Ubuntu 16.04 | 无 | 无 |
| CUDA10.2 + CUDNN8 | 0.9.0-cuda10.2-cudnn8-devel | Ubuntu 16.04 | 无 | 无 |
| CUDA11.2 + CUDNN8 | 0.9.0-cuda11.2-cudnn8-devel | Ubuntu 16.04 | 2.3.0-gpu-cuda11.2-cudnn8 | Ubuntu 18.04 |
我们首先要针对自己所需的环境拉取相关镜像。上表**环境**一列下,除了CPU,其余(Cuda**+Cudnn**)都属于GPU环境。
您可以使用Serving开发镜像。
......
......@@ -37,17 +37,16 @@ In addition, for some C++ secondary development scenarios, we also provide OPENC
Docker compilation is recommended. We have prepared the Paddle Serving compilation environment for you and configured the above compilation dependencies. For details, please refer to [this document](DOCKER_IMAGES_CN.md).
We provide five environment development images, namely CPU, CUDA10.1 + CUDNN7, CUDA10.2 + CUDNN7, CUDA10.2 + CUDNN8, CUDA11.2 + CUDNN8. We provide a Serving development image to cover the above environment. At the same time, we also support Paddle development mirroring.
We provide 4 environment development images, namely CPU, CUDA10.2 + CUDNN7, CUDA10.2 + CUDNN8, CUDA11.2 + CUDNN8. We provide a Serving development image to cover the above environment. At the same time, we also support Paddle development mirroring.
Serving development mirror is the mirror used to compile and debug prediction services provided by Serving suite in order to support various prediction environments. Paddle development mirror is the mirror used for compilation, development, and training models released by Paddle on the official website. In order to allow Paddle developers to use Serving directly in the same container. For developers who have already used Serving users in the previous version, Serving development image should not be unfamiliar. But for developers who are familiar with the Paddle training framework ecology, they should be more familiar with the existing Paddle development mirrors. In order to adapt to the different habits of all users, we have fully supported both sets of mirrors.
| Environment | Serving Dev Image Tag | OS | Paddle Dev Image Tag | OS |
| :--------------------------: | :-------------------------------: | :-------------: | :-------------------: | :----------------: |
| CPU | 0.8.0-devel | Ubuntu 16.04 | 2.2.2 | Ubuntu 18.04. |
| CUDA10.1 + Cudnn7 | 0.8.0-cuda10.1-cudnn7-devel | Ubuntu 16.04 | Nan | Nan |
| CUDA10.2 + Cudnn7 | 0.8.0-cuda10.2-cudnn7-devel | Ubuntu 16.04 | 2.2.2-gpu-cuda10.2-cudnn7 | Ubuntu 16.04 |
| CUDA10.2 + Cudnn8 | 0.8.0-cuda10.2-cudnn8-devel | Ubuntu 16.04 | Nan | Nan |
| CUDA11.2 + Cudnn8 | 0.8.0-cuda11.2-cudnn8-devel | Ubuntu 16.04 | 2.2.2-gpu-cuda11.2-cudnn8 | Ubuntu 18.04 |
| CPU | 0.9.0-devel | Ubuntu 16.04 | 2.3.0 | Ubuntu 18.04. |
| CUDA10.1 + Cudnn7 | 0.9.0-cuda10.1-cudnn7-devel | Ubuntu 16.04 | Nan | Nan |
| CUDA10.2 + Cudnn8 | 0.9.0-cuda10.2-cudnn8-devel | Ubuntu 16.04 | Nan | Nan |
| CUDA11.2 + Cudnn8 | 0.9.0-cuda11.2-cudnn8-devel | Ubuntu 16.04 | 2.3.0-gpu-cuda11.2-cudnn8 | Ubuntu 18.04 |
We first need to pull related images for the environment we need. Under the **Environment** column in the above table, except for the CPU, the rest (Cuda**+Cudnn**) belong to the GPU environment.
......
......@@ -26,7 +26,7 @@
## 镜像说明
若需要基于源代码二次开发编译,请使用后缀为-devel的版本。
**在TAG列,0.8.0也可以替换成对应的版本号,例如0.5.0/0.4.1等,但需要注意的是,部分开发环境随着某个版本迭代才增加,因此并非所有环境都有对应的版本号可以使用。**
**在TAG列,0.9.0也可以替换成对应的版本号,例如0.5.0/0.4.1等,但需要注意的是,部分开发环境随着某个版本迭代才增加,因此并非所有环境都有对应的版本号可以使用。**
**开发镜像:**
......@@ -34,12 +34,11 @@
| 镜像选择 | 操作系统 | TAG | Dockerfile |
| :----------------------------------------------------------: | :-----: | :--------------------------: | :----------------------------------------------------------: |
| CPU development | Ubuntu16 | 0.8.0-devel | [Dockerfile.devel](../tools/Dockerfile.devel) |
| GPU (cuda10.1-cudnn7-tensorRT6-gcc54) development | Ubuntu16 | 0.8.0-cuda10.1-cudnn7-gcc54-devel (not ready) | [Dockerfile.cuda10.1-cudnn7-gcc54.devel](../tools/Dockerfile.cuda10.1-cudnn7-gcc54.devel) |
| GPU (cuda10.1-cudnn7-tensorRT6) development | Ubuntu16 | 0.8.0-cuda10.1-cudnn7-devel | [Dockerfile.cuda10.1-cudnn7.devel](../tools/Dockerfile.cuda10.1-cudnn7.devel) |
| GPU (cuda10.2-cudnn7-tensorRT6) development | Ubuntu16 | 0.8.0-cuda10.2-cudnn7-devel | [Dockerfile.cuda10.2-cudnn7.devel](../tools/Dockerfile.cuda10.2-cudnn7.devel) |
| GPU (cuda10.2-cudnn8-tensorRT7) development | Ubuntu16 | 0.8.0-cuda10.2-cudnn8-devel | [Dockerfile.cuda10.2-cudnn8.devel](../tools/Dockerfile.cuda10.2-cudnn8.devel) |
| GPU (cuda11.2-cudnn8-tensorRT8) development | Ubuntu16 | 0.8.0-cuda11.2-cudnn8-devel | [Dockerfile.cuda11.2-cudnn8.devel](../tools/Dockerfile.cuda11.2-cudnn8.devel) |
| CPU development | Ubuntu16 | 0.9.0-devel | [Dockerfile.devel](../tools/Dockerfile.devel) |
| GPU (cuda10.1-cudnn7-tensorRT6) development | Ubuntu16 | 0.9.0-cuda10.1-cudnn7-devel | [Dockerfile.cuda10.1-cudnn7.devel](../tools/Dockerfile.cuda10.1-cudnn7.devel) |
| GPU (cuda10.2-cudnn7-tensorRT6) development | Ubuntu16 | 0.9.0-cuda10.2-cudnn7-devel | [Dockerfile.cuda10.2-cudnn7.devel](../tools/Dockerfile.cuda10.2-cudnn7.devel)
| GPU (cuda10.2-cudnn8-tensorRT7) development | Ubuntu16 | 0.9.0-cuda10.2-cudnn8-devel | [Dockerfile.cuda10.2-cudnn8.devel](../tools/Dockerfile.cuda10.2-cudnn8.devel) |
| GPU (cuda11.2-cudnn8-tensorRT8) development | Ubuntu16 | 0.9.0-cuda11.2-cudnn8-devel | [Dockerfile.cuda11.2-cudnn8.devel](../tools/Dockerfile.cuda11.2-cudnn8.devel) |
**运行镜像:**
......@@ -48,15 +47,16 @@
| Env | Version | Docker images tag | OS | Gcc Version | Size |
|----------|---------|------------------------------|-----------|-------------|------|
| CPU | 0.8.0 | 0.8.0-runtime | Ubuntu 16 | 8.2.0 | 3.9 GB |
| Cuda10.1 | 0.8.0 | 0.8.0-cuda10.1-cudnn7-runtime | Ubuntu 16 | 8.2.0 | 10 GB |
| Cuda10.2 | 0.8.0 | 0.8.0-cuda10.2-cudnn8-runtime | Ubuntu 16 | 8.2.0 | 10.1 GB |
| Cuda11.2 | 0.8.0 | 0.8.0-cuda11.2-cudnn8-runtime| Ubuntu 16 | 8.2.0 | 14.2 GB |
| CPU | 0.9.0 | 0.9.0-runtime | Ubuntu 16 | 8.2.0 | 3.9 GB |
| CUDA 10.1 + cuDNN 7 | 0.9.0 | 0.9.0-cuda10.1-cudnn7-runtime | Ubuntu 16 | 8.2.0 | 10 GB |
| CUDA 10.2 + cuDNN 7 | 0.9.0 | 0.9.0-cuda10.2-cudnn7-runtime | Ubuntu 16 | 8.2.0 | 10.1 GB |
| CUDA 10.2 + cuDNN 8 | 0.9.0 | 0.9.0-cuda10.2-cudnn8-runtime | Ubuntu 16 | 8.2.0 | 10.1 GB |
| CUDA 11.2 + cuDNN 8 | 0.9.0 | 0.9.0-cuda11.2-cudnn8-runtime | Ubuntu 16 | 8.2.0 | 14.2 GB |
**Java镜像:**
```
registry.baidubce.com/paddlepaddle/serving:0.8.0-cuda10.2-java
registry.baidubce.com/paddlepaddle/serving:0.9.0-cuda10.2-cudnn8-java
```
**XPU镜像:**
......
......@@ -28,10 +28,8 @@ You can get images in two ways:
If you want to customize your Serving based on source code, use the version with the suffix - devel.
**cuda10.1-cudnn7-gcc54 image is not ready, you should run from dockerfile if you need it.**
If you need to develop and compile based on the source code, please use the version with the suffix -devel.
**In the TAG column, 0.8.0 can also be replaced with the corresponding version number, such as 0.5.0/0.4.1, etc., but it should be noted that some development environments only increase with a certain version iteration, so not all environments All have the corresponding version number can be used.**
**In the TAG column, 0.9.0 can also be replaced with the corresponding version number, such as 0.5.0/0.4.1, etc., but it should be noted that some development environments only increase with a certain version iteration, so not all environments All have the corresponding version number can be used.**
**Development Docker Images:**
......@@ -39,12 +37,11 @@ A variety of development tools are installed in the development image, which can
| Description | OS | TAG | Dockerfile |
| :----------------------------------------------------------: | :-----: | :--------------------------: | :----------------------------------------------------------: |
| CPU development | Ubuntu16 | 0.8.0-devel | [Dockerfile.devel](../tools/Dockerfile.devel) |
| GPU (cuda10.1-cudnn7-tensorRT6-gcc54) development | Ubuntu16 | 0.8.0-cuda10.1-cudnn7-gcc54-devel (not ready) | [Dockerfile.cuda10.1-cudnn7-gcc54.devel](../tools/Dockerfile.cuda10.1-cudnn7-gcc54.devel) |
| GPU (cuda10.1-cudnn7-tensorRT6) development | Ubuntu16 | 0.8.0-cuda10.1-cudnn7-devel | [Dockerfile.cuda10.1-cudnn7.devel](../tools/Dockerfile.cuda10.1-cudnn7.devel) |
| GPU (cuda10.2-cudnn7-tensorRT6) development | Ubuntu16 | 0.8.0-cuda10.2-cudnn7-devel | [Dockerfile.cuda10.2-cudnn7.devel](../tools/Dockerfile.cuda10.2-cudnn7.devel) |
| GPU (cuda10.2-cudnn8-tensorRT7) development | Ubuntu16 | 0.8.0-cuda10.2-cudnn8-devel | [Dockerfile.cuda10.2-cudnn8.devel](../tools/Dockerfile.cuda10.2-cudnn8.devel) |
| GPU (cuda11.2-cudnn8-tensorRT8) development | Ubuntu16 | 0.8.0-cuda11.2-cudnn8-devel | [Dockerfile.cuda11.2-cudnn8.devel](../tools/Dockerfile.cuda11.2-cudnn8.devel) |
| CPU development | Ubuntu16 | 0.9.0-devel | [Dockerfile.devel](../tools/Dockerfile.devel) |
| GPU (cuda10.1-cudnn7-tensorRT6) development | Ubuntu16 | 0.9.0-cuda10.1-cudnn7-devel | [Dockerfile.cuda10.1-cudnn7.devel](../tools/Dockerfile.cuda10.1-cudnn7.devel) |
| GPU (cuda10.2-cudnn7-tensorRT6) development | Ubuntu16 | 0.9.0-cuda10.2-cudnn7-devel | [Dockerfile.cuda10.2-cudnn7.devel](../tools/Dockerfile.cuda10.2-cudnn7.devel)
| GPU (cuda10.2-cudnn8-tensorRT7) development | Ubuntu16 | 0.9.0-cuda10.2-cudnn8-devel | [Dockerfile.cuda10.2-cudnn8.devel](../tools/Dockerfile.cuda10.2-cudnn8.devel) |
| GPU (cuda11.2-cudnn8-tensorRT8) development | Ubuntu16 | 0.9.0-cuda11.2-cudnn8-devel | [Dockerfile.cuda11.2-cudnn8.devel](../tools/Dockerfile.cuda11.2-cudnn8.devel) |
**Runtime Docker Images:**
......@@ -53,14 +50,15 @@ Runtime Docker Images is lighter than Develop Images, and Running Images are mad
| Env | Version | Docker images tag | OS | Gcc Version | Size |
|----------|---------|------------------------------|-----------|-------------|------|
| CPU | 0.8.0 | 0.8.0-runtime | Ubuntu 16 | 8.2.0 | 3.9 GB |
| Cuda10.1 | 0.8.0 | 0.8.0-cuda10.1-cudnn7-runtime | Ubuntu 16 | 8.2.0 | 10 GB |
| Cuda10.2 | 0.8.0 | 0.8.0-cuda10.2-cudnn8-runtime | Ubuntu 16 | 8.2.0 | 10.1 GB |
| Cuda11.2 | 0.8.0 | 0.8.0-cuda11.2-cudnn8-runtime| Ubuntu 16 | 8.2.0 | 14.2 GB |
| CPU | 0.9.0 | 0.9.0-runtime | Ubuntu 16 | 8.2.0 | 3.9 GB |
| CUDA 10.1 + cuDNN 7 | 0.9.0 | 0.9.0-cuda10.1-cudnn7-runtime | Ubuntu 16 | 8.2.0 | 10 GB |
| CUDA 10.2 + cuDNN 7 | 0.9.0 | 0.9.0-cuda10.2-cudnn7-runtime | Ubuntu 16 | 8.2.0 | 10.1 GB |
| CUDA 10.2 + cuDNN 8 | 0.9.0 | 0.9.0-cuda10.2-cudnn8-runtime | Ubuntu 16 | 8.2.0 | 10.1
| CUDA 11.2 + cuDNN 8 | 0.9.0 | 0.9.0-cuda11.2-cudnn8-runtime | Ubuntu 16 | 8.2.0 | 14.2 GB |
**Java SDK Docker Image:**
```
registry.baidubce.com/paddlepaddle/serving:0.8.0-cuda10.2-java
registry.baidubce.com/paddlepaddle/serving:0.9.0-cuda10.2-cudnn8-java
```
**XPU Docker Images:**
......
# FAQ
# 常见问题与解答
常见问题解答分为8大类问题:
- [版本升级问题](#1)
- [基础知识](#2)
- [安装问题](#3)
- [编译问题](#4)
- [环境问题](#5)
- [部署问题](#6)
- [预测问题](#7)
- [日志排查](#8)
<a name="1"></a>
## 版本升级问题
#### Q: 从v0.6.x升级到v0.7.0版本时,运行Python Pipeline程序时报错信息如下:
#### Q: 从 `v0.6.x` 升级到 `v0.7.0` 版本时,运行 Python Pipeline 程序时报错信息如下:
```
Failed to predict: (data_id=1 log_id=0) [det|0] Failed to postprocess: postprocess() takes 4 positional arguments but 5 were given
```
**A:** 在服务端程序(例如 web_service.py)的postprocess函数定义中增加参数data_id,改为 def postprocess(self, input_dicts, fetch_dict, **data_id**, log_id) 即可。
***
<a name="2"></a>
## 基础知识
#### Q: Paddle Serving 、Paddle Inference、PaddleHub Serving三者的区别及联系?
#### Q: Paddle Serving 、Paddle Inference、PaddleHub Serving 三者的区别及联系?
**A:** paddle serving是远程服务,即发起预测的设备(手机、浏览器、客户端等)与实际预测的硬件不在一起。 paddle inference是一个library,适合嵌入到一个大系统中保证预测效率,paddle serving调用了paddle inference做远程服务。paddlehub serving可以认为是一个示例,都会使用paddle serving作为统一预测服务入口。如果在web端交互,一般是调用远程服务的形式,可以使用paddle serving的web service搭建。
**A:** Paddle Serving 是远程服务,即发起预测的设备(手机、浏览器、客户端等)与实际预测的硬件不在一起。 paddle inference 是一个 library,适合嵌入到一个大系统中保证预测效率,Paddle Serving 调用 paddle inference 做远程服务。paddlehub serving 可以认为是一个示例,都会使用 Paddle Serving 作为统一预测服务入口。如果在 web 端交互,一般是调用远程服务的形式,可以使用 Paddle Serving 的 web service 搭建。
#### Q: paddle-serving是否支持Int32支持
#### Q: Paddle Serving 支持哪些数据类型?
**A:**protobuf定feed_type和fetch_type编号与数据类型对应如下,完整信息可参考[Serving配置与启动参数说明](./Serving_Configure_CN.md#模型配置文件)
**A:** protobuf 定义中 `feed_type``fetch_type` 编号与数据类型对应如下,完整信息可参考[保存用于 Serving 部署的模型参数](./Save_CN.md)
0-int64
1-float32
2-int32
| 类型 | 类型值 |
|------|------|
| int64 | 0 |
| float32 |1 |
| int32 | 2 |
| float64 | 3 |
| int16 | 4 |
| float16 | 5 |
| bfloat16 | 6 |
| uint8 | 7 |
| int8 | 8 |
| bool | 9 |
| complex64 | 10
| complex128 | 11 |
#### Q: paddle-serving是否支持windows和Linux环境下的多线程调用
#### Q: Paddle Serving 是否支持 Windows 和 Linux 原生环境部署?
**A:** 客户端可以发起多线程访问调用服务端
**A:** 安装 `Linux Docker`,在 Docker 中部署 Paddle Serving,参考[安装指南](./Install_CN.md)
#### Q: paddle-serving如何修改消息大小限制
#### Q: Paddle Serving 如何修改消息大小限制
**A:** 在server端和client但通过FLAGS_max_body_size来扩大数据量限制,单位为字节,默认为64MB
**A:** Server 和 Client 通过修改 `FLAGS_max_body_size` 参数来扩大数据量限制,单位为字节,默认为64MB
#### Q: paddle-serving客户端目前支持哪些语言
#### Q: Paddle Serving 客户端目前支持哪些开发语言?
**A:** java c++ python
**A:** 提供 Python、C++ 和 Java SDK
#### Q: paddle-serving目前支持哪些协议
#### Q: Paddle Serving 支持哪些网络协议?
**A:** http rpc
**A:** C++ Serving 同时支持 HTTP、gRPC 和 bRPC 协议。其中 HTTP 协议既支持 HTTP + Json 格式,同时支持 HTTP + proto 格式。完整信息请阅读[C++ Serving 通讯协议](./C++_Serving/Inference_Protocols_CN.md);Python Pipeline 支持 HTTP 和 gRPC 协议,更多信息请阅读[Python Pipeline 框架设计](./Python_Pipeline/Pipeline_Features_CN.md)
***
<a name="3"></a>
## 安装问题
#### Q: pip install安装whl包过程,报错信息如下:
#### Q: `pip install` 安装 `python wheel` 过程中,报错信息如何修复?
```
Collecting opencv-python
Using cached opencv-python-4.3.0.38.tar.gz (88.0 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... error
ERROR: Command errored out with exit status 1:
command: /home/work/Python-2.7.17/build/bin/python /home/work/Python-2.7.17/build/lib/python2.7/site-packages/pip/_vendor/pep517/_in_process.py get_requires_for_build_wheel /tmp/tmpLiweA9
cwd: /tmp/pip-install-_w6AUI/opencv-python
Complete output (22 lines):
Traceback (most recent call last):
File "/home/work/Python-2.7.17/build/lib/python2.7/site-packages/pip/_vendor/pep517/_in_process.py", line 280, in <module>
main()
File "/home/work/Python-2.7.17/build/lib/python2.7/site-packages/pip/_vendor/pep517/_in_process.py", line 263, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/home/work/Python-2.7.17/build/lib/python2.7/site-packages/pip/_vendor/pep517/_in_process.py", line 114, in get_requires_for_build_wheel
return hook(config_settings)
File "/tmp/pip-build-env-AUCbP4/overlay/lib/python2.7/site-packages/setuptools/build_meta.py", line 146, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
File "/tmp/pip-build-env-AUCbP4/overlay/lib/python2.7/site-packages/setuptools/build_meta.py", line 127, in _get_build_requires
self.run_setup()
File "/tmp/pip-build-env-AUCbP4/overlay/lib/python2.7/site-packages/setuptools/build_meta.py", line 243, in run_setup
self).run_setup(setup_script=setup_script)
File "/tmp/pip-build-env-AUCbP4/overlay/lib/python2.7/site-packages/setuptools/build_meta.py", line 142, in run_setup
exec(compile(code, __file__, 'exec'), locals())
File "setup.py", line 448, in <module>
main()
File "setup.py", line 99, in main
% {"ext": re.escape(sysconfig.get_config_var("EXT_SUFFIX"))}
File "/home/work/Python-2.7.17/build/lib/python2.7/re.py", line 210, in escape
......@@ -81,9 +84,9 @@ Collecting opencv-python
TypeError: 'NoneType' object is not iterable
```
**A:** 指定opencv-python版本安装,pip install opencv-python==4.2.0.32,再安装whl包
**A:** 指定 `opencv-python` 安装版本4.2.0.32,运行 `pip3 install opencv-python==4.2.0.32`
#### Q: pip3 install whl包过程报错信息如下:
#### Q: pip3 install wheel包过程报错,详细信息如下:
```
Complete output from command python setup.py egg_info:
......@@ -94,14 +97,14 @@ Collecting opencv-python
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-taoxz02y/grpcio/
```
**A:** 需要升级pip3,再重新执行安装命令。
**A:** 需要升级 pip3 版本,再重新执行安装命令。
```
pip3 install --upgrade pip
pip3 install --upgrade setuptools
```
#### Q: 运行过程中报错,信息如下:
#### Q: 运行过程中出现 `No module named xxx` 错误,信息如下:
```
Traceback (most recent call last):
......@@ -114,26 +117,27 @@ Traceback (most recent call last):
ImportError: No module named shapely.geometry
```
**A:** 有2种方法,第一种通过pip/pip3安装shapely,第二种通过pip/pip3安装所有依赖组件
**A:** 有2种方法,第一种通过 pip3 安装shapely,第二种通过 pip3 安装所有依赖组件[requirements.txt](https://github.com/PaddlePaddle/Serving/blob/develop/python/requirements.txt)
```
方法1:
pip install shapely==1.7.0
pip3 install shapely==1.7.0
方法2:
pip install -r python/requirements.txt
pip3 install -r python/requirements.txt
```
***
<a name="4"></a>
## 编译问题
#### Q: 如何使用自己编译的Paddle Serving进行预测?
#### Q: 如何使用自己编译的 Paddle Serving 进行预测?
**A:** 通过pip命令安装自己编译出的whl包,并设置SERVING_BIN环境变量为编译出的serving二进制文件路径
**A:** 编译 Paddle Serving 请阅读[编译 Serving](https://github.com/PaddlePaddle/Serving/blob/v0.8.3/doc/Compile_CN.md)
#### Q: 使用Java客户端,mvn compile过程出现"No compiler is provided in this environment. Perhaps you are running on a JRE rather than a JDK?"错误
#### Q: 使用 Java 客户端,mvn compile 过程出现 "No compiler is provided in this environment. Perhaps you are running on a JRE rather than a JDK?" 错误
**A:** 没有安装JDK,或者JAVA_HOME路径配置错误(正确配置是JDK路径,常见错误配置成JRE路径,例如正确路径参考JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el7_8.x86_64/")。Java JDK安装参考https://segmentfault.com/a/1190000015389941
**A:** 没有安装 JDK,或者 `JAVA_HOME` 路径配置错误(正确配置是 JDK 路径,常见错误配置成 JRE 路径,例如正确路径参考 `JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el7_8.x86_64/"`)。Java JDK 安装参考 https://segmentfault.com/a/1190000015389941。
#### Q: 编译过程报错 /usr/local/bin/ld: cannot find -lbz2
```
......@@ -147,39 +151,17 @@ Makefile:129: recipe for target 'all' failed
make: *** [all] Error 2
```
**A:** 运行命令安装libbz2: apt install libbz2-dev
***
## 环境问题
#### Q: ImportError: dlopen: cannot load any more object with static TLS
**A:** Ubuntu 系统运行命令安装 libbz2: `apt install libbz2-dev`
**A:** 一般是用户使用Linux系统版本比较低或者Python使用的gcc版本比较低导致的,可使用以下命令检查,或者通过使用Serving或Paddle镜像安装
```
strings /lib/libc.so | grep GLIBC
```
<a name="5"></a>
#### Q:使用过程中出现CXXABI错误。
## 环境问题
这个问题出现的原因是Python使用的gcc版本和Serving所需的gcc版本对不上。对于Docker用户,推荐使用[Docker容器](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Docker_Images_CN.md),由于Docker容器内的Python版本与Serving在发布前都做过适配,这样就不会出现类似的错误。如果是其他开发环境,首先需要确保开发环境中具备GCC 8.2,如果没有gcc 8.2,参考安装方式
#### Q:程序运行出现 `CXXABI` 相关错误。
```bash
wget -q https://paddle-ci.gz.bcebos.com/gcc-8.2.0.tar.xz
tar -xvf gcc-8.2.0.tar.xz && \
cd gcc-8.2.0 && \
unset LIBRARY_PATH CPATH C_INCLUDE_PATH PKG_CONFIG_PATH CPLUS_INCLUDE_PATH INCLUDE && \
./contrib/download_prerequisites && \
cd .. && mkdir temp_gcc82 && cd temp_gcc82 && \
../gcc-8.2.0/configure --prefix=/usr/local/gcc-8.2 --enable-threads=posix --disable-checking --disable-multilib && \
make -j8 && make install
cd .. && rm -rf temp_gcc82
cp ${lib_so_6} ${lib_so_6}.bak && rm -f ${lib_so_6} &&
ln -s /usr/local/gcc-8.2/lib64/libgfortran.so.5 ${lib_so_5} && \
ln -s /usr/local/gcc-8.2/lib64/libstdc++.so.6 ${lib_so_6} && \
cp /usr/local/gcc-8.2/lib64/libstdc++.so.6.0.25 ${lib_path}
```
错误原因是编译 Python 使用的 GCC 版本和编译 Serving 的 GCC 版本不一致。对于 Docker 用户,推荐使用[Docker容器](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Docker_Images_CN.md),由于 Docker 容器内的 Python 版本与 Serving 在发布前都做过适配,这样就不会出现类似的错误。
假如已经有了GCC 8.2,可以自行安装Python,此外我们也提供了两个GCC 8.2编译的[Python2.7](https://paddle-serving.bj.bcebos.com/others/Python2.7.17-gcc82.tar)[Python3.6](https://paddle-serving.bj.bcebos.com/others/Python3.6.10-gcc82.tar) 。下载解压后,需要将对应的目录设置为`PYTHONROOT`,并设置`PATH``LD_LIBRARY_PATH`
推荐使用 GCC 8.2 预编译包 [Python3.6](https://paddle-serving.bj.bcebos.com/others/Python3.6.10-gcc82.tar) 。下载解压后,需要将对应的目录设置为 `PYTHONROOT`,并设置 `PATH``LD_LIBRARY_PATH`
```bash
export PYTHONROOT=/path/of/python # 对应解压后的Python目录
......@@ -187,13 +169,13 @@ export PATH=$PYTHONROOT/bin:$PATH
export LD_LIBRARY_PATH=$PYTHONROOT/lib:$LD_LIBRARY_PATH
```
#### Q:遇到libstdc++.so.6的版本不够的问题
#### Q:遇到 `libstdc++.so.6` 的版本不够的问题
触发该问题的原因在于,编译Paddle Serving相关可执行程序和动态库,所采用的是GCC 8.2(Cuda 9.0和10.0的Server可执行程序受限Cuda兼容性采用GCC 4.8编译)。Python在调用的过程中,有可能链接到了其他GCC版本的 `libstdc++.so`。 需要做的就是受限确保所在环境具备GCC 8.2,其次将GCC8.2的`libstdc++.so.*`拷贝到某个目录例如`/home/libstdcpp`下。最后`export LD_LIBRARY_PATH=/home/libstdcpp:$LD_LIBRARY_PATH` 即可。
触发该问题的原因在于,编译 Paddle Serving 相关可执行程序和动态库,所采用的是 GCC 8.2(Cuda 9.0 和 10.0 的 Server 可执行程序受限 CUDA 兼容性采用 GCC 4.8编译)。Python 在调用的过程中,有可能链接到了其他 GCC 版本的 `libstdc++.so`。 需要做的就是受限确保所在环境具备 GCC 8.2,其次将 GCC8.2 的`libstdc++.so.*`拷贝到某个目录例如`/home/libstdcpp` 下。最后 `export LD_LIBRARY_PATH=/home/libstdcpp:$LD_LIBRARY_PATH` 即可。
#### Q: 遇到OPENSSL_1.0.1EC 符号找不到的问题。
#### Q: 遇到 `OPENSSL_1.0.1EC` 符号找不到的问题。
目前Serving的可执行程序和客户端动态库需要链接1.0.2k版本的openssl动态库。如果环境当中没有,可以执行
目前 Serving 的可执行程序和客户端动态库需要链接 `1.0.2k` 版本的 `openssl` 动态库。如果环境当中没有,可以执行
```bash
wget https://paddle-serving.bj.bcebos.com/others/centos_ssl.tar && \
......@@ -205,43 +187,27 @@ wget https://paddle-serving.bj.bcebos.com/others/centos_ssl.tar && \
ln -sf /usr/lib/libssl.so.10 /usr/lib/libssl.so
```
其中`/usr/lib` 可以换成其他目录,并确保该目录在`LD_LIBRARY_PATH`下。
其中 `/usr/lib` 可以换成其他目录,并确保该目录在 `LD_LIBRARY_PATH` 下。
### GPU相关环境问题
#### Q:需要做哪些检查确保Serving可以运行在GPU环境
#### Q:需要做哪些检查确保 Serving 可以运行在 GPU 环境
**注:如果是使用Serving提供的镜像不需要做下列检查,如果是其他开发环境可以参考以下指导。**
**注:如果是使用 Serving 提供的镜像不需要做下列检查,如果是其他开发环境可以参考以下指导。**
首先需要确保`nvidia-smi`可用,其次需要确保所需的动态库so文件在`LD_LIBRARY_PATH`所在的目录(包括系统lib库)。
(1)Cuda显卡驱动:文件名通常为 `libcuda.so.$DRIVER_VERSION` 例如驱动版本为440.10.15,文件名就是`libcuda.so.440.10.15`
(2)Cuda和Cudnn动态库:文件名通常为 `libcudart.so.$CUDA_VERSION`,和 `libcudnn.so.$CUDNN_VERSION`。例如Cuda9就是 `libcudart.so.9.0`,Cudnn7就是 `libcudnn.so.7`。Cuda和Cudnn与Serving的版本匹配参见[Serving所有镜像列表](Docker_Images_CN.md#%E9%99%84%E5%BD%95%E6%89%80%E6%9C%89%E9%95%9C%E5%83%8F%E5%88%97%E8%A1%A8).
(1)CUDA 显卡驱动:文件名通常为 `libcuda.so.$DRIVER_VERSION` 例如驱动版本为440.10.15,文件名就是 `libcuda.so.440.10.15`
(3) Cuda10.1及更高版本需要TensorRT。安装TensorRT相关文件的脚本参考 [install_trt.sh](../tools/dockerfiles/build_scripts/install_trt.sh).
(2)CUDA 和 cuDNN 动态库:文件名通常为 `libcudart.so.$CUDA_VERSION`,和 `libcudnn.so.$CUDNN_VERSION`。例如 CUDA9 就是 `libcudart.so.9.0`,Cudnn7就是 `libcudnn.so.7`。CUDA 和 cuDNN 与 Serving 的版本匹配参见[Serving所有镜像列表](Docker_Images_CN.md#%E9%99%84%E5%BD%95%E6%89%80%E6%9C%89%E9%95%9C%E5%83%8F%E5%88%97%E8%A1%A8).
***
## 模型参数保存问题
#### Q: 找不到'_remove_training_info'属性,详细报错信息如下:
```
python3 -m paddle_serving_client.convert --dirname ./ch_PP-OCRv2_det_infer/ \
--model_filename inference.pdmodel \
--params_filename inference.pdiparams \
--serving_server ./ppocrv2_det_serving/ \
--serving_client ./ppocrv2_det_client/
AttributeError: 'Program' object has no attribute '_remove_training_info'
```
(3) CUDA 10.1及更高版本需要 TensorRT。安装 TensorRT 相关文件的脚本参考 [install_trt.sh](../tools/dockerfiles/build_scripts/install_trt.sh).
**A:** Paddle版本低,升级Paddle版本到2.2.x及以上
***
<a name="6"></a>
## 部署问题
#### Q: GPU环境运行Serving报错,GPU count is: 0。
#### Q: GPU 环境运行 Serving 报错,GPU count is: 0。
```
terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
......@@ -261,34 +227,30 @@ InvalidArgumentError: Device id must be less than GPU count, but received id is:
[Hint: Expected id < GetCUDADeviceCount(), but received id:0 >= GetCUDADeviceCount():0.] at (/home/scmbuild/workspaces_cluster.dev/baidu.lib.paddlepaddle/baidu/lib/paddlepaddle/Paddle/paddle/fluid/platform/gpu_info.cc:211)
```
**A:** libcuda.so没有链接成功。首先在机器上找到libcuda.so,ldd检查libnvidia版本与nvidia-smi中版本一致(libnvidia-fatbinaryloader.so.418.39,与NVIDIA-SMI 418.39 Driver Version: 418.39),然后用export导出libcuda.so的路径即可(例如libcuda.so在/usr/lib64/,export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64/)
**A:** 原因是 `libcuda.so` 没有链接成功。首先在机器上找到 `libcuda.so`,使用 `ldd` 命令检查 libnvidia 版本与 nvidia-smi 中版本是否一致(libnvidia-fatbinaryloader.so.418.39,与NVIDIA-SMI 418.39 Driver Version: 418.39),然后用 export 导出 `libcuda.so` 的路径即可(例如 libcuda.so 在 /usr/lib64/,export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64/)
#### Q: 遇到 GPU not found, please check your environment or use cpu version by "pip install paddle_serving_server"
**A:** 检查环境中是否有N卡:ls /dev/ | grep nvidia
#### Q: 目前Paddle Serving支持哪些镜像环境?
**A:** 检查环境中是否有N卡:`ls /dev/ | grep nvidia`
**A:** 目前(0.4.0)仅支持CentOS,具体列表查阅[这里](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Docker_Images_CN.md)
#### Q: Paddle Serving 支持哪些镜像环境?
#### Q: python编译的GCC版本与serving的版本不匹配
**A:** 支持 CentOS 和 Ubuntu 环境镜像 ,完整列表查阅[这里](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Docker_Images_CN.md)
**A:**:1)使用GPU Dockers, [这里是Docker镜像列表](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Docker_Images_CN.md)解决环境问题;2)修改anaconda的虚拟环境下安装的python的gcc版本[改变python的GCC编译环境](https://www.jianshu.com/p/c498b3d86f77)
#### Q: Paddle Serving 是否支持本地离线安装
#### Q: paddle-serving是否支持本地离线安装
**A:** 支持离线部署,需要把一些相关的[依赖包](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Compile_CN.md) 提前准备安装好
**A:** 支持离线部署,需要把一些相关的[依赖包](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Compile_CN.md)提前准备安装好
#### Q: Docker 中启动 Server IP地址 127.0.0.1 与 0.0.0.0 差异
**A:** 必须将容器的主进程设置为绑定到特殊的 `0.0.0.0` 表示“所有接口”地址,否则它将无法从容器外部访问。在 Docker 中 `127.0.0.1` 仅代表“这个容器”,而不是“这台机器”。如果您从容器建立到 `127.0.0.1` 的出站连接,它将返回到同一个容器;如果您将服务器绑定到 `127.0.0.1`,接收不到来自外部的连接。
#### Q: Docker中启动server IP地址 127.0.0.1 与 0.0.0.0 差异
**A:** 您必须将容器的主进程设置为绑定到特殊的 0.0.0.0 “所有接口”地址,否则它将无法从容器外部访问。在Docker中 127.0.0.1 代表“这个容器”,而不是“这台机器”。如果您从容器建立到 127.0.0.1 的出站连接,它将返回到同一个容器;如果您将服务器绑定到 127.0.0.1,接收不到来自外部的连接。
***
<a name="7"></a>
## 预测问题
#### Q: 使用GPU第一次预测时特别慢,如何调整RPC服务的等待时间避免超时?
#### Q: 使用 GPU 第一次预测时特别慢,如何调整 RPC 服务的等待时间避免超时?
**A:** GPU第一次预测需要初始化。使用set_rpc_timeout_ms设置更长的等待时间,单位为毫秒,默认时间为20秒。
**A:** GPU 第一次预测需要初始化。使用 `set_rpc_timeout_ms` 设置更长的等待时间,单位为毫秒,默认时间为20秒。
示例:
......@@ -300,76 +262,67 @@ client.load_client_config(sys.argv[1])
client.set_rpc_timeout_ms(100000)
client.connect(["127.0.0.1:9393"])
```
#### Q: 执行 GPU 预测时遇到 `ExternalError: Cudnn error, CUDNN_STATUS_BAD_PARAM at (../batch_norm_op.cu:198)`错误
#### Q: 执行GPU预测时遇到InvalidArgumentError: Device id must be less than GPU count, but received id is: 0. GPU count is: 0.
**A:** 将显卡驱动对应的libcuda.so的目录添加到LD_LIBRARY_PATH环境变量中
**A:** 将 cuDNN 的 lib64路径添加到 `LD_LIBRARY_PATH`,安装自 `pypi` 的 Paddle Serving 中 `post9` 版本使用的是 `cuDNN 7.3,post10` 使用的是 `cuDNN 7.5。如果是使用自己编译的 Paddle Serving,可以在 `log/serving.INFO` 日志文件中查看对应的 cuDNN 版本。
#### Q: 执行GPU预测时遇到ExternalError: Cudnn error, CUDNN_STATUS_BAD_PARAM at (../batch_norm_op.cu:198)
#### Q: 执行 GPU 预测时遇到 `Error: Failed to find dynamic library: libcublas.so`
**A:**cudnn的lib64路径添加到LD_LIBRARY_PATH,安装自pypi的Paddle Serving中post9版使用的是cudnn 7.3,post10使用的是cudnn 7.5。如果是使用自己编译的Paddle Serving,可以在log/serving.INFO日志文件中查看对应的cudnn版本
**A:** 将 CUDA 的 lib64路径添加到 `LD_LIBRARY_PATH`, post9 版本的 Paddle Serving 使用的是 `cuda 9.0,post10` 版本使用的 `cuda 10.0`
#### Q: 执行GPU预测时遇到Error: Failed to find dynamic library: libcublas.so
#### Q: Client 的 `fetch var`变量名如何设置
**A:** 将cuda的lib64路径添加到LD_LIBRARY_PATH, post9版本的Paddle Serving使用的是cuda 9.0,post10版本使用的cuda 10.0。
#### Q: Client端fetch的变量名如何设置
**A:** 可以查看配置文件serving_server_conf.prototxt,获取需要的变量名
**A:** 通过[保存用于 Serving 部署的模型参数](https://github.com/PaddlePaddle/Serving/blob/v0.8.3/doc/Save_EN.md) 生成配置文件 `serving_server_conf.prototxt`,获取需要的变量名。
#### Q: 如何使用多语言客户端
**A:** 多语言客户端要与多语言服务端配套使用。当前版本下(0.4.0),服务端需要将Server改为MultiLangServer(如果是以命令行启动的话只需要添加--use_multilang参数),Python客户端需要将Client改为MultiLangClient,同时去除load_client_config的过程。[Java客户端参考文档](https://github.com/PaddlePaddle/Serving/blob/develop/doc/Java_SDK_CN.md)
**A:** 多语言客户端要与多语言服务端配套使用。当前版本下(0.8.3)
#### Q: 如何在Windows下使用Paddle Serving
#### Q: 如何在 Windows 下使用 Paddle Serving
**A:** 当前版本(0.4.0)在Windows上可以运行多语言RPC客户端,或使用HTTP方式访问。如果使用多语言RPC客户端,需要在Linux环境(比如本机容器,或远程Linux机器)中运行多语言服务端;如果使用HTTP方式,需要在Linux环境中运行普通服务端
**A:** 在 Windows 上可以运行多语言 RPC 客户端,或使用 HTTP 方式访问。
#### Q: libnvinfer.so: cannot open shared object file: No such file or directory)
#### Q: 报错信息 `libnvinfer.so: cannot open shared object file: No such file or directory)`
**A:** 参考该文档安装TensorRT: https://blog.csdn.net/hesongzefairy/article/details/105343525
**A:** 没有安装 TensorRT,安装 TensorRT 请参考链接: https://blog.csdn.net/hesongzefairy/article/details/105343525
***
<a name="8"></a>
## 日志排查
#### Q: 部署和预测中的日志信息在哪里查看?
**A:** server端的日志分为两部分,一部分打印到标准输出,一部分打印到启动服务时的目录下的log/serving.INFO文件中。
client端的日志直接打印到标准输出。
**A:** Server 的日志分为两部分,一部分打印到标准输出,一部分打印到启动服务时的目录下的 `log/serving.INFO` 文件中。
Client 的日志直接打印到标准输出。
通过在部署服务之前 'export GLOG_v=3'可以输出更为详细的日志信息。
#### Q: paddle-serving启动成功后,相关的日志在哪里设置
#### Q: C++ Serving 启动成功后,日志文件在哪里,在哪里设置日志级别?
**A:** 1)警告是glog组件打印的,告知glog初始化之前日志打印在STDERR
2)一般采用GLOG_v方式启动服务同时设置日志级别。
**A:** C++ Serving 服务的所有日志在程序运行的当前目录的`log/`目录下,分为 serving.INFO、serving.WARNING 和 serving.ERROR 文件。
1)警告是 `glog` 组件打印的,告知 `glog` 初始化之前日志打印在 STDERR;
2)一般采用 `GLOG_v` 方式启动服务同时设置日志级别。
例如:
```
GLOG_v=2 python -m paddle_serving_server.serve --model xxx_conf/ --port 9999
```
#### Q: Python Pipeline 启动成功后,日志文件在哪里,在哪里设置日志级别?
#### Q: (GLOG_v=2下)Server端日志一切正常,但Client端始终得不到正确的预测结果
**A:** 可能是配置文件有问题,检查下配置文件(is_load_tensor,fetch_type等有没有问题)
**A:** Python Pipeline 服务的日志信息请阅读[Python Pipeline 设计](./Python_Pipeline/Pipeline_Design_CN.md) 第三节服务日志。
#### Q: 如何给Server传递Logid
#### Q: (GLOG_v=2下)Server 日志一切正常,但 Client 始终得不到正确的预测结果
**A:** Logid默认为0(后续应该有自动生成Logid的计划,当前版本0.4.0),Client端通过在predict函数中指定log_id参数传递
#### Q: C++Server出现问题如何调试和定位
**A:** 推荐您使用gdb进行定位和调试,如果您使用docker,在启动容器时候,需要加上docker run --privileged参数,开启特权模式,这样才能在docker容器中使用gdb定位和调试
**A:** 可能是配置文件有问题,检查下配置文件(is_load_tensor,fetch_type等有没有问题)
如果您C++端出现coredump,一般而言会生成一个core文件,若没有,则应开启生成core文件选项,使用ulimit -c unlimited命令。
#### Q: 如何给 Server 传递 Logid
使用gdb调试core文件的方法为:gdb <可执行文件> <core文件>,进入后输入bt指令,一般即可显示出错在哪一行。
**A:** Logid 默认为0,Client 通过在 predict 函数中指定 log_id 参数
注意:可执行文件路径是C++ bin文件的路径,而不是python命令,一般为类似下面的这种/usr/local/lib/python3.6/site-packages/paddle_serving_server/serving-gpu-102-0.7.0/serving
#### Q: C++ Serving 出现问题如何调试和定位
**A:** 推荐您使用 GDB 进行定位和调试,如果您使用 Serving 的 Docker,在启动容器时候,需要加上 `docker run --privileged `参数,开启特权模式,这样才能在 docker 容器中使用 GDB 定位和调试
如果 C++ Serving 出现 `core dump`,一般会生成 core 文件,若没有,运行 `ulimit -c unlimited`命令开启core dump。
使用 GDB 调试 core 文件的方法为:`gdb <可执行文件> <core文件>`,进入后输入 `bt` 指令显示栈信息。
注意:可执行文件路径是 C++ bin 文件的路径,而不是 python 命令,一般为类似下面的这种 `/usr/local/lib/python3.6/site-packages/paddle_serving_server/serving-gpu-102-0.7.0/serving`
......@@ -6,7 +6,7 @@
**提示-1**:本项目仅支持<mark>**Python3.6/3.7/3.8/3.9**</mark>,接下来所有的与Python/Pip相关的操作都需要选择正确的Python版本。
**提示-2**:以下示例中GPU环境均为cuda10.2-cudnn7,如果您使用Python Pipeline来部署,并需要Nvidia TensorRT来优化预测性能,请参考[支持的镜像环境和说明](#4支持的镜像环境和说明)来选择其他版本。
**提示-2**:以下示例中GPU环境均为cuda11.2-cudnn8,如果您使用Python Pipeline来部署,并需要Nvidia TensorRT来优化预测性能,请参考[支持的镜像环境和说明](#4支持的镜像环境和说明)来选择其他版本。
## 1.启动开发镜像
......@@ -15,16 +15,16 @@
**CPU:**
```
# 启动 CPU Docker
docker pull registry.baidubce.com/paddlepaddle/serving:0.8.0-devel
docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:0.8.0-devel bash
docker pull registry.baidubce.com/paddlepaddle/serving:0.9.0-devel
docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:0.9.0-devel bash
docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
```
**GPU:**
```
# 启动 GPU Docker
docker pull registry.baidubce.com/paddlepaddle/serving:0.8.0-cuda10.2-cudnn7-devel
nvidia-docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:0.8.0-cuda10.2-cudnn7-devel bash
docker pull registry.baidubce.com/paddlepaddle/serving:0.9.0-cuda11.2-cudnn8-devel
nvidia-docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:0.9.0-cuda11.2-cudnn8-devel bash
nvidia-docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
```
......@@ -32,8 +32,8 @@ git clone https://github.com/PaddlePaddle/Serving
**CPU:**
```
# 启动 CPU Docker
docker pull registry.baidubce.com/paddlepaddle/paddle:2.2.2
docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/paddle:2.2.2 bash
nvidia-docker pull registry.baidubce.com/paddlepaddle/paddle:2.3.0
docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/paddle:2.3.0 bash
docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
......@@ -43,8 +43,9 @@ bash Serving/tools/paddle_env_install.sh
**GPU:**
```
# 启动 GPU Docker
nvidia-docker pull registry.baidubce.com/paddlepaddle/paddle:2.2.2-gpu-cuda10.2-cudnn7
nvidia-docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/paddle:2.2.2-gpu-cuda10.2-cudnn7 bash
nvidia-docker pull registry.baidubce.com/paddlepaddle/paddle:2.3.0-gpu-cuda11.2-cudnn8
nvidia-docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/paddle:2.3.0-gpu-cuda11.2-cudnn8 bash
nvidia-docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
......@@ -60,21 +61,21 @@ pip3 install -r python/requirements.txt
```
安装服务whl包,共有3种client、app、server,Server分为CPU和GPU,GPU包根据您的环境选择一种安装
- post102 = CUDA10.2 + cuDNN7 + TensorRT6(推荐)
- post112 = CUDA11.2 + cuDNN8 + TensorRT8(推荐)
- post101 = CUDA10.1 + cuDNN7 + TensorRT6
- post112 = CUDA11.2 + cuDNN8 + TensorRT8
- post102 = CUDA10.2 + cuDNN7 + TensorRT6 (与Paddle 镜像一致)
- post1028 = CUDA10.2 + cuDNN8 + TensorRT7
```shell
pip3 install paddle-serving-client==0.8.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install paddle-serving-app==0.8.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install paddle-serving-client==0.9.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install paddle-serving-app==0.9.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
# CPU Server
pip3 install paddle-serving-server==0.8.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install paddle-serving-server==0.9.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
# GPU Server,需要确认环境再选择执行哪一条,推荐使用CUDA 10.2的包
pip3 install paddle-serving-server-gpu==0.8.3.post102 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install paddle-serving-server-gpu==0.8.3.post101 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install paddle-serving-server-gpu==0.8.3.post112 -i https://pypi.tuna.tsinghua.edu.cn/simple
# GPU Server,需要确认环境再选择执行哪一条,推荐使用CUDA 11.2的包
pip3 install paddle-serving-server-gpu==0.9.0.post112 -i https://pypi.tuna.tsinghua.edu.cn/simple
```
默认开启国内清华镜像源来加速下载,如果您使用HTTP代理可以关闭(`-i https://pypi.tuna.tsinghua.edu.cn/simple`)
......@@ -85,45 +86,46 @@ paddle-serving-server和paddle-serving-server-gpu安装包支持Centos 6/7, Ubun
paddle-serving-client和paddle-serving-app安装包支持Linux和Windows,其中paddle-serving-client仅支持python3.6/3.7/3.8/3.9。
**如果您之前使用paddle serving 0.5.X 0.6.X的Cuda10.2环境,需要注意在0.8.0版本,paddle-serving-server-gpu==0.8.0.post102的使用Cudnn7和TensorRT6,而0.6.0.post102使用cudnn8和TensorRT7。如果0.6.0的cuda10.2用户需要升级安装,请使用paddle-serving-server-gpu==0.8.0.post1028**
## 3.安装Paddle相关Python库
**当您使用`paddle_serving_client.convert`命令或者`Python Pipeline框架`时才需要安装。**
```
# CPU环境请执行
pip3 install paddlepaddle==2.2.2 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install paddlepaddle==2.3.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
# GPU CUDA 10.2环境请执行
pip3 install paddlepaddle-gpu==2.2.2 -i https://pypi.tuna.tsinghua.edu.cn/simple
# GPU CUDA 11.2环境请执行
pip3 install paddlepaddle-gpu==2.3.0.post112 -i https://pypi.tuna.tsinghua.edu.cn/simple
```
**注意**如果您的Cuda版本不是10.2,或者您需要在GPU环境上使用TensorRT,请勿直接执行上述命令,需要参考[Paddle-Inference官方文档-下载安装Linux预测库](https://paddleinference.paddlepaddle.org.cn/master/user_guides/download_lib.html#python)选择相应的GPU环境的url链接并进行安装。举例假设您使用python3.6,请执行如下命令
**注意**其他版本请参考[Paddle-Inference官方文档-下载安装Linux预测库](https://paddleinference.paddlepaddle.org.cn/master/user_guides/download_lib.html#python) 选择相应的GPU环境的 URL 链接并进行安装
```
# CUDA10.1 + CUDNN7 + TensorRT6
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.2.2/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.1_cudnn7.6.5_trt6.0.1.5/paddlepaddle_gpu-2.2.2.post101-cp36-cp36m-linux_x86_64.whl
# CUDA11.2 + CUDNN8 + TensorRT8 + Python(3.6-3.9)
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.3.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda11.2_cudnn8.2.1_trt8.0.3.4/paddlepaddle_gpu-2.3.0.post112-cp36-cp36m-linux_x86_64.whl
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.3.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda11.2_cudnn8.2.1_trt8.0.3.4/paddlepaddle_gpu-2.3.0.post112-cp37-cp37m-linux_x86_64.whl
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.3.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda11.2_cudnn8.2.1_trt8.0.3.4/paddlepaddle_gpu-2.3.0.post112-cp38-cp38-linux_x86_64.whl
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.3.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda11.2_cudnn8.2.1_trt8.0.3.4/paddlepaddle_gpu-2.3.0.post112-cp39-cp39-linux_x86_64.whl
# CUDA10.2 + CUDNN7 + TensorRT6, 需要注意的是此环境和Cuda10.1+Cudnn7+TensorRT6使用同一个paddle whl包
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.2.2/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.1_cudnn7.6.5_trt6.0.1.5/paddlepaddle_gpu-2.2.2.post101-cp36-cp36m-linux_x86_64.whl
# CUDA10.1 + CUDNN7 + TensorRT6 + Python(3.6-3.9)
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.3.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.1_cudnn7.6.5_trt6.0.1.5/paddlepaddle_gpu-2.3.0.post101-cp36-cp36m-linux_x86_64.whl
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.3.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.1_cudnn7.6.5_trt6.0.1.5/paddlepaddle_gpu-2.3.0.post101-cp37-cp37m-linux_x86_64.whl
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.3.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.1_cudnn7.6.5_trt6.0.1.5/paddlepaddle_gpu-2.3.0.post101-cp38-cp38-linux_x86_64.whl
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.3.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.1_cudnn7.6.5_trt6.0.1.5/paddlepaddle_gpu-2.3.0.post101-cp39-cp39-linux_x86_64.whl
# CUDA10.2 + CUDNN8 + TensorRT7
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.2.2/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.2_cudnn8.1.1_trt7.2.3.4/paddlepaddle_gpu-2.2.2-cp36-cp36m-linux_x86_64.whl
# CUDA10.2 + CUDNN8 + TensorRT7 + Python(3.6-3.9)
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.3.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.2_cudnn8.1.1_trt7.2.3.4/paddlepaddle_gpu-2.3.0-cp36-cp36m-linux_x86_64.whl
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.3.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.2_cudnn8.1.1_trt7.2.3.4/paddlepaddle_gpu-2.3.0-cp37-cp37m-linux_x86_64.whl
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.3.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.2_cudnn8.1.1_trt7.2.3.4/paddlepaddle_gpu-2.3.0-cp38-cp38-linux_x86_64.whl
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.3.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.2_cudnn8.1.1_trt7.2.3.4/paddlepaddle_gpu-2.3.0-cp39-cp39-linux_x86_64.whl
# CUDA11.2 + CUDNN8 + TensorRT8
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.2.2/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda11.2_cudnn8.2.1_trt8.0.3.4/paddlepaddle_gpu-2.2.2.post112-cp36-cp36m-linux_x86_64.whl
```
例如CUDA 10.1的Python3.6用户,请选择表格当中的`cp36-cp36m``linux-cuda10.1-cudnn7.6-trt6-gcc8.2`对应的url,复制下来并执行
```
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.2.2/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.1_cudnn7.6.5_trt6.0.1.5/paddlepaddle_gpu-2.2.2.post101-cp36-cp36m-linux_x86_64.whl
```
## 4.支持的镜像环境和说明
| 环境 | Serving开发镜像Tag | 操作系统 | Paddle开发镜像Tag | 操作系统 |
| :--------------------------: | :-------------------------------: | :-------------: | :-------------------: | :----------------: |
| CPU | 0.8.0-devel | Ubuntu 16.04 | 2.2.2 | Ubuntu 18.04. |
| CUDA10.1 + CUDNN7 | 0.8.0-cuda10.1-cudnn7-devel | Ubuntu 16.04 | 无 | 无 |
| CUDA10.2 + CUDNN7 | 0.8.0-cuda10.2-cudnn7-devel | Ubuntu 16.04 | 2.2.2-gpu-cuda10.2-cudnn7 | Ubuntu 16.04 |
| CUDA10.2 + CUDNN8 | 0.8.0-cuda10.2-cudnn8-devel | Ubuntu 16.04 | 无 | 无 |
| CUDA11.2 + CUDNN8 | 0.8.0-cuda11.2-cudnn8-devel | Ubuntu 16.04 | 2.2.2-gpu-cuda11.2-cudnn8 | Ubuntu 18.04 |
| CPU | 0.9.0-devel | Ubuntu 16.04 | 2.3.0 | Ubuntu 18.04. |
| CUDA10.1 + CUDNN7 | 0.9.0-cuda10.1-cudnn7-devel | Ubuntu 16.04 | 无 | 无 |
| CUDA10.2 + CUDNN8 | 0.9.0-cuda10.2-cudnn8-devel | Ubuntu 16.04 | 无 | Ubuntu 18.04 |
| CUDA11.2 + CUDNN8 | 0.9.0-cuda11.2-cudnn8-devel | Ubuntu 16.04 | 2.3.0-gpu-cuda11.2-cudnn8 | Ubuntu 18.04 |
对于**Windows 10 用户**,请参考文档[Windows平台使用Paddle Serving指导](Windows_Tutorial_CN.md)
......@@ -132,4 +134,4 @@ pip3 install https://paddle-inference-lib.bj.bcebos.com/2.2.2/python/Linux/GPU/x
```
python3 -m paddle_serving_server.serve check
```
详情请参考[环境检查文档](./Check_Env_CN.md)
详情请参考[环境检查文档](./Check_Env_CN.md)
\ No newline at end of file
......@@ -6,7 +6,7 @@
**Tip-1**: This project only supports <mark>**Python3.6/3.7/3.8/3.9**</mark>, all subsequent operations related to Python/Pip need to select the correct Python version.
**Tip-2**: The GPU environments in the following examples are all cuda10.2-cudnn7. If you use Python Pipeline to deploy and need Nvidia TensorRT to optimize prediction performance, please refer to [Supported Mirroring Environment and Instructions](#4.-Supported-Docker-Images-and-Instruction) to choose other versions.
**Tip-2**: The GPU environments in the following examples are all cuda11.2-cudnn8. If you use Python Pipeline to deploy and need Nvidia TensorRT to optimize prediction performance, please refer to [Supported Mirroring Environment and Instructions](#4.-Supported-Docker-Images-and-Instruction) to choose other versions.
## 1. Start the Docker Container
<mark>**Both Serving Dev Image and Paddle Dev Image are supported at the same time. You can choose 1 from the operation 2 in chapters 1.1 and 1.2.**</mark>Deploying the Serving service on the Paddle docker image requires the installation of additional dependency libraries. Therefore, we directly use the Serving development image.
......@@ -15,16 +15,16 @@
**CPU:**
```
# Start CPU Docker Container
docker pull registry.baidubce.com/paddlepaddle/serving:0.8.0-devel
docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:0.8.0-devel bash
docker pull registry.baidubce.com/paddlepaddle/serving:0.9.0-devel
docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:0.9.0-devel bash
docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
```
**GPU:**
```
# Start GPU Docker Container
docker pull registry.baidubce.com/paddlepaddle/serving:0.8.0-cuda10.2-cudnn7-devel
nvidia-docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/serving:0.8.0-cuda10.2-cudnn7-devel bash
docker pull registry.baidubce.com/paddlepaddle/serving:0.9.0-cuda11.2-cudnn7-devel
nvidia-docker run -p 9292:9292 --name test -dit docker pull registry.baidubce.com/paddlepaddle/serving:0.9.0-cuda11.2-cudnn7-devel bash
nvidia-docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
```
......@@ -32,8 +32,8 @@ git clone https://github.com/PaddlePaddle/Serving
**CPU:**
```
# Start CPU Docker Container
docker pull registry.baidubce.com/paddlepaddle/paddle:2.2.2
docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/paddle:2.2.2 bash
docker pull registry.baidubce.com/paddlepaddle/paddle:2.3.0
docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/paddle:2.3.0 bash
docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
......@@ -43,8 +43,8 @@ bash Serving/tools/paddle_env_install.sh
**GPU:**
```
# Start GPU Docker
nvidia-docker pull registry.baidubce.com/paddlepaddle/paddle:2.2.2-gpu-cuda10.2-cudnn7
nvidia-docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/paddle:2.2.2-gpu-cuda10.2-cudnn7 bash
nvidia-docker pull registry.baidubce.com/paddlepaddle/paddle:2.3.0-gpu-cuda11.2-cudnn8
nvidia-docker run -p 9292:9292 --name test -dit registry.baidubce.com/paddlepaddle/paddle:2.3.0-gpu-cuda11.2-cudnn8 bash
nvidia-docker exec -it test bash
git clone https://github.com/PaddlePaddle/Serving
......@@ -61,21 +61,22 @@ pip3 install -r python/requirements.txt
```
Install the service whl package. There are three types of client, app and server. The server is divided into CPU and GPU. Choose one installation according to the environment.
- post102 = CUDA10.2 + cuDNN7 + TensorRT6(Recommended)
- post112 = CUDA11.2 + cuDNN8 + TensorRT8(Recommanded)
- post101 = CUDA10.1 + cuDNN7 + TensorRT6
- post112 = CUDA11.2 + cuDNN8 + TensorRT8
- post102 = CUDA10.2 + cuDNN8 + TensorRT7
```shell
pip3 install paddle-serving-client==0.8.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install paddle-serving-app==0.8.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install paddle-serving-client==0.9.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install paddle-serving-app==0.9.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
# CPU Server
pip3 install paddle-serving-server==0.8.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install paddle-serving-server==0.9.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
# GPU environments need to confirm the environment before choosing which one to execute
pip3 install paddle-serving-server-gpu==0.8.3.post102 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install paddle-serving-server-gpu==0.8.3.post101 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install paddle-serving-server-gpu==0.8.3.post112 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install paddle-serving-server-gpu==0.9.0.post112 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install paddle-serving-server-gpu==0.9.0.post102 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install paddle-serving-server-gpu==0.9.0.post101 -i https://pypi.tuna.tsinghua.edu.cn/simple
```
By default, the domestic Tsinghua mirror source is turned on to speed up the download. If you use a proxy, you can turn it off(`-i https://pypi.tuna.tsinghua.edu.cn/simple`).
......@@ -86,31 +87,35 @@ The paddle-serving-server and paddle-serving-server-gpu installation packages su
The paddle-serving-client and paddle-serving-app installation packages support Linux and Windows, and paddle-serving-client only supports python3.6/3.7/3.8/3.9.
**If you used the CUDA10.2 environment of paddle serving 0.5.X 0.6.X before, you need to pay attention to version 0.8.0, paddle-serving-server-gpu==0.8.0.post102 uses Cudnn7 and TensorRT6, and 0.6.0.post102 uses cudnn8 and TensorRT7. If 0.6.0 cuda10.2 users need to upgrade, please use paddle-serving-server-gpu==0.8.0.post1028**
## 3. Install Paddle related Python libraries
**You only need to install it when you use the `paddle_serving_client.convert` command or the `Python Pipeline framework`. **
```
# CPU environment please execute
pip3 install paddlepaddle==2.2.2 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install paddlepaddle==2.3.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
# GPU CUDA 10.2 environment please execute
pip3 install paddlepaddle-gpu==2.2.2 -i https://pypi.tuna.tsinghua.edu.cn/simple
# GPU CUDA 11.2 environment please execute
pip3 install paddlepaddle-gpu==2.3.0.post112 -i https://pypi.tuna.tsinghua.edu.cn/simple
```
**Note**: If your CUDA version is not 10.2 or if you want to use TensorRT(CUDA10.2 included), please do not execute the above commands directly, you need to refer to [Paddle-Inference official document-download and install the Linux prediction library](https://paddleinference.paddlepaddle.org.cn/master/user_guides/download_lib.html#python) Select the URL link of the corresponding GPU environment and install it. Assuming that you use Python3.6, please follow the codeblock.
**Note**: If you want to use other versions, please do not execute the above commands directly, you need to refer to [Paddle-Inference official document-download and install the Linux prediction library](https://paddleinference.paddlepaddle.org.cn/master/user_guides/download_lib.html#python) Select the URL link of the corresponding GPU environment and install it. Assuming that you use Python3.6, please follow the codeblock.
```
# CUDA10.1 + CUDNN7 + TensorRT6
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.2.2/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.1_cudnn7.6.5_trt6.0.1.5/paddlepaddle_gpu-2.2.2.post101-cp36-cp36m-linux_x86_64.whl
# CUDA10.2 + CUDNN7 + TensorRT6, Attenton that the paddle whl for this env is same to that of CUDA10.1 + Cudnn7 + TensorRT6
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.2.2/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.1_cudnn7.6.5_trt6.0.1.5/paddlepaddle_gpu-2.2.2.post101-cp36-cp36m-linux_x86_64.whl
# CUDA11.2 + CUDNN8 + TensorRT8 + Python(3.6-3.9)
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.3.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda11.2_cudnn8.2.1_trt8.0.3.4/paddlepaddle_gpu-2.3.0.post112-cp36-cp36m-linux_x86_64.whl
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.3.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda11.2_cudnn8.2.1_trt8.0.3.4/paddlepaddle_gpu-2.3.0.post112-cp37-cp37m-linux_x86_64.whl
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.3.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda11.2_cudnn8.2.1_trt8.0.3.4/paddlepaddle_gpu-2.3.0.post112-cp38-cp38-linux_x86_64.whl
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.3.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda11.2_cudnn8.2.1_trt8.0.3.4/paddlepaddle_gpu-2.3.0.post112-cp39-cp39-linux_x86_64.whl
# CUDA10.2 + Cudnn8 + TensorRT7
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.2.2/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.2_cudnn8.1.1_trt7.2.3.4/paddlepaddle_gpu-2.2.2-cp36-cp36m-linux_x86_64.whl
# CUDA10.1 + CUDNN7 + TensorRT6 + Python(3.6-3.9)
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.3.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.1_cudnn7.6.5_trt6.0.1.5/paddlepaddle_gpu-2.3.0.post101-cp36-cp36m-linux_x86_64.whl
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.3.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.1_cudnn7.6.5_trt6.0.1.5/paddlepaddle_gpu-2.3.0.post101-cp37-cp37m-linux_x86_64.whl
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.3.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.1_cudnn7.6.5_trt6.0.1.5/paddlepaddle_gpu-2.3.0.post101-cp38-cp38-linux_x86_64.whl
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.3.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.1_cudnn7.6.5_trt6.0.1.5/paddlepaddle_gpu-2.3.0.post101-cp39-cp39-linux_x86_64.whl
# CUDA11.2 + CUDNN8 + TensorRT8
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.2.2/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda11.2_cudnn8.2.1_trt8.0.3.4/paddlepaddle_gpu-2.2.2.post112-cp36-cp36m-linux_x86_64.whl
# CUDA10.2 + CUDNN8 + TensorRT7 + Python(3.6-3.9)
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.3.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.2_cudnn8.1.1_trt7.2.3.4/paddlepaddle_gpu-2.3.0-cp36-cp36m-linux_x86_64.whl
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.3.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.2_cudnn8.1.1_trt7.2.3.4/paddlepaddle_gpu-2.3.0-cp37-cp37m-linux_x86_64.whl
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.3.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.2_cudnn8.1.1_trt7.2.3.4/paddlepaddle_gpu-2.3.0-cp38-cp38-linux_x86_64.whl
pip3 install https://paddle-inference-lib.bj.bcebos.com/2.3.0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda10.2_cudnn8.1.1_trt7.2.3.4/paddlepaddle_gpu-2.3.0-cp39-cp39-linux_x86_64.whl
```
## 4. Supported Docker Images and Instruction
......@@ -118,11 +123,10 @@ pip3 install https://paddle-inference-lib.bj.bcebos.com/2.2.2/python/Linux/GPU/x
| Environment | Serving Development Image Tag | Operating System | Paddle Development Image Tag | Operating System |
| :--------------------------: | :-------------------------------: | :-------------: | :-------------------: | :----------------: |
| CPU | 0.8.0-devel | Ubuntu 16.04 | 2.2.2 | Ubuntu 18.04. |
| CUDA10.1 + CUDNN7 | 0.8.0-cuda10.1-cudnn7-devel | Ubuntu 16.04 | 无 | 无 |
| CUDA10.2 + CUDNN7 | 0.8.0-cuda10.2-cudnn7-devel | Ubuntu 16.04 | 2.2.2-gpu-cuda10.2-cudnn7 | Ubuntu 16.04 |
| CUDA10.2 + CUDNN8 | 0.8.0-cuda10.2-cudnn8-devel | Ubuntu 16.04 | 无 | 无 |
| CUDA11.2 + CUDNN8 | 0.8.0-cuda11.2-cudnn8-devel | Ubuntu 16.04 | 2.2.2-gpu-cuda11.2-cudnn8 | Ubuntu 18.04 |
| CPU | 0.9.0-devel | Ubuntu 16.04 | 2.3.0 | Ubuntu 18.04. |
| CUDA10.1 + CUDNN7 | 0.9.0-cuda10.1-cudnn7-devel | Ubuntu 16.04 | 无 | 无 |
| CUDA10.2 + CUDNN8 | 0.9.0-cuda10.2-cudnn8-devel | Ubuntu 16.04 | 无 | Ubuntu 18.04 |
| CUDA11.2 + CUDNN8 | 0.9.0-cuda11.2-cudnn8-devel | Ubuntu 16.04 | 2.3.0-gpu-cuda11.2-cudnn8 | Ubuntu 18.04 |
For **Windows 10 users**, please refer to the document [Paddle Serving Guide for Windows Platform](Windows_Tutorial_CN.md).
......
# 原生系统标准环境安装
本文介绍基于原生系统标准环境进行配置安装。
<img src="images/2-2_Environment_CN_1.png">
## CentOS 7 环境配置(第一步)
**一.环境准备**
* **Python 版本 3.6/3.7/3.8/3.9 (64 bit)**
**二.选择 CPU/GPU**
* 如果您的计算机有 NVIDIA® GPU,请确保满足以下条件
* **CUDA 工具包:10.1/10.2 配合 cuDNN 7 (cuDNN 版本>=7.6.5) 或者 11.2 配合 cuDNN v8.1.1**
* **兼容版本的 TensorRT**
* **GPU运算能力超过3.5的硬件设备**
您可参考NVIDIA官方文档了解CUDA和CUDNN的安装流程和配置方法,请见[CUDA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/),[cuDNN](https://docs.nvidia.com/deeplearning/sdk/cudnn-install/),[TensorRT](https://docs.nvidia.com/deeplearning/tensorrt/index.html), [GPU算力](https://developer.nvidia.com/cuda-gpus)
**三.安装必要工具**
需要安装的依赖库及工具详见下表:
| 组件 | 版本要求 |
| :--------------------------: | :-------------------------------: |
| bzip2-devel | 1.0.6 and later |
| make | later |
| gcc | 8.2.0 |
| gcc-c++ | 8.2.0 |
| cmake | 3.15.0 and later |
| Go | 1.17.2 and later |
| openssl-devel | 1.0.2k |
| patchelf | 0.9 |
1. 更新系统源
更新`yum`的源:
```
yum update
```
并添加必要的yum源:
```
yum install -y epel-release
```
2. 安装工具
`bzip2`以及`make`:
```
yum install -y bzip2
```
```
yum install -y make
```
cmake 需要3.15以上,建议使用3.16.0:
```
wget -q https://cmake.org/files/v3.16/cmake-3.16.0-Linux-x86_64.tar.gz
```
```
tar -zxvf cmake-3.16.0-Linux-x86_64.tar.gz
```
```
rm cmake-3.16.0-Linux-x86_64.tar.gz
```
```
PATH=/home/cmake-3.16.0-Linux-x86_64/bin:$PATH
```
gcc 需要5.4以上,建议使用8.2.0:
```
wget -q https://paddle-docker-tar.bj.bcebos.com/home/users/tianshuo/bce-python-sdk-0.8.27/gcc-8.2.0.tar.xz && \
tar -xvf gcc-8.2.0.tar.xz && \
cd gcc-8.2.0 && \
sed -i 's#ftp://gcc.gnu.org/pub/gcc/infrastructure/#https://paddle-ci.gz.bcebos.com/#g' ./contrib/download_prerequisites && \
unset LIBRARY_PATH CPATH C_INCLUDE_PATH PKG_CONFIG_PATH CPLUS_INCLUDE_PATH INCLUDE && \
./contrib/download_prerequisites && \
cd .. && mkdir temp_gcc82 && cd temp_gcc82 && \
../gcc-8.2.0/configure --prefix=/usr/local/gcc-8.2 --enable-threads=posix --disable-checking --disable-multilib && \
make -j8 && make install
```
3. 安装GOLANG
建议使用 go1.17.2:
```
wget -qO- https://go.dev/dl/go1.17.2.linux-amd64.tar.gz | \
tar -xz -C /usr/local && \
mkdir /root/go && \
mkdir /root/go/bin && \
mkdir /root/go/src && \
echo "GOROOT=/usr/local/go" >> /root/.bashrc && \
echo "GOPATH=/root/go" >> /root/.bashrc && \
echo "PATH=/usr/local/go/bin:/root/go/bin:$PATH" >> /root/.bashrc
source /root/.bashrc
```
4. 安装依赖库
安装相关依赖库 patchelf:
```
yum install patchelf
```
配置 ssl 依赖库
```
wget https://paddle-serving.bj.bcebos.com/others/centos_ssl.tar && \
tar xf centos_ssl.tar && rm -rf centos_ssl.tar && \
mv libcrypto.so.1.0.2k /usr/lib/libcrypto.so.1.0.2k && mv libssl.so.1.0.2k /usr/lib/libssl.so.1.0.2k && \
ln -sf /usr/lib/libcrypto.so.1.0.2k /usr/lib/libcrypto.so.10 && \
ln -sf /usr/lib/libssl.so.1.0.2k /usr/lib/libssl.so.10 && \
ln -sf /usr/lib/libcrypto.so.10 /usr/lib/libcrypto.so && \
ln -sf /usr/lib/libssl.so.10 /usr/lib/libssl.so
```
## Ubuntu 16.04/18.04 环境配置(第一步)
**一.环境准备**
* **Python 版本 3.6/3.7/3.8/3.9 (64 bit)**
**二.选择 CPU/GPU**
* 如果您的计算机有 NVIDIA® GPU,请确保满足以下条件
* **CUDA 工具包 10.1/10.2 配合 cuDNN 7 (cuDNN 版本>=7.6.5)**
* **CUDA 工具包 11.2 配合 cuDNN v8.1.1**
* **配套版本的 TensorRT**
* **GPU运算能力超过3.5的硬件设备**
您可参考NVIDIA官方文档了解CUDA和CUDNN的安装流程和配置方法,请见[CUDA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/),[cuDNN](https://docs.nvidia.com/deeplearning/sdk/cudnn-install/),[TensorRT](https://docs.nvidia.com/deeplearning/tensorrt/index.html)
**三.安装必要工具**
1. 更新系统源
更新`apt`的源:
```
apt update
```
2. 安装工具
`bzip2`以及`make`:
```
apt install -y bzip2
```
```
apt install -y make
```
cmake 需要3.15以上,建议使用3.16.0:
```
wget -q https://cmake.org/files/v3.16/cmake-3.16.0-Linux-x86_64.tar.gz
```
```
tar -zxvf cmake-3.16.0-Linux-x86_64.tar.gz
```
```
rm cmake-3.16.0-Linux-x86_64.tar.gz
```
```
PATH=/home/cmake-3.16.0-Linux-x86_64/bin:$PATH
```
gcc 需要5.4以上,建议使用8.2.0:
```
wget -q https://paddle-docker-tar.bj.bcebos.com/home/users/tianshuo/bce-python-sdk-0.8.27/gcc-8.2.0.tar.xz && \
tar -xvf gcc-8.2.0.tar.xz && \
cd gcc-8.2.0 && \
sed -i 's#ftp://gcc.gnu.org/pub/gcc/infrastructure/#https://paddle-ci.gz.bcebos.com/#g' ./contrib/download_prerequisites && \
unset LIBRARY_PATH CPATH C_INCLUDE_PATH PKG_CONFIG_PATH CPLUS_INCLUDE_PATH INCLUDE && \
./contrib/download_prerequisites && \
cd .. && mkdir temp_gcc82 && cd temp_gcc82 && \
../gcc-8.2.0/configure --prefix=/usr/local/gcc-8.2 --enable-threads=posix --disable-checking --disable-multilib && \
make -j8 && make install
```
3. 安装GOLANG
建议使用 go1.17.2:
```
wget -qO- https://go.dev/dl/go1.17.2.linux-amd64.tar.gz | \
tar -xz -C /usr/local && \
mkdir /root/go && \
mkdir /root/go/bin && \
mkdir /root/go/src && \
echo "GOROOT=/usr/local/go" >> /root/.bashrc && \
echo "GOPATH=/root/go" >> /root/.bashrc && \
echo "PATH=/usr/local/go/bin:/root/go/bin:$PATH" >> /root/.bashrc
source /root/.bashrc
```
4. 安装依赖库
安装相关依赖库 patchelf:
```
apt-get install patchelf
```
配置 ssl 依赖库
```
wget https://paddle-serving.bj.bcebos.com/others/centos_ssl.tar && \
tar xf centos_ssl.tar && rm -rf centos_ssl.tar && \
mv libcrypto.so.1.0.2k /usr/lib/libcrypto.so.1.0.2k && mv libssl.so.1.0.2k /usr/lib/libssl.so.1.0.2k && \
ln -sf /usr/lib/libcrypto.so.1.0.2k /usr/lib/libcrypto.so.10 && \
ln -sf /usr/lib/libssl.so.1.0.2k /usr/lib/libssl.so.10 && \
ln -sf /usr/lib/libcrypto.so.10 /usr/lib/libcrypto.so && \
ln -sf /usr/lib/libssl.so.10 /usr/lib/libssl.so
```
## Windows 环境配置(第一步)
由于受限第三方库的支持,Windows平台目前只支持用web service的方式搭建local predictor预测服务。
**一.环境准备**
* **Python 版本 3.6/3.7/3.8/3.9 (64 bit)**
**二.选择 CPU/GPU**
* 如果您的计算机有 NVIDIA® GPU,请确保满足以下条件
* **CUDA 工具包 10.1/10.2 配合 cuDNN 7 (cuDNN 版本>=7.6.5)**
* **CUDA 工具包 11.2 配合 cuDNN v8.1.1**
* **配套版本的 TensorRT**
* **GPU运算能力超过3.5的硬件设备**
您可参考NVIDIA官方文档了解CUDA和CUDNN的安装流程和配置方法,请见[CUDA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/),[cuDNN](https://docs.nvidia.com/deeplearning/sdk/cudnn-install/),[TensorRT](https://docs.nvidia.com/deeplearning/tensorrt/index.html)
**三.安装必要工具**
1. 更新 wget 工具
在链接[下载wget](http://gnuwin32.sourceforge.net/packages/wget.htm),解压后复制到`C:\Windows\System32`下,如有安全提示需要通过。
2. 安装git工具
详情参见[Git官网](https://git-scm.com/downloads)
3. 安装必要的C++库(可选)
部分用户可能会在`import paddle`阶段遇见dll无法链接的问题,建议[安装Visual Studio社区版本](https://visualstudio.microsoft.com/) ,并且安装C++的相关组件。
## 使用 pip 安装(第二步)
**一. 安装服务 whl 包**
服务 whl 包包括: client、app、server,其中 Server 分为 CPU 和 GPU,GPU 包根据您的环境选择一种安装
```
pip3 install paddle-serving-client==0.8.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install paddle-serving-app==0.8.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
# CPU Server
pip3 install paddle-serving-server==0.8.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
# GPU Server,需要确认环境再选择执行哪一条,推荐使用CUDA 10.2的包
# CUDA10.2 + Cudnn7 + TensorRT6(推荐)
pip3 install paddle-serving-server-gpu==0.8.3.post102 -i https://pypi.tuna.tsinghua.edu.cn/simple
# CUDA10.1 + TensorRT6
pip3 install paddle-serving-server-gpu==0.8.3.post101 -i https://pypi.tuna.tsinghua.edu.cn/simple
# CUDA11.2 + TensorRT8
pip3 install paddle-serving-server-gpu==0.8.3.post112 -i https://pypi.tuna.tsinghua.edu.cn/simple
```
默认开启国内清华镜像源来加速下载,如果您使用 HTTP 代理可以关闭(`-i https://pypi.tuna.tsinghua.edu.cn/simple`)
**二. 安装 Paddle 相关 Python 库**
**当您使用`paddle_serving_client.convert`命令或者`Python Pipeline 框架`时才需要安装。**
```
# CPU 环境请执行
pip3 install paddlepaddle==2.2.2
# GPU CUDA 10.2环境请执行
pip3 install paddlepaddle-gpu==2.2.2
```
**注意**: 如果您的 Cuda 版本不是10.2,或者您需要在 GPU 环境上使用 TensorRT,请勿直接执行上述命令,需要参考[Paddle-Inference官方文档-下载安装Linux预测库](https:/paddleinference.paddlepaddle.org.cn/master/user_guides/download_lib.html#python)选择相应的 GPU 环境的 url 链接并进行安装。
**三. 安装完成后的环境检查**
当以上步骤均完成后可使用命令行运行环境检查功能,自动运行 Paddle Serving 相关示例,进行环境相关配置校验。
```
python3 -m paddle_serving_server.serve check
# 以下输出表明环境检查正常
(Cmd) check_all
PaddlePaddle inference environment running success
C++ cpu environment running success
C++ gpu environment running success
Pipeline cpu environment running success
Pipeline gpu environment running success
```
详情请参考[环境检查文档](./Check_Env_CN.md)
\ No newline at end of file
......@@ -17,7 +17,7 @@ Paddle Serving 提供了 Java SDK,支持 Client 端用 Java 语言进行预测
| Paddle Serving Server version | Java SDK version |
| :---------------------------: | :--------------: |
| 0.8.0 | 0.0.1 |
| 0.9.0 | 0.0.1 |
1. 直接使用提供的Java SDK作为Client进行预测
### 安装
......
......@@ -18,7 +18,7 @@ The following table shows compatibilities between Paddle Serving Server and Java
| Paddle Serving Server version | Java SDK version |
| :---------------------------: | :--------------: |
| 0.8.0 | 0.0.1 |
| 0.9.0 | 0.0.1 |
1. Directly use the provided Java SDK as the client for prediction
### Install Java SDK
......@@ -42,6 +42,4 @@ mvn install:install-file -Dfile=$PWD/paddle-serving-sdk-java-0.0.1.jar -DgroupId
2. Use it after compiling from the source code. See the [document](../java/README.md).
3. examples for using the java client, see the See the [document](../java/README.md).
......@@ -8,13 +8,13 @@
| | develop whl | develop bin | stable whl | stable bin |
|---------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| cpu-avx-mkl | [paddle_serving_server-0.0.0-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server-0.0.0-py3-none-any.whl) | [serving-cpu-avx-mkl-0.0.0.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-cpu-avx-mkl-0.0.0.tar.gz) | [paddle_serving_server-0.8.3-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server-0.8.3-py3-none-any.whl) | [serving-cpu-avx-mkl-0.8.3.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-cpu-avx-mkl-0.8.3.tar.gz) |
| cpu-avx-openblas | [paddle_serving_server-0.0.0-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server-0.0.0-py3-none-any.whl) | [serving-cpu-avx-openblas-0.0.0.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-cpu-avx-openblas-0.0.0.tar.gz) | [paddle_serving_server-0.8.3-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server-0.8.3-py3-none-any.whl) | [serving-cpu-avx-openblas-0.8.3.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-cpu-avx-openblas-0.8.3.tar.gz) |
| cpu-noavx-openblas | [paddle_serving_server-0.0.0-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server-0.0.0-py3-none-any.whl) | [ serving-cpu-noavx-openblas-0.0.0.tar.gz ]( https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-cpu-noavx-openblas-0.0.0.tar.gz) | [paddle_serving_server-0.8.3-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server-0.8.3-py3-none-any.whl) | [serving-cpu-noavx-openblas-0.8.3.tar.gz]( https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-cpu-noavx-openblas-0.8.3.tar.gz) |
| cuda10.1-cudnn7-TensorRT6 | [paddle_serving_server_gpu-0.0.0.post101-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.0.0.post101-py3-none-any.whl) | [serving-gpu-101-0.0.0.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-101-0.0.0.tar.gz) | [paddle_serving_server_gpu-0.8.3.post101-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.8.3.post101-py3-none-any.whl) | [serving-gpu-101-0.8.3.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-101-0.8.3.tar.gz) |
| cuda10.2-cudnn7-TensorRT6 | [paddle_serving_server_gpu-0.0.0.post102-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.0.0.post102-py3-none-any.whl) | [serving-gpu-102-0.0.0.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-102-0.0.0.tar.gz) | [paddle_serving_server_gpu-0.8.3.post102-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.8.3.post102-py3-none-any.whl) | [serving-gpu-102-0.8.3.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-102-0.8.3.tar.gz) |
| cuda10.2-cudnn8-TensorRT7 | [paddle_serving_server_gpu-0.0.0.post1028-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.0.0.post102-py3-none-any.whl) | [ serving-gpu-1028-0.0.0.tar.gz]( https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-1028-0.0.0.tar.gz ) | [paddle_serving_server_gpu-0.8.3.post1028-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.8.3.post102-py3-none-any.whl) | [serving-gpu-1028-0.8.3.tar.gz]( https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-1028-0.8.3.tar.gz ) |
| cuda11.2-cudnn8-TensorRT8 | [paddle_serving_server_gpu-0.0.0.post112-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.0.0.post112-py3-none-any.whl) | [ serving-gpu-112-0.0.0.tar.gz]( https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-112-0.0.0.tar.gz ) | [paddle_serving_server_gpu-0.8.3.post112-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.8.3.post112-py3-none-any.whl) | [serving-gpu-112-0.8.3.tar.gz]( https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-112-0.8.3.tar.gz ) |
| cpu-avx-mkl | [paddle_serving_server-0.0.0-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server-0.0.0-py3-none-any.whl) | [serving-cpu-avx-mkl-0.0.0.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-cpu-avx-mkl-0.0.0.tar.gz) | [paddle_serving_server-0.9.0-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server-0.9.0-py3-none-any.whl) | [serving-cpu-avx-mkl-0.9.0.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-cpu-avx-mkl-0.9.0.tar.gz) |
| cpu-avx-openblas | [paddle_serving_server-0.0.0-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server-0.0.0-py3-none-any.whl) | [serving-cpu-avx-openblas-0.0.0.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-cpu-avx-openblas-0.0.0.tar.gz) | [paddle_serving_server-0.9.0-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server-0.9.0-py3-none-any.whl) | [serving-cpu-avx-openblas-0.9.0.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-cpu-avx-openblas-0.9.0.tar.gz) |
| cpu-noavx-openblas | [paddle_serving_server-0.0.0-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server-0.0.0-py3-none-any.whl) | [ serving-cpu-noavx-openblas-0.0.0.tar.gz ]( https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-cpu-noavx-openblas-0.0.0.tar.gz) | [paddle_serving_server-0.9.0-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server-0.9.0-py3-none-any.whl) | [serving-cpu-noavx-openblas-0.9.0.tar.gz]( https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-cpu-noavx-openblas-0.9.0.tar.gz) |
| cuda10.1-cudnn7-TensorRT6 | [paddle_serving_server_gpu-0.0.0.post101-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.0.0.post101-py3-none-any.whl) | [serving-gpu-101-0.0.0.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-101-0.0.0.tar.gz) | [paddle_serving_server_gpu-0.9.0.post101-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.9.0.post101-py3-none-any.whl) | [serving-gpu-101-0.9.0.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-101-0.9.0.tar.gz) |
| cuda10.2-cudnn7-TensorRT6 | [paddle_serving_server_gpu-0.0.0.post102-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.0.0.post102-py3-none-any.whl) | [serving-gpu-102-0.0.0.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-102-0.0.0.tar.gz) | [paddle_serving_server_gpu-0.9.0.post102-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.9.0.post102-py3-none-any.whl) | [serving-gpu-102-0.9.0.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-102-0.9.0.tar.gz) |
| cuda10.2-cudnn8-TensorRT7 | [paddle_serving_server_gpu-0.0.0.post1028-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.0.0.post102-py3-none-any.whl) | [ serving-gpu-1028-0.0.0.tar.gz]( https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-1028-0.0.0.tar.gz ) | [paddle_serving_server_gpu-0.9.0.post1028-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.9.0.post102-py3-none-any.whl) | [serving-gpu-1028-0.9.0.tar.gz]( https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-1028-0.9.0.tar.gz ) |
| cuda11.2-cudnn8-TensorRT8 | [paddle_serving_server_gpu-0.0.0.post112-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.0.0.post112-py3-none-any.whl) | [ serving-gpu-112-0.0.0.tar.gz]( https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-112-0.0.0.tar.gz ) | [paddle_serving_server_gpu-0.9.0.post112-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.9.0.post112-py3-none-any.whl) | [serving-gpu-112-0.9.0.tar.gz]( https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-112-0.9.0.tar.gz ) |
### 二进制包(Binary Package)
大多数用户不会用到此章节。但是如果你在无网络的环境下部署Paddle Serving,在首次启动Serving时,无法下载二进制tar文件。因此,提供多种环境二进制包的下载链接,下载后传到无网络环境的指定目录下,即可使用。
......@@ -29,16 +29,16 @@
| | develop whl | stable whl |
|-----------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|
| Python3.6 | [paddle_serving_client-0.0.0-cp36-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.0.0-cp36-none-any.whl) | [paddle_serving_client-0.8.3-cp36-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.8.3-cp36-none-any.whl) |
| Python3.7 | [paddle_serving_client-0.0.0-cp37-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.0.0-cp37-none-any.whl) | [paddle_serving_client-0.8.3-cp37-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.8.3-cp37-none-any.whl) |
| Python3.8 | [paddle_serving_client-0.0.0-cp38-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.0.0-cp38-none-any.whl) | [paddle_serving_client-0.8.3-cp38-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.8.3-cp38-none-any.whl) |
| Python3.9 | [paddle_serving_client-0.0.0-cp39-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.0.0-cp39-none-any.whl) | [paddle_serving_client-0.8.3-cp39-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.8.3-cp38-none-any.whl) |
| Python3.6 | [paddle_serving_client-0.0.0-cp36-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.0.0-cp36-none-any.whl) | [paddle_serving_client-0.9.0-cp36-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.9.0-cp36-none-any.whl) |
| Python3.7 | [paddle_serving_client-0.0.0-cp37-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.0.0-cp37-none-any.whl) | [paddle_serving_client-0.9.0-cp37-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.9.0-cp37-none-any.whl) |
| Python3.8 | [paddle_serving_client-0.0.0-cp38-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.0.0-cp38-none-any.whl) | [paddle_serving_client-0.9.0-cp38-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.9.0-cp38-none-any.whl) |
| Python3.9 | [paddle_serving_client-0.0.0-cp39-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.0.0-cp39-none-any.whl) | [paddle_serving_client-0.9.0-cp39-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.9.0-cp39-none-any.whl) |
## paddle-serving-app Wheel包
| | develop whl | stable whl |
|---------|------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------|
| Python3 | [paddle_serving_app-0.0.0-py3-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_app-0.0.0-py3-none-any.whl) | [ paddle_serving_app-0.8.3-py3-none-any.whl ]( https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_app-0.8.3-py3-none-any.whl) |
| Python3 | [paddle_serving_app-0.0.0-py3-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_app-0.0.0-py3-none-any.whl) | [ paddle_serving_app-0.9.0-py3-none-any.whl ]( https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_app-0.9.0-py3-none-any.whl) |
## 百度昆仑芯片
......@@ -62,7 +62,7 @@ https://paddle-serving.bj.bcebos.com/bin/serving-xpu-aarch64-0.0.0.tar.gz
适用于x86 CPU环境的昆仑Wheel包:
```
https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_xpu-0.8.3.post2-py3-none-any.whl
https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_xpu-0.9.0.post2-py3-none-any.whl
```
......
......@@ -8,13 +8,13 @@ Check the following table, and copy the address of hyperlink then run `pip3 inst
| | develop whl | develop bin | stable whl | stable bin |
|---------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| cpu-avx-mkl | [paddle_serving_server-0.0.0-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server-0.0.0-py3-none-any.whl) | [serving-cpu-avx-mkl-0.0.0.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-cpu-avx-mkl-0.0.0.tar.gz) | [paddle_serving_server-0.8.3-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server-0.8.3-py3-none-any.whl) | [serving-cpu-avx-mkl-0.8.3.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-cpu-avx-mkl-0.8.3.tar.gz) |
| cpu-avx-openblas | [paddle_serving_server-0.0.0-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server-0.0.0-py3-none-any.whl) | [serving-cpu-avx-openblas-0.0.0.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-cpu-avx-openblas-0.0.0.tar.gz) | [paddle_serving_server-0.8.3-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server-0.8.3-py3-none-any.whl) | [serving-cpu-avx-openblas-0.8.3.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-cpu-avx-openblas-0.8.3.tar.gz) |
| cpu-noavx-openblas | [paddle_serving_server-0.0.0-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server-0.0.0-py3-none-any.whl) | [ serving-cpu-noavx-openblas-0.0.0.tar.gz ]( https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-cpu-noavx-openblas-0.0.0.tar.gz) | [paddle_serving_server-0.8.3-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server-0.8.3-py3-none-any.whl) | [serving-cpu-noavx-openblas-0.8.3.tar.gz]( https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-cpu-noavx-openblas-0.8.3.tar.gz) |
| cuda10.1-cudnn7-TensorRT6 | [paddle_serving_server_gpu-0.0.0.post101-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.0.0.post101-py3-none-any.whl) | [serving-gpu-101-0.0.0.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-101-0.0.0.tar.gz) | [paddle_serving_server_gpu-0.8.3.post101-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.8.3.post101-py3-none-any.whl) | [serving-gpu-101-0.8.3.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-101-0.8.3.tar.gz) |
| cuda10.2-cudnn7-TensorRT6 | [paddle_serving_server_gpu-0.0.0.post102-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.0.0.post102-py3-none-any.whl) | [serving-gpu-102-0.0.0.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-102-0.0.0.tar.gz) | [paddle_serving_server_gpu-0.8.3.post102-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.8.3.post102-py3-none-any.whl) | [serving-gpu-102-0.8.3.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-102-0.8.3.tar.gz) |
| cuda10.2-cudnn8-TensorRT7 | [paddle_serving_server_gpu-0.0.0.post1028-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.0.0.post102-py3-none-any.whl) | [ serving-gpu-1028-0.0.0.tar.gz]( https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-1028-0.0.0.tar.gz ) | [paddle_serving_server_gpu-0.8.3.post1028-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.8.3.post102-py3-none-any.whl) | [serving-gpu-1028-0.8.3.tar.gz]( https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-1028-0.8.3.tar.gz ) |
| cuda11.2-cudnn8-TensorRT8 | [paddle_serving_server_gpu-0.0.0.post112-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.0.0.post112-py3-none-any.whl) | [ serving-gpu-112-0.0.0.tar.gz]( https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-112-0.0.0.tar.gz ) | [paddle_serving_server_gpu-0.8.3.post112-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.8.3.post112-py3-none-any.whl) | [serving-gpu-112-0.8.3.tar.gz]( https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-112-0.8.3.tar.gz ) |
| cpu-avx-mkl | [paddle_serving_server-0.0.0-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server-0.0.0-py3-none-any.whl) | [serving-cpu-avx-mkl-0.0.0.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-cpu-avx-mkl-0.0.0.tar.gz) | [paddle_serving_server-0.9.0-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server-0.9.0-py3-none-any.whl) | [serving-cpu-avx-mkl-0.9.0.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-cpu-avx-mkl-0.9.0.tar.gz) |
| cpu-avx-openblas | [paddle_serving_server-0.0.0-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server-0.0.0-py3-none-any.whl) | [serving-cpu-avx-openblas-0.0.0.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-cpu-avx-openblas-0.0.0.tar.gz) | [paddle_serving_server-0.9.0-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server-0.9.0-py3-none-any.whl) | [serving-cpu-avx-openblas-0.9.0.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-cpu-avx-openblas-0.9.0.tar.gz) |
| cpu-noavx-openblas | [paddle_serving_server-0.0.0-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server-0.0.0-py3-none-any.whl) | [serving-cpu-noavx-openblas-0.0.0.tar.gz]( https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-cpu-noavx-openblas-0.0.0.tar.gz) | [paddle_serving_server-0.9.0-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server-0.9.0-py3-none-any.whl) | [serving-cpu-noavx-openblas-0.9.0.tar.gz]( https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-cpu-noavx-openblas-0.9.0.tar.gz) |
| cuda10.1-cudnn7-TensorRT6 | [paddle_serving_server_gpu-0.0.0.post101-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.0.0.post101-py3-none-any.whl) | [serving-gpu-101-0.0.0.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-101-0.0.0.tar.gz) | [paddle_serving_server_gpu-0.9.0.post101-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.9.0.post101-py3-none-any.whl) | [serving-gpu-101-0.9.0.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-101-0.9.0.tar.gz) |
| cuda10.2-cudnn7-TensorRT6 | [paddle_serving_server_gpu-0.0.0.post102-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.0.0.post102-py3-none-any.whl) | [serving-gpu-102-0.0.0.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-102-0.0.0.tar.gz) | [paddle_serving_server_gpu-0.9.0.post102-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.9.0.post102-py3-none-any.whl) | [serving-gpu-102-0.9.0.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-102-0.9.0.tar.gz) |
| cuda10.2-cudnn8-TensorRT7 | [paddle_serving_server_gpu-0.0.0.post1028-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.0.0.post102-py3-none-any.whl) | [serving-gpu-1028-0.0.0.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-1028-0.0.0.tar.gz) | [paddle_serving_server_gpu-0.9.0.post1028-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.9.0.post102-py3-none-any.whl) | [serving-gpu-1028-0.9.0.tar.gz]( https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-1028-0.9.0.tar.gz ) |
| cuda11.2-cudnn8-TensorRT8 | [paddle_serving_server_gpu-0.0.0.post112-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.0.0.post112-py3-none-any.whl) | [serving-gpu-112-0.0.0.tar.gz](https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-112-0.0.0.tar.gz) | [paddle_serving_server_gpu-0.9.0.post112-py3-none-any.whl ](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_gpu-0.9.0.post112-py3-none-any.whl) | [serving-gpu-112-0.9.0.tar.gz]( https://paddle-serving.bj.bcebos.com/test-dev/bin/serving-gpu-112-0.9.0.tar.gz ) |
### Binary Package
for most users, we do not need to read this section. But if you deploy your Paddle Serving on a machine without network, you will encounter a problem that the binary executable tar file cannot be downloaded. Therefore, here we give you all the download links for various environment.
......@@ -29,15 +29,15 @@ for most users, we do not need to read this section. But if you deploy your Padd
| | develop whl | stable whl |
|-----------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|
| Python3.6 | [paddle_serving_client-0.0.0-cp36-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.0.0-cp36-none-any.whl) | [paddle_serving_client-0.8.3-cp36-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.8.3-cp36-none-any.whl) |
| Python3.7 | [paddle_serving_client-0.0.0-cp37-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.0.0-cp37-none-any.whl) | [paddle_serving_client-0.8.3-cp37-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.8.3-cp37-none-any.whl) |
| Python3.8 | [paddle_serving_client-0.0.0-cp38-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.0.0-cp38-none-any.whl) | [paddle_serving_client-0.8.3-cp38-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.8.3-cp38-none-any.whl) |
| Python3.9 | [paddle_serving_client-0.0.0-cp39-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.0.0-cp39-none-any.whl) | [paddle_serving_client-0.8.3-cp39-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.8.3-cp38-none-any.whl) |
| Python3.6 | [paddle_serving_client-0.0.0-cp36-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.0.0-cp36-none-any.whl) | [paddle_serving_client-0.9.0-cp36-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.9.0-cp36-none-any.whl) |
| Python3.7 | [paddle_serving_client-0.0.0-cp37-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.0.0-cp37-none-any.whl) | [paddle_serving_client-0.9.0-cp37-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.9.0-cp37-none-any.whl) |
| Python3.8 | [paddle_serving_client-0.0.0-cp38-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.0.0-cp38-none-any.whl) | [paddle_serving_client-0.9.0-cp38-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.9.0-cp38-none-any.whl) |
| Python3.9 | [paddle_serving_client-0.0.0-cp39-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.0.0-cp39-none-any.whl) | [paddle_serving_client-0.9.0-cp39-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_client-0.9.0-cp38-none-any.whl) |
## paddle-serving-app
| | develop whl | stable whl |
|---------|------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------|
| Python3 | [paddle_serving_app-0.0.0-py3-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_app-0.0.0-py3-none-any.whl) | [ paddle_serving_app-0.8.3-py3-none-any.whl ]( https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_app-0.8.3-py3-none-any.whl) |
| Python3 | [paddle_serving_app-0.0.0-py3-none-any.whl](https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_app-0.0.0-py3-none-any.whl) | [ paddle_serving_app-0.9.0-py3-none-any.whl ]( https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_app-0.9.0-py3-none-any.whl) |
## Baidu Kunlun user
......@@ -61,7 +61,7 @@ https://paddle-serving.bj.bcebos.com/bin/serving-xpu-aarch64-0.0.0.tar.gz
for x86 kunlun user
```
https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_xpu-0.8.3.post2-py3-none-any.whl
https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_server_xpu-0.9.0.post2-py3-none-any.whl
```
......
......@@ -4,6 +4,8 @@
低精度部署, 在Intel CPU上支持int8、bfloat16模型,Nvidia TensorRT支持int8、float16模型。
## C++ Serving 部署量化模型
### 通过PaddleSlim量化生成低精度模型
详细见[PaddleSlim量化](https://paddleslim.readthedocs.io/zh_CN/latest/tutorials/quant/overview.html)
......@@ -41,7 +43,12 @@ fetch_map = client.predict(feed={"image": img}, fetch=["score"])
print(fetch_map["score"].reshape(-1))
```
### 参考文档
## Python Pipeline 部署量化模型
请参考 [Python Pipeline 低精度推理](./Python_Pipeline/Pipeline_Features_CN.md#低精度推理)
## 参考文档
* [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim)
* PaddleInference Intel CPU部署量化模型[文档](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_x86_cpu_int8.html)
* PaddleInference NV GPU部署量化模型[文档](https://paddle-inference.readthedocs.io/en/latest/optimize/paddle_trt.html)
# Model Zoo
([English](./Model_Zoo_EN.md)|简体中文)
本页面展示了Paddle Serving目前支持的预训练模型以及下载链接
若您想为Paddle Serving提供新的模型,可通过[pull request](https://github.com/PaddlePaddle/Serving/pulls)提交PR
特别感谢[Padddle wholechain](https://www.paddlepaddle.org.cn/wholechain)以及[PaddleHub](https://www.paddlepaddle.org.cn/hub)为Paddle Serving提供的部分预训练模型
| 模型 | 类型 | 示例使用的框架 | 下载 |
| --- | --- | --- | ---- |
| pp_shitu | PaddleClas | [C++ Serving](../examples/C++/PaddleClas/pp_shitu) | [.tar.gz](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/models/inference/serving/pp_shitu.tar.gz) |
| resnet_v2_50_imagenet | PaddleClas | [C++ Serving](../examples/C++/PaddleClas/resnet_v2_50)</br>[Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet_V2_50) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageClassification/resnet_v2_50_imagenet.tar.gz) | Pipeline Serving, C++ Serving|
| mobilenet_v2_imagenet | PaddleClas | [C++ Serving](../examples/C++/PaddleClas/mobilenet) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageClassification/mobilenet_v2_imagenet.tar.gz) |
| resnet50_vd | PaddleClas | [C++ Serving](../examples/C++/PaddleClas/imagenet)</br>[Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet50_vd) | [.tar.gz](https://paddle-serving.bj.bcebos.com/model/ResNet50_vd.tar) |
| ResNet50_vd_KL | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet50_vd_KL) | [.tar](https://paddle-serving.bj.bcebos.com/model/ResNet50_vd_KL.tar) |
| ResNet50_vd_FPGM | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet50_vd_FPGM) | [.tar](https://paddle-serving.bj.bcebos.com/model/ResNet50_vd_FPGM.tar) |
| ResNet50_vd_PACT | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet50_vd_PACT) | [.tar](https://paddle-serving.bj.bcebos.com/model/ResNet50_vd_PACT.tar) |
| ResNeXt101_vd_64x4d | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ResNeXt101_vd_64x4d) | [.tar](https://paddle-serving.bj.bcebos.com/model/ResNeXt101_vd_64x4d.tar) |
| DarkNet53 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/DarkNet53) | [.tar](https://paddle-serving.bj.bcebos.com/model/DarkNet53.tar) |
| MobileNetV1 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/MobileNetV1) | [.tar](https://paddle-serving.bj.bcebos.com/model/MobileNetV1.tar) |
| MobileNetV2 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/MobileNetV2) | [.tar](https://paddle-serving.bj.bcebos.com/model/MobileNetV2.tar) |
| MobileNetV3_large_x1_0 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/MobileNetV3_large_x1_0) | [.tar](https://paddle-serving.bj.bcebos.com/model/MobileNetV3_large_x1_0.tar) |
| HRNet_W18_C | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/HRNet_W18_C) | [.tar](https://paddle-serving.bj.bcebos.com/model/HRNet_W18_C.tar) |
| ShuffleNetV2_x1_0 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ShuffleNetV2_x1_0) | [.tar](https://paddle-serving.bj.bcebos.com/model/ShuffleNetV2_x1_0.tar) |
| bert_chinese_L-12_H-768_A-12 | PaddleNLP | [C++ Serving](../examples/C++/PaddleNLP/bert)</br>[Pipeline Serving](../examples/Pipeline/PaddleNLP/bert) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SemanticModel/bert_chinese_L-12_H-768_A-12.tar.gz) |
| senta_bilstm | PaddleNLP | [C++ Serving](../examples/C++/PaddleNLP/senta) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SentimentAnalysis/senta_bilstm.tar.gz) |C++ Serving|
| lac | PaddleNLP | [C++ Serving](../examples/C++/PaddleNLP/lac) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/LexicalAnalysis/lac.tar.gz) |
| transformer | PaddleNLP | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/machine_translation/transformer/deploy/serving/README.md) | [model](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/machine_translation/transformer) |
| ELECTRA | PaddleNLP | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/language_model/electra/deploy/serving/README.md) | [model](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/language_model/electra) |
| In-batch Negatives | PaddleNLP | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/neural_search/recall/in_batch_negative) | [model](https://bj.bcebos.com/v1/paddlenlp/models/inbatch_model.zip) |
| criteo_ctr | PaddleRec | [C++ Serving](../examples/C++/PaddleRec/criteo_ctr) | [.tar.gz](https://paddle-serving.bj.bcebos.com/criteo_ctr_example/criteo_ctr_demo_model.tar.gz) |
| criteo_ctr_with_cube | PaddleRec | [C++ Serving](../examples/C++/PaddleRec/criteo_ctr_with_cube) | [.tar.gz](https://paddle-serving.bj.bcebos.com/unittest/ctr_cube_unittest.tar.gz) |
| wide&deep | PaddleRec | [C++ Serving](https://github.com/PaddlePaddle/PaddleRec/blob/release/2.1.0/doc/serving.md) | [model](https://github.com/PaddlePaddle/PaddleRec/blob/release/2.1.0/models/rank/wide_deep/README.md) |
| blazeface | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/blazeface) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ObjectDetection/blazeface.tar.gz) |C++ Serving|
| cascade_mask_rcnn_r50_vd_fpn_ssld_2x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/cascade_rcnn) | [.tar.gz](https://paddle-serving.bj.bcebos.com/pddet_demo/cascade_mask_rcnn_r50_vd_fpn_ssld_2x_coco_serving.tar.gz) |
| yolov4 | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/yolov4) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ObjectDetection/yolov4.tar.gz) |C++ Serving|
| faster_rcnn_hrnetv2p_w18_1x | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/faster_rcnn_hrnetv2p_w18_1x) | [.tar.gz](https://paddle-serving.bj.bcebos.com/pddet_demo/faster_rcnn_hrnetv2p_w18_1x.tar.gz) |
| fcos_dcn_r50_fpn_1x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/fcos_dcn_r50_fpn_1x_coco) | [.tar.gz](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/fcos_dcn_r50_fpn_1x_coco.tar) |
| ssd_vgg16_300_240e_voc | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/ssd_vgg16_300_240e_voc) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/ssd_vgg16_300_240e_voc.tar) |
| yolov3_darknet53_270e_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/yolov3_darknet53_270e_coco)</br>[Pipeline Serving](../examples/Pipeline/PaddleDetection/yolov3) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/yolov3_darknet53_270e_coco.tar) |
| faster_rcnn_r50_fpn_1x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/faster_rcnn_r50_fpn_1x_coco)</br>[Pipeline Serving](../examples/Pipeline/PaddleDetection/faster_rcnn) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/faster_rcnn_r50_fpn_1x_coco.tar) |
| ppyolo_r50vd_dcn_1x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/ppyolo_r50vd_dcn_1x_coco) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/ppyolo_r50vd_dcn_1x_coco.tar) |
| ppyolo_mbv3_large_coco | PaddleDetection | [Pipeline Serving](../examples/Pipeline/PaddleDetection/ppyolo_mbv3) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/ppyolo_mbv3_large_coco.tar) |
| ttfnet_darknet53_1x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/ttfnet_darknet53_1x_coco) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/ttfnet_darknet53_1x_coco.tar) |
| YOLOv3-DarkNet | PaddleDetection | [C++ Serving](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.3/deploy/serving) | [.pdparams](https://paddledet.bj.bcebos.com/models/yolov3_darknet53_270e_coco.pdparams)</br>[.yml](https://github.com/PaddlePaddle/PaddleDetection/blob/develop/configs/yolov3/yolov3_darknet53_270e_coco.yml) |
| ocr_rec | PaddleOCR | [C++ Serving](../examples/C++/PaddleOCR/ocr)</br>[Pipeline Serving](../examples/Pipeline/PaddleOCR/ocr) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/OCR/ocr_rec.tar.gz) |
| ocr_det | PaddleOCR | [C++ Serving](../examples/C++/PaddleOCR/ocr)</br>[Pipeline Serving](../examples/Pipeline/PaddleOCR/ocr) | [.tar.gz](https://paddle-serving.bj.bcebos.com/ocr/ocr_det.tar.gz) |
| ch_ppocr_mobile_v2.0_det | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_det_infer.tar)</br>[.yml](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/configs/det/ch_ppocr_v2.0/ch_det_mv3_db_v2.0.yml) |
| ch_ppocr_server_v2.0_det | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_server_v2.0_det_infer.tar)</br>[.yml](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/configs/det/ch_ppocr_v2.0/ch_det_res18_db_v2.0.yml) |
| ch_ppocr_mobile_v2.0_rec | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_rec_infer.tar)</br>[.yml](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/configs/rec/ch_ppocr_v2.0/rec_chinese_lite_train_v2.0.yml) |
| ch_ppocr_server_v2.0_rec | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_server_v2.0_rec_infer.tar)</br>[.yml](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/configs/rec/ch_ppocr_v2.0/rec_chinese_common_train_v2.0.yml) |
| ch_ppocr_mobile_v2.0 | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://github.com/PaddlePaddle/PaddleOCR) |
| ch_ppocr_server_v2.0 | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://github.com/PaddlePaddle/PaddleOCR) |
| deeplabv3 | PaddleSeg | [C++ Serving](../examples/C++/PaddleSeg/deeplabv3) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageSegmentation/deeplabv3.tar.gz) |
| unet | PaddleSeg | [C++ Serving](../examples/C++/PaddleSeg/unet_for_image_seg) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageSegmentation/unet.tar.gz) |
| PPTSN_K400 | PaddleVideo | [Pipeline Serving](../examples/Pipeline/PaddleVideo/PPTSN_K400) | [model](https://paddle-serving.bj.bcebos.com/model/PaddleVideo/PPTSN_K400.tar) |
- 请参考 [example](../examples) 查看详情
- 更多Paddle Serving支持的部署模型请参考[wholechain](https://www.paddlepaddle.org.cn/wholechain)
- 最新模型可参考
# 模型库
- [模型分类](#1)
- [1.1 图像分类与识别](#1.1)
- [1.2 文本类](#1.2)
- [1.3 推荐系统](#1.3)
- [1.4 人脸识别](#1.4)
- [1.5 目标检测](#1.5)
- [1.6 文字识别](#1.6)
- [1.7 图像分割](#1.7)
- [1.8 关键点检测](#1.8)
- [1.9 视频理解](#1.9)
- [模型示例库](#2)
Paddle Serving 已实现9个类别,共计46个模型的服务化部署示例。
<a name="1"></a>
## 模型分类
<a name="1.1"></a>
**一.图像分类与识别**
模型部署示例请参阅下表:
| 场景| 模型 | 类型 | 示例使用的框架 | 下载 |
| --- | --- | --- | --- | ---- |
| 图像识别 |pp_shitu | PaddleClas | [C++ Serving](../examples/C++/PaddleClas/pp_shitu) | [.tar.gz](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/models/inference/serving/pp_shitu.tar.gz) |
| 图像分类 | resnet_v2_50_imagenet | PaddleClas | [C++ Serving](../examples/C++/PaddleClas/resnet_v2_50)</br>[Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet_V2_50) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageClassification/resnet_v2_50_imagenet.tar.gz) | Pipeline Serving, C++ Serving|
| 图像分类 |mobilenet_v2_imagenet | PaddleClas | [C++ Serving](../examples/C++/PaddleClas/mobilenet) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageClassification/mobilenet_v2_imagenet.tar.gz) |
| 图像分类 |resnet50_vd | PaddleClas | [C++ Serving](../examples/C++/PaddleClas/imagenet)</br>[Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet50_vd) | [.tar.gz](https://paddle-serving.bj.bcebos.com/model/ResNet50_vd.tar) |
| 图像分类 |ResNet50_vd_KL | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet50_vd_KL) | [.tar](https://paddle-serving.bj.bcebos.com/model/ResNet50_vd_KL.tar) |
| 图像分类 |ResNet50_vd_FPGM | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet50_vd_FPGM) | [.tar](https://paddle-serving.bj.bcebos.com/model/ResNet50_vd_FPGM.tar) |
| 图像分类 |ResNet50_vd_PACT | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ResNet50_vd_PACT) | [.tar](https://paddle-serving.bj.bcebos.com/model/ResNet50_vd_PACT.tar) |
| 图像分类 |ResNeXt101_vd_64x4d | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ResNeXt101_vd_64x4d) | [.tar](https://paddle-serving.bj.bcebos.com/model/ResNeXt101_vd_64x4d.tar) |
| 图像分类 |DarkNet53 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/DarkNet53) | [.tar](https://paddle-serving.bj.bcebos.com/model/DarkNet53.tar) |
| 图像分类 |MobileNetV1 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/MobileNetV1) | [.tar](https://paddle-serving.bj.bcebos.com/model/MobileNetV1.tar) |
| 图像分类 |MobileNetV2 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/MobileNetV2) | [.tar](https://paddle-serving.bj.bcebos.com/model/MobileNetV2.tar) |
| 图像分类 |MobileNetV3_large_x1_0 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/MobileNetV3_large_x1_0) | [.tar](https://paddle-serving.bj.bcebos.com/model/MobileNetV3_large_x1_0.tar) |
| 图像生成 |HRNet_W18_C | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/HRNet_W18_C) | [.tar](https://paddle-serving.bj.bcebos.com/model/HRNet_W18_C.tar) |
| 图像分类 |ShuffleNetV2_x1_0 | PaddleClas | [Pipeline Serving](../examples/Pipeline/PaddleClas/ShuffleNetV2_x1_0) | [.tar](https://paddle-serving.bj.bcebos.com/model/ShuffleNetV2_x1_0.tar) |
---
<a name="1.2"></a>
**二.文本类**
模型部署示例请参阅下表:
| 场景| 模型 | 类型 | 示例使用的框架 | 下载 |
| --- | --- | --- | --- | ---- |
| 文本生成 | bert_chinese_L-12_H-768_A-12 | PaddleNLP | [C++ Serving](../examples/C++/PaddleNLP/bert)</br>[Pipeline Serving](../examples/Pipeline/PaddleNLP/bert) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SemanticModel/bert_chinese_L-12_H-768_A-12.tar.gz) |
| 情感分析 |senta_bilstm | PaddleNLP | [C++ Serving](../examples/C++/PaddleNLP/senta) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SentimentAnalysis/senta_bilstm.tar.gz) |C++ Serving|
| 词法分析 |lac | PaddleNLP | [C++ Serving](../examples/C++/PaddleNLP/lac) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/LexicalAnalysis/lac.tar.gz) |
| 机器翻译 |transformer | PaddleNLP | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/machine_translation/transformer/deploy/serving/README.md) | [model](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/machine_translation/transformer) |
| 标点符号预测 | ELECTRA | PaddleNLP | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/examples/language_model/electra/deploy/serving/README.md) | [model](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/language_model/electra) |
| 抽取文本向量| In-batch Negatives | PaddleNLP | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/neural_search/recall/in_batch_negative) | [model](https://bj.bcebos.com/v1/paddlenlp/models/inbatch_model.zip) |
---
<a name="1.3"></a>
**三.推荐系统**
模型部署示例请参阅下表:
| 场景| 模型 | 类型 | 示例使用的框架 | 下载 |
| --- | --- | --- | --- | ---- |
| CTR预估 | criteo_ctr | PaddleRec | [C++ Serving](../examples/C++/PaddleRec/criteo_ctr) | [.tar.gz](https://paddle-serving.bj.bcebos.com/criteo_ctr_example/criteo_ctr_demo_model.tar.gz) |
| CTR预估 | criteo_ctr_with_cube | PaddleRec | [C++ Serving](../examples/C++/PaddleRec/criteo_ctr_with_cube) | [.tar.gz](https://paddle-serving.bj.bcebos.com/unittest/ctr_cube_unittest.tar.gz) |
| 内容推荐 | wide&deep | PaddleRec | [C++ Serving](https://github.com/PaddlePaddle/PaddleRec/blob/release/2.1.0/doc/serving.md) | [model](https://github.com/PaddlePaddle/PaddleRec/blob/release/2.1.0/models/rank/wide_deep/README.md) |
---
<a name="1.4"></a>
**四.人脸识别**
模型部署示例请参阅下表:
| 场景| 模型 | 类型 | 示例使用的框架 | 下载 |
| --- | --- | --- | --- | ---- |
| 人脸识别|blazeface | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/blazeface) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ObjectDetection/blazeface.tar.gz) |C++ Serving|
---
<a name="1.5"></a>
**五.目标检测**
模型部署示例请参阅下表:
| 场景| 模型 | 类型 | 示例使用的框架 | 下载 |
| --- | --- | --- | --- | ---- |
| 目标检测 |cascade_mask_rcnn_r50_vd_fpn_ssld_2x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/cascade_rcnn) | [.tar.gz](https://paddle-serving.bj.bcebos.com/pddet_demo/cascade_mask_rcnn_r50_vd_fpn_ssld_2x_coco_serving.tar.gz) |
| 目标检测 | yolov4 | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/yolov4) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ObjectDetection/yolov4.tar.gz) |C++ Serving|
| 目标检测 |fcos_dcn_r50_fpn_1x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/fcos_dcn_r50_fpn_1x_coco) | [.tar.gz](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/fcos_dcn_r50_fpn_1x_coco.tar) |
| 目标检测 | ssd_vgg16_300_240e_voc | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/ssd_vgg16_300_240e_voc) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/ssd_vgg16_300_240e_voc.tar) |
| 目标检测 |yolov3_darknet53_270e_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/yolov3_darknet53_270e_coco)</br>[Pipeline Serving](../examples/Pipeline/PaddleDetection/yolov3) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/yolov3_darknet53_270e_coco.tar) |
| 目标检测 | faster_rcnn_r50_fpn_1x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/faster_rcnn_r50_fpn_1x_coco)</br>[Pipeline Serving](../examples/Pipeline/PaddleDetection/faster_rcnn) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/faster_rcnn_r50_fpn_1x_coco.tar) |
| 目标检测 |ppyolo_r50vd_dcn_1x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/ppyolo_r50vd_dcn_1x_coco) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/ppyolo_r50vd_dcn_1x_coco.tar) |
| 目标检测 | ppyolo_mbv3_large_coco | PaddleDetection | [Pipeline Serving](../examples/Pipeline/PaddleDetection/ppyolo_mbv3) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/2.0/ppyolo_mbv3_large_coco.tar) |
| 目标检测 | ttfnet_darknet53_1x_coco | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/ttfnet_darknet53_1x_coco) | [.tar](https://paddle-serving.bj.bcebos.com/pddet_demo/ttfnet_darknet53_1x_coco.tar) |
| 目标检测 |YOLOv3-DarkNet | PaddleDetection | [C++ Serving](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.3/deploy/serving) | [.pdparams](https://paddledet.bj.bcebos.com/models/yolov3_darknet53_270e_coco.pdparams)</br>[.yml](https://github.com/PaddlePaddle/PaddleDetection/blob/develop/configs/yolov3/yolov3_darknet53_270e_coco.yml) |
---
<a name="1.6"></a>
**六.文字识别**
模型部署示例请参阅下表:
| 场景| 模型 | 类型 | 示例使用的框架 | 下载 |
| --- | --- | --- | --- | ---- |
| 文字识别 |ocr_rec | PaddleOCR | [C++ Serving](../examples/C++/PaddleOCR/ocr)</br>[Pipeline Serving](../examples/Pipeline/PaddleOCR/ocr) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/OCR/ocr_rec.tar.gz) |
| 文字识别 |ocr_det | PaddleOCR | [C++ Serving](../examples/C++/PaddleOCR/ocr)</br>[Pipeline Serving](../examples/Pipeline/PaddleOCR/ocr) | [.tar.gz](https://paddle-serving.bj.bcebos.com/ocr/ocr_det.tar.gz) |
| 文字识别 |ch_ppocr_mobile_v2.0_det | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_det_infer.tar)</br>[.yml](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/configs/det/ch_ppocr_v2.0/ch_det_mv3_db_v2.0.yml) |
| 文字识别 |ch_ppocr_server_v2.0_det | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_server_v2.0_det_infer.tar)</br>[.yml](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/configs/det/ch_ppocr_v2.0/ch_det_res18_db_v2.0.yml) |
| 文字识别 |ch_ppocr_mobile_v2.0_rec | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_rec_infer.tar)</br>[.yml](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/configs/rec/ch_ppocr_v2.0/rec_chinese_lite_train_v2.0.yml) |
| 文字识别 |ch_ppocr_server_v2.0_rec | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_server_v2.0_rec_infer.tar)</br>[.yml](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/configs/rec/ch_ppocr_v2.0/rec_chinese_common_train_v2.0.yml) |
| 文字识别 |ch_ppocr_mobile_v2.0 | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://github.com/PaddlePaddle/PaddleOCR) |
| 文字识别 |ch_ppocr_server_v2.0 | PaddleOCR | [Pipeline Serving](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/deploy/pdserving/README.md) | [model](https://github.com/PaddlePaddle/PaddleOCR) |
---
<a name="1.7"></a>
**七.图像分割**
模型部署示例请参阅下表:
| 场景| 模型 | 类型 | 示例使用的框架 | 下载 |
| --- | --- | --- | --- | ---- |
| 图像分割 | deeplabv3 | PaddleSeg | [C++ Serving](../examples/C++/PaddleSeg/deeplabv3) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageSegmentation/deeplabv3.tar.gz) |
| 图像分割 | unet | PaddleSeg | [C++ Serving](../examples/C++/PaddleSeg/unet_for_image_seg) | [.tar.gz](https://paddle-serving.bj.bcebos.com/paddle_hub_models/image/ImageSegmentation/unet.tar.gz) |
---
<a name="1.8"></a>
**八.关键点检测**
模型部署示例请参阅下表:
| 场景| 模型 | 类型 | 示例使用的框架 | 下载 |
| --- | --- | --- | --- | ---- |
| 关键点检测 |faster_rcnn_hrnetv2p_w18_1x | PaddleDetection | [C++ Serving](../examples/C++/PaddleDetection/faster_rcnn_hrnetv2p_w18_1x) | [.tar.gz](https://paddle-serving.bj.bcebos.com/pddet_demo/faster_rcnn_hrnetv2p_w18_1x.tar.gz) |
---
<a name="1.9"></a>
**九.视频理解**
模型部署示例请参阅下表:
| 场景| 模型 | 类型 | 示例使用的框架 | 下载 |
| --- | --- | --- | --- | ---- |
| 视频理解 |PPTSN_K400 | PaddleVideo | [Pipeline Serving](../examples/Pipeline/PaddleVideo/PPTSN_K400) | [model](https://paddle-serving.bj.bcebos.com/model/PaddleVideo/PPTSN_K400.tar) |
---
<a name="2"></a>
## 模型示例库
Paddle Serving 代码库下模型部署示例请参考 [examples](../examples) 目录。更多 Paddle Serving 部署模型请参考 [wholechain](https://www.paddlepaddle.org.cn/wholechain)
了解最新模型,请进入 Paddle 模型套件库:
- [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PaddleClas](https://github.com/PaddlePaddle/PaddleClas)
......
# Pipeline Serving 性能优化
([English](./Performance_Tuning_EN.md)|简体中文)
## 1. 性能分析与优化
### 1.1 如何通过 Timeline 工具进行优化
为了更好地对性能进行优化,PipelineServing 提供了 Timeline 工具,对整个服务的各个阶段时间进行打点。
### 1.2 在 Server 端输出 Profile 信息
Server 端用 yaml 中的 `use_profile` 字段进行控制:
```yaml
dag:
use_profile: true
```
开启该功能后,Server 端在预测的过程中会将对应的日志信息打印到标准输出,为了更直观地展现各阶段的耗时,提供 Analyst 模块对日志文件做进一步的分析处理。
使用时先将 Server 的输出保存到文件,以 `profile.txt` 为例,脚本将日志中的时间打点信息转换成 json 格式保存到 `trace` 文件,`trace` 文件可以通过 chrome 浏览器的 tracing 功能进行可视化。
```python
from paddle_serving_server.pipeline import Analyst
import json
import sys
if __name__ == "__main__":
log_filename = "profile.txt"
trace_filename = "trace"
analyst = Analyst(log_filename)
analyst.save_trace(trace_filename)
```
具体操作:打开 chrome 浏览器,在地址栏输入 `chrome://tracing/` ,跳转至 tracing 页面,点击 load 按钮,打开保存的 `trace` 文件,即可将预测服务的各阶段时间信息可视化。
### 1.3 在 Client 端输出 Profile 信息
Client 端在 `predict` 接口设置 `profile=True`,即可开启 Profile 功能。
开启该功能后,Client 端在预测的过程中会将该次预测对应的日志信息打印到标准输出,后续分析处理同 Server。
### 1.4 分析方法
根据pipeline.tracer日志中的各个阶段耗时,按以下公式逐步分析出主要耗时在哪个阶段。
```
单OP耗时:
op_cost = process(pre + mid + post)
OP期望并发数:
op_concurrency = 单OP耗时(s) * 期望QPS
服务吞吐量:
service_throughput = 1 / 最慢OP的耗时 * 并发数
服务平响:
service_avg_cost = ∑op_concurrency 【关键路径】
Channel堆积:
channel_acc_size = QPS(down - up) * time
批量预测平均耗时:
avg_batch_cost = (N * pre + mid + post) / N
```
### 1.5 优化思路
根据长耗时在不同阶段,采用不同的优化方法.
- OP推理阶段(mid-process):
- 增加OP并发度
- 开启auto-batching(前提是多个请求的shape一致)
- 若批量数据中某条数据的shape很大,padding很大导致推理很慢,可使用mini-batch
- 开启TensorRT/MKL-DNN优化
- 开启低精度推理
- OP前处理阶段(pre-process):
- 增加OP并发度
- 优化前处理逻辑
- in/out耗时长(channel堆积>5)
- 检查channel传递的数据大小和延迟
- 优化传入数据,不传递数据或压缩后再传入
- 增加OP并发度
- 减少上游OP并发度
# Pipeline Serving Performance Optimization
(English|[简体中文](./Performance_Tuning_CN.md))
## 1. Performance analysis and optimization
### 1.1 How to optimize with the timeline tool
In order to better optimize the performance, PipelineServing provides a timeline tool to monitor the time of each stage of the whole service.
### 1.2 Output profile information on server side
The server is controlled by the `use_profile` field in yaml:
```yaml
dag:
use_profile: true
```
After the function is enabled, the server will print the corresponding log information to the standard output in the process of prediction. In order to show the time consumption of each stage more intuitively, Analyst module is provided for further analysis and processing of log files.
The output of the server is first saved to a file. Taking `profile.txt` as an example, the script converts the time monitoring information in the log into JSON format and saves it to the `trace` file. The `trace` file can be visualized through the tracing function of Chrome browser.
```shell
from paddle_serving_server.pipeline import Analyst
import json
import sys
if __name__ == "__main__":
log_filename = "profile.txt"
trace_filename = "trace"
analyst = Analyst(log_filename)
analyst.save_trace(trace_filename)
```
Specific operation: open Chrome browser, input in the address bar `chrome://tracing/` , jump to the tracing page, click the load button, open the saved `trace` file, and then visualize the time information of each stage of the prediction service.
### 1.3 Output profile information on client side
The profile function can be enabled by setting `profile=True` in the `predict` interface on the client side.
After the function is enabled, the client will print the log information corresponding to the prediction to the standard output during the prediction process, and the subsequent analysis and processing are the same as that of the server.
### 1.4 Analytical methods
According to the time consumption of each stage in the pipeline.tracer log, the following formula is used to gradually analyze which stage is the main time consumption.
```
cost of one single OP:
op_cost = process(pre + mid + post)
OP Concurrency:
op_concurrency = op_cost(s) * qps_expected
Service throughput:
service_throughput = 1 / slowest_op_cost * op_concurrency
Service average cost:
service_avg_cost = ∑op_concurrency in critical Path
Channel accumulations:
channel_acc_size = QPS(down - up) * time
Average cost of batch predictor:
avg_batch_cost = (N * pre + mid + post) / N
```
### 1.5 Optimization ideas
According to the long time consuming in stages below, different optimization methods are adopted.
- OP Inference stage(mid-process):
- Increase `concurrency`
- Turn on `auto-batching`(Ensure that the shapes of multiple requests are consistent)
- Use `mini-batch`, If the shape of data is very large.
- Turn on TensorRT for GPU
- Turn on MKLDNN for CPU
- Turn on low precison inference
- OP preprocess or postprocess stage:
- Increase `concurrency`
- Optimize processing logic
- In/Out stage(channel accumulation > 5):
- Check the size and delay of the data passed by the channel
- Optimize the channel to transmit data, do not transmit data or compress it before passing it in
- Increase `concurrency`
- Decrease `concurrency` upstreams.
本次提测的Serving版本,支持GPU预测,希望以此任务为例,对Paddle Serving支持GPU预测的性能给出测试数据。
# Python Pipeline 性能测试
## 1. 测试环境说明
- [测试环境](#1)
- [性能指标与结论](#2)
<a name="1"></a>
## 测试环境
测试环境如下表所示:
| | GPU | 显存 | CPU | 内存 |
|----------|---------|----------|----------------------------------------------|------|
| Serving端 | 4x Tesla P4-8GB | 7611MiB | Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz 48核 | 216G |
......@@ -10,7 +16,15 @@
使用单卡GPU,未开启TensorRT。
模型:ResNet_v2_50
## 2. PaddleServing-PipeLine(python)
<a name="2"></a>
## 性能指标与结论
通过测试,使用 Python Pipeline 模式通过多进程并发,充分利用 GPU 显卡,具有较好的吞吐性能。
测试数据如下:
|model_name |thread_num |batch_size |CPU_util(%) |GPU_memory(mb) |GPU_util(%) |qps(samples/s) |total count |mean(ms) |median(ms) |80 percent(ms) |90 percent(ms) |99 percent(ms) |total cost(s) |each cost(s)|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--
|ResNet_v2_50 |1 |1 |2.2 |3327 |17.25 |17.633658869240787 |355 |56.428481238996476 |38.646728515625 |39.496826171875 |39.98369140625 |1273.1911083984373 |20.131953477859497 |20.033540725708008|
......
# Pipeline Serving
# Python Pipeline 框架设计
(简体中文|[English](Pipeline_Design_EN.md))
- [架构设计](Pipeline_Design_CN.md#1架构设计)
- [详细设计](Pipeline_Design_CN.md#2详细设计)
- [典型示例](Pipeline_Design_CN.md#3典型示例)
- [高阶用法](Pipeline_Design_CN.md#4高阶用法)
- [日志追踪](Pipeline_Design_CN.md#5日志追踪)
- [性能分析与优化](Pipeline_Design_CN.md#6性能分析与优化)
- [目标](#1)
- [框架设计](#2)
- [2.1 网络层设计](#2.1)
- [2.2 图执行引擎层](#2.2)
- [2.3 服务日志](#2.3)
- [2.4 错误信息](#2.4)
- [自定义信息](#3)
- [3.1 自定义 Web 服务 URL](#3.1)
- [3.2 自定义服务输入和输出结构](#3.2)
- [3.3 自定义服务并发和模型配置](#3.3)
- [3.4 自定义推理过程](#3.4)
- [3.5 自定义业务错误类型](#3.5)
在许多深度学习框架中,Serving通常用于单模型的一键部署。在AI工业大生产的背景下,端到端的深度学习模型当前还不能解决所有问题,多个深度学习模型配合起来使用还是解决现实问题的常规手段。但多模型应用设计复杂,为了降低开发和维护难度,同时保证服务的可用性,通常会采用串行或简单的并行方式,但一般这种情况下吞吐量仅达到可用状态,而且GPU利用率偏低。
<a name="1"></a>
Paddle Serving提供了用户友好的多模型组合服务编程框架,Pipeline Serving,旨在降低编程门槛,提高资源使用率(尤其是GPU设备),提升整体的预估效率。
## 目标
为了解决多个深度学习模型组合的复杂问题,Paddle Serving 团队设计了一个通用端到端多模型组合框架,其核心特点包括:
1. 通用性:框架既要满足通用模型的输入类型,又要满足模型组合的复杂拓扑关系。
2. 高性能:与常见互联网后端服务不同,深度学习模型的推理程序属于计算密集型程序,同时 GPU 等计算芯片价格昂贵,因此在平均响应时间不苛刻的场景下,计算资源占用和吞吐量指标格外重要。
3. 高可用性:高可用的架构依赖每个服务的健壮性,服务状态可查询、异常可监控和管理是必备条件。
4. 易于开发与调试:使用 Python 语言开发可大幅提升研发效率,运行的错误信息准确帮助开发者快速定位问题。
## 1.架构设计
<a name="2"></a>
Server端基于<b>RPC服务层</b><b>图执行引擎</b>构建,两者的关系如下图所示。
## 框架设计
Python Pipeline 框架分为网络服务层和图执行引擎2部分,网络服务层处理多种网络协议请求和通用输入参数问题,图执行引擎层解决复杂拓扑关系。如下图所示
<div align=center>
<img src='../images/pipeline_serving-image1.png' height = "250" align="middle"/>
</div>
</n>
<a name="2.1"></a>
### 1.1 RPC服务层
**一.网络服务层**
为满足用户不同的使用需求,RPC服务层同时启动1个Web服务器和1个RPC服务器,可同时处理RESTful API、gRPC 2种类型请求。gPRC gateway接收RESTful API请求通过反向代理服务器将请求转发给gRPC Service;gRPC请求由gRPC service接收,所以,2种类型的请求统一由gRPC Service处理,确保处理逻辑一致
网络服务层包括了 gRPC-gateway 和 gRPC Server。gPRC gateway 接收 HTTP 请求,打包成 proto 格式后转发给 gRPC Server,一套处理程序可同时处理 HTTP、gRPC 2种类型请求
#### <b>1.1.1 proto的输入输出结构</b>
gRPC服务和gRPC gateway服务统一用service.proto生成。
另外,在支持多种模型的输入输出数据类型上,使用统一的 service.proto 结构,具有更好的通用性。
```proto
message Request {
......@@ -50,100 +59,60 @@ message Response {
repeated string value = 4;
};
```
Request`key``value`是配对的string数组用于接收数据。 `name``method`对应RESTful API的URL://{ip}:{port}/{name}/{method}。`logid``clientip`便于用户串联服务级请求和自定义策略。
Request 是输入结构,`key``value` 是配对的 string 数组。 `name``method` 对应 URL://{ip}:{port}/{name}/{method}。`logid``clientip` 便于用户串联服务级请求和自定义策略。
Response`err_no``err_msg`表达处理结果的正确性和错误信息,`key``value`为返回结果。
Response 是输出结构,`err_no``err_msg` 表达处理结果的正确性和错误信息,`key``value`结果。
Pipeline 服务包装了继承于 WebService 类,以 [OCR 示例](https://github.com/PaddlePaddle/Serving/tree/develop/examples/Pipeline/PaddleOCR/ocr)为例,派生出 OcrService 类,get_pipeline_response 函数内实现 DAG 拓扑关系,默认服务入口为 read_op,函数返回的 Op 为最后一个处理,此处要求最后返回的 Op 必须唯一。
### 1.2 图执行引擎
所有服务和模型的所有配置信息在 `config.yml` 中记录,URL 的 name 字段由 OcrService 初始化定义;run_service 函数启动服务。
图执行引擎由 OP 和 Channel 构成,相连接的 OP 之间会共享一个 Channel。
```python
class OcrService(WebService):
def get_pipeline_response(self, read_op):
det_op = DetOp(name="det", input_ops=[read_op])
rec_op = RecOp(name="rec", input_ops=[det_op])
return rec_op
ocr_service = OcrService(name="ocr")
ocr_service.prepare_pipeline_config("config.yml")
ocr_service.run_service()
```
- Channel 可以理解为一个缓冲队列。每个 OP 只接受一个 Channel 的输入和多个 Channel 的输出(每个输出相同);一个 Channel 可以包含来自多个 OP 的输出,同一个 Channel 的数据可以作为多个 OP 的输入Channel
- 用户只需要定义 OP 间的关系,在编译期图引擎负责分析整个图的依赖关系,并声明Channel
- Request 进入图执行引擎服务后会产生一个 Request Id,Reponse 会通过 Request Id 进行对应的返回
- 对于 OP 之间需要传输过大数据的情况,可以考虑 RAM DB 外存进行全局存储,通过在 Channel 中传递索引的 Key 来进行数据传输
<div align=center>
<img src='../images/pipeline_serving-image2.png' height = "300" align="middle"/>
</div>
与网络框架相关的配置在 `config.yml` 中设置。其中 `worker_num` 表示框架主线程 gRPC 线程池工作线程数,可理解成网络同步线程并发数。
其次,`rpc_port``http_port` 是服务端口,可同时开启,不允许同时为空。
```
worker_num: 10
#### <b>1.2.1 OP的设计</b>
# http 和 gRPC 服务端口
rpc_port: 9988
http_port: 18089
```
- 单个 OP 默认的功能是根据输入的 Channel 数据,访问一个 Paddle Serving 的单模型服务,并将结果存在输出的 Channel
- 单个 OP 可以支持用户自定义,包括 preprocess,process,postprocess 三个函数都可以由用户继承和实现
- 单个 OP 可以控制并发数,从而增加处理并发数
- 单个 OP 可以获取多个不同 RPC 请求的数据,以实现 Auto-Batching
- OP 可以由线程或进程启动
<a name="2.2"></a>
#### <b>1.2.2 Channel的设计</b>
**二.图执行引擎层**
- Channel 是 OP 之间共享数据的数据结构,负责共享数据或者共享数据状态信息
- Channel 可以支持多个OP的输出存储在同一个 Channel,同一个 Channel 中的数据可以被多个 OP 使用
- 下图为图执行引擎中 Channel 的设计,采用 input buffer 和 output buffer 进行多 OP 输入或多 OP 输出的数据对齐,中间采用一个 Queue 进行缓冲
图执行引擎的设计思路是基于有向无环图实现多模型组合的复杂拓扑关系,有向无环图由单节点或多节点串联、并联结构构成。
<div align=center>
<img src='../images/pipeline_serving-image3.png' height = "500" align="middle"/>
<img src='../images/pipeline_serving-image2.png' height = "300" align="middle"/>
</div>
#### <b>1.2.3 预测类型的设计</b>
- OP的预测类型(client_type)有3种类型,brpc、grpc和local_predictor,各自特点如下:
- brpc: 使用bRPC Client与远端的Serving服务网络交互,性能优于grpc,但仅支持Linux平台
- grpc: 使用gRPC Client与远端的Serving服务网络交互,支持跨操作系统部署,性能弱于bprc
- local_predictor: 本地服务内加载模型并完成预测,不需要网络交互,延时更低,支持Linux部署。支持本机多卡部署和TensorRT实现高性能预测。
- 选型:
- 延时(越少越好): local_predictor < brpc <= grpc
- 操作系统:grpc > local_precitor >= brpc
- 微服务: brpc或grpc模型分拆成独立服务,简化开发和部署复杂度,提升资源利用率
#### <b>1.2.4 极端情况的考虑</b>
- `请求超时的处理`
整个图执行引擎每一步都有可能发生超时,图执行引擎里面通过设置 timeout 值来控制,任何环节超时的请求都会返回超时响应。
- `Channel 存储的数据过大`
Channel 中可能会存储过大的数据,导致拷贝等耗时过高,图执行引擎里面可以通过将 OP 计算结果数据存储到外存,如高速的内存 KV 系统
图执行引擎抽象归纳出2种数据结构 Op 节点和 Channel 有向边,构建一条异步流水线工作流。核心概念和设计思路如下:
- Op 节点: 可理解成1个推理模型、一个处理方法,甚至是训练前向代码,可独立运行,独立设置并发度。每个 Op 节点的计算结果放入其绑定的 Channel 中。
- Channel 数据管道: 可理解为一个单向缓冲队列。每个 Channel 只接收上游 Op 节点的计算输出,作为下游 Op 节点的输入。
- 工作流:根据用户定义的节点依赖关系,图执行引擎自动生成有向无环图。每条用户请求到达图执行引擎时会生成一个唯一自增 ID,通过这种唯一性绑定关系标记流水线中的不同请求。
- `Channel 设计中的 input buffer 和 output buffer 是否会无限增加`
Op 的设计原则:
- 单个 Op 默认的功能是根据输入的 Channel 数据,访问一个 Paddle Serving 的单模型服务,并将结果存在输出的 Channel
- 单个 Op 可以支持用户自定义,包括 preprocess,process,postprocess 三个函数都可以由用户继承和实现
- 单个 Op 可以控制并发数,从而增加处理并发数
- 单个 Op 可以获取多个不同 RPC 请求的数据,以实现 Auto-Batching
- Op 可以由线程或进程启动
- 不会。整个图执行引擎的输入会放到一个 Channel 的 internal queue 里面,直接作为整个服务的流量控制缓冲队列
- 对于 input buffer,根据计算量的情况调整 OP1 和 OP2 的并发数,使得 input buffer 来自各个输入 OP 的数量相对平衡(input buffer 的长度取决于 internal queue 中每个 item 完全 ready 的速度)
- 对于 output buffer,可以采用和 input buffer 类似的处理方法,即调整 OP3 和 OP4 的并发数,使得 output buffer 的缓冲长度得到控制(output buffer 的长度取决于下游 OP 从 output buffer 获取数据的速度)
- 同时 Channel 中数据量不会超过 gRPC 的 `worker_num`,即线程池大小
***
## 2.详细设计
对于Pipeline的设计实现,首先介绍PipelineServer、OP、重写OP前后处理,最后介绍特定OP(RequestOp和ResponseOp)二次开发的方法。
### 2.1 PipelineServer定义
PipelineServer包装了RPC运行层和图引擎执行,所有Pipeline服务首先要实例化PipelineServer示例,再设置2个核心接口 set_response_op、加载配置信息,最后调用run_server启动服务。代码示例如下:
```python
server = PipelineServer()
server.set_response_op(response_op)
server.prepare_server(config_yml_path)
#server.prepare_pipeline_config(config_yml_path)
server.run_server()
```
PipelineServer的核心接口:
- `set_response_op`,设置response_op 将会根据各个 OP 的拓扑关系初始化 Channel 并构建计算图。
- `prepare_server`: 加载配置信息,并启动远端Serving服务,适用于调用远端远端推理服务
- `prepare_pipeline_config`,仅加载配置信息,适用于local_prdict
- `run_server`,启动gRPC服务,接收请求
### 2.2 OP 定义
普通 OP 作为图执行引擎中的基本单元,其构造函数如下:
其构造函数如下:
```python
def __init__(name=None,
......@@ -160,37 +129,233 @@ def __init__(name=None,
local_service_handler=None)
```
各参数含义如下
各参数含义如下
| 参数名 | 类型 | 含义 |
| :-------------------: | :---------: |:------------------------------------------------: |
| name | (str) | 用于标识 OP 类型的字符串,该字段必须全局唯一。 |
| input_ops | (list) | 当前 OP 的所有前继 OP 的列表。 |
| name | (str) | 用于标识 Op 类型的字符串,该字段必须全局唯一。 |
| input_ops | (list) | 当前 Op 的所有前继 Op 的列表。 |
| server_endpoints | (list) |远程 Paddle Serving Service 的 endpoints 列表。如果不设置该参数,认为是local_precditor模式,从local_service_conf中读取配置。 |
| fetch_list | (list) |远程 Paddle Serving Service 的 fetch 列表。 |
| client_config | (str) |Paddle Serving Service 对应的 Client 端配置文件路径。 |
| client_type | (str) |可选择brpc、grpc或local_predictor。local_predictor不启动Serving服务,进程内预测。 |
| concurrency | (int) | OP 的并发数。 |
| concurrency | (int) | Op 的并发数。 |
| timeout | (int) |process 操作的超时时间,单位为毫秒。若该值小于零,则视作不超时。 |
| retry | (int) |超时重试次数。当该值为 1 时,不进行重试。 |
| batch_size | (int) |进行 Auto-Batching 的期望 batch_size 大小,由于构建 batch 可能超时,实际 batch_size 可能小于设定值,默认为 1。 |
| auto_batching_timeout | (float) |进行 Auto-Batching 构建 batch 的超时时间,单位为毫秒。batch_size > 1时,要设置auto_batching_timeout,否则请求数量不足batch_size时会阻塞等待。 |
| local_service_handler | (object) |local predictor handler,Op init()入参赋值 或 在Op init()中创建|
| local_service_handler | (object) |local predictor handler,Op init() 入参赋值或在 Op init() 中创建|
对于 Op 之间需要传输过大数据的情况,可以考虑 RAM DB 外存进行全局存储,通过在 Channel 中传递索引的 Key 来进行数据传输
Channel的设计原则:
- Channel 是 Op 之间共享数据的数据结构,负责共享数据或者共享数据状态信息
- Channel 可以支持多个OP的输出存储在同一个 Channel,同一个 Channel 中的数据可以被多个 Op 使用
下图为图执行引擎中 Channel 的设计,采用 input buffer 和 output buffer 进行多 Op 输入或多 Op 输出的数据对齐,中间采用一个 Queue 进行缓冲
<div align=center>
<img src='../images/pipeline_serving-image3.png' height = "500" align="middle"/>
</div>
<a name="2.3"></a>
**三.服务日志**
Pipeline 服务日志在当前目录的 `PipelineServingLogs` 目录下,有3种类型日志,分别是 `pipeline.log``pipeline.log.wf``pipeline.tracer`
- `pipeline.log` : 记录 debug & info日志信息
- `pipeline.log.wf` : 记录 warning & error日志
- `pipeline.tracer` : 统计各个阶段耗时、channel 堆积信息
```
├── config.yml
├── get_data.sh
├── PipelineServingLogs
│   ├── pipeline.log
│   ├── pipeline.log.wf
│   └── pipeline.tracer
├── README_CN.md
├── README.md
├── uci_housing_client
│   ├── serving_client_conf.prototxt
│   └── serving_client_conf.stream.prototxt
├── uci_housing_model
│   ├── fc_0.b_0
│   ├── fc_0.w_0
│   ├── __model__
│   ├── serving_server_conf.prototxt
│   └── serving_server_conf.stream.prototxt
├── web_service_java.py
└── web_service.py
```
在服务发生异常时,错误信息会记录在 pipeline.log.wf 日志中。打印 tracer 日志要求在 config.yml 的 DAG 属性中添加 tracer 配置。
1. 日志与请求的唯一标识
Pipeline 中有2种 id 用以串联请求,分别是 data_id 和 log_id,二者区别如下:
- data_id : Pipeline 框架生成的自增 ID,标记请求唯一性标识
- log_id : 上游模块传入的标识,跟踪多个服务间串联关系,由于用户可不传入或不保证唯一性,因此不能作为唯一性标识
通常,Pipeline 框架打印的日志会同时带上 data_id 和 log_id。开启 auto-batching 后,会使用批量中的第一个 data_id 标记 batch 整体,同时框架会在一条日志中打印批量中所有 data_id。
### 2.3 重写OP前后处理
OP 二次开发的目的是满足业务开发人员控制OP处理策略。
2. 日志滚动
Pipeline 的日志模块在 `logger.py` 中定义,使用了 `logging.handlers.RotatingFileHandler` 支持磁盘日志文件的轮换。根据不同文件级别和日质量分别设置了 `maxBytes``backupCount`,当即将超出预定大小时,将关闭旧文件并打开一个新文件用于输出。
```python
"handlers": {
"f_pipeline.log": {
"class": "logging.handlers.RotatingFileHandler",
"level": "INFO",
"formatter": "normal_fmt",
"filename": os.path.join(log_dir, "pipeline.log"),
"maxBytes": 512000000,
"backupCount": 20,
},
"f_pipeline.log.wf": {
"class": "logging.handlers.RotatingFileHandler",
"level": "WARNING",
"formatter": "normal_fmt",
"filename": os.path.join(log_dir, "pipeline.log.wf"),
"maxBytes": 512000000,
"backupCount": 10,
},
"f_tracer.log": {
"class": "logging.handlers.RotatingFileHandler",
"level": "INFO",
"formatter": "tracer_fmt",
"filename": os.path.join(log_dir, "pipeline.tracer"),
"maxBytes": 512000000,
"backupCount": 5,
},
}
```
<a name="2.4"></a>
**四. 错误信息**
框架提供的错误信息如下所示, 完整信息在 `error_catch.py``CustomExceptionCode` 类中定义。
| 错误码 | 说明 |
| :---: | :-------------: |
| 0 | 成功 |
| 50 ~ 999 | 产品错误 |
| 3000 ~ 3999 | 框架内部服务错误 |
| 4000 ~ 4999 | 配置错误 |
| 5000 ~ 5999 | 用户输入错误 |
| 6000 ~ 6999 | 超时错误 |
| 7000 ~ 7999 | 类型检查错误 |
| 8000 ~ 8999 | 内部通讯错误 |
| 9000 ~ 9999 | 推理错误 |
| 10000 ~ | 其他错误 |
具体错误信息如下:
```
class CustomExceptionCode(enum.Enum):
OK = 0
PRODUCT_ERROR = 50
NOT_IMPLEMENTED = 3000
CLOSED_ERROR = 3001
NO_SERVICE = 3002
INIT_ERROR = 3003
CONF_ERROR = 4000
INPUT_PARAMS_ERROR = 5000
TIMEOUT = 6000
TYPE_ERROR = 7000
RPC_PACKAGE_ERROR = 8000
CLIENT_ERROR = 9000
UNKNOW = 10000
```
<a name="3"></a>
## 自定义信息
提供给开发者提供以下自定义信息,包括自定义 Web 服务、自定义服务输入和输出结构、自定义服务并发和模型配置和自定义推理过程
- 自定义 Web 服务 URL
- 自定义服务输入和输出结构
- 自定义服务并发和模型配置
- 自定义推理过程
- 自定义业务错误类型
<a name="3.1"></a>
**一.自定义 Web 服务 URL**
在 Web 服务中自定义服务名称是常见操作,尤其是将已有服务迁移到新框架。URL 中核心字段包括 `ip``port``name``method`,根据最新部署的环境信息设置前2个字段,重点介绍如何设置 `name``method`,框架提供默认的 `methon``prediciton`,如 `http://127.0.0.1:9999/ocr/prediction`
框架有2处代码与此相关,分别是 gRPC Gateway 的配置文件 `python/pipeline/gateway/proto/gateway.proto` 和 服务启动文件 `web_server.py`
业务场景中通过设置 `name` 和 验证 `method` 解决问题。以 [OCR 示例]()为例,服务启动文件 `web_server.py` 通过类 `OcrService` 构造函数的 `name` 字段设置 URL 中 `name` 字段;
```
ocr_service = OcrService(name="ocr")
ocr_service.prepare_pipeline_config("config.yml")
ocr_service.run_service()
```
框架提供默认的 `methon``prediciton`,通过重载 `RequestOp::unpack_request_package` 来验证 `method`
```
def unpack_request_package(self, request):
dict_data = {}
log_id = None
if request is None:
_LOGGER.critical("request is None")
raise ValueError("request is None")
if request.method is not "prediction":
_LOGGER.critical("request method error")
raise ValueError("request method error")
...
```
`python/pipeline/gateway/proto/gateway.proto` 文件可以对 `name``method` 做严格限制,一般不需要修改,如需要特殊指定修改后,需要重新编译 Paddle Serving,[编译方法]()
```proto
service PipelineService {
rpc inference(Request) returns (Response) {
option (google.api.http) = {
post : "/{name=*}/{method=*}"
body : "*"
};
}
};
```
<a name="3.2"></a>
**二.自定义服务输入和输出结构**
输入和输出结构包括 proto 中 Request 和 Response 结构,以及 Op 前后处理返回。
当默认 proto 结构不满足业务需求时,同时下面2个文件的 proto 的 Request 和 Response message 结构,保持一致。
- pipeline/gateway/proto/gateway.proto
- pipeline/proto/pipeline_service.proto
修改后,需要重新[编译 Paddle Serving](../Compile_CN.md)
<a name="3.3"></a>
**三.自定义服务并发和模型配置**
完整的配置信息可参考[配置信息](../Serving_Configure_CN.md)
<a name="3.4"></a>
**四.自定义推理过程**
推理 Op 为开发者提供3个外部函数接口:
| 变量或接口 | 说明 |
| :----------------------------------------------: | :----------------------------------------------------------: |
| def preprocess(self, input_dicts) | 对从 Channel 中获取的数据进行处理,处理完的数据将作为 **process** 函数的输入。(该函数对一个 **sample** 进行处理) |
| def process(self, feed_dict_list, typical_logid) | 基于 Paddle Serving Client 进行 RPC 预测,处理完的数据将作为 **postprocess** 函数的输入。(该函数对一个 **batch** 进行处理) |
| def postprocess(self, input_dicts, fetch_dict) | 处理预测结果,处理完的数据将被放入后继 Channel 中,以被后继 OP 获取。(该函数对一个 **sample** 进行处理) |
| def postprocess(self, input_dicts, fetch_dict) | 处理预测结果,处理完的数据将被放入后继 Channel 中,以被后继 Op 获取。(该函数对一个 **sample** 进行处理) |
| def init_op(self) | 用于加载资源(如字典等)。 |
| self.concurrency_idx | 当前进程(非线程)的并发数索引(不同种类的 OP 单独计算)。 |
| self.concurrency_idx | 当前进程(非线程)的并发数索引(不同种类的 Op 单独计算)。 |
OP 在一个运行周期中会依次执行 preprocess,process,postprocess 三个操作(当不设置 `server_endpoints` 参数时,不执行 process 操作),用户可以对这三个函数进行重写,默认实现如下:
Op 在一个运行周期中会依次执行 preprocess,process,postprocess 三个操作(当不设置 `server_endpoints` 参数时,不执行 process 操作),用户可以对这三个函数进行重写,默认实现如下:
```python
def preprocess(self, input_dicts):
......@@ -219,7 +384,7 @@ def postprocess(self, input_dicts, fetch_dict):
return fetch_dict
```
**preprocess** 的参数是前继 Channel 中的数据 `input_dicts`,该变量(作为一个 **sample**)是一个以前继 OP 的 name 为 Key,对应 OP 的输出为 Value 的字典。
**preprocess** 的参数是前继 Channel 中的数据 `input_dicts`,该变量(作为一个 **sample**)是一个以前继 Op 的 name 为 Key,对应 Op 的输出为 Value 的字典。
**process** 的参数是 Paddle Serving Client 预测接口的输入变量 `fetch_dict_list`(preprocess 函数的返回值的列表),该变量(作为一个 **batch**)是一个列表,列表中的元素为以 feed_name 为 Key,对应 ndarray 格式的数据为 Value 的字典。`typical_logid` 作为向 PaddleServingService 穿透的 logid。
......@@ -232,11 +397,13 @@ def init_op(self):
pass
```
需要**注意**的是,在线程版 OP 中,每个 OP 只会调用一次该函数,故加载的资源必须要求是线程安全的。
### 2.4 RequestOp 定义 与 二次开发接口
RequestOp 和 ResponseOp 是 Python Pipeline 的中2个特殊 Op,分别是用分解 RPC 数据加入到图执行引擎中,和拿到图执行引擎的预测结果并打包 RPC 数据到客户端。
RequestOp 类的设计如下所示,核心是在 unpack_request_package 函数中解析请求数据,因此,当修改 Request 结构后重写此函数实现全新的解包处理。
RequestOp 用于处理 Pipeline Server 接收到的 RPC 数据,处理后的数据将会被加入到图执行引擎中。其功能实现如下:
| 接口 | 说明 |
| :---------------------------------------: | :----------------------------------------: |
| init_op(self) | OP初始化,设置默认名称@DAGExecutor |
| unpack_request_package(self, request) | 解析请求数据 |
```python
class RequestOp(Op):
......@@ -267,17 +434,12 @@ class RequestOp(Op):
return dict_data, log_id, None, ""
```
**unpack_request_package** 的默认实现是将 RPC request 中的 key 和 value 做成字典交给第一个自定义OP。当默认的RequestOp无法满足参数解析需求时,可通过重写下面2个接口自定义请求参数解析方法
ResponseOp 类的设计如下所示,核心是在 pack_response_package 中打包返回结构,因此修改 Response 结构后重写此函数实现全新的打包格式
| 接口 | 说明 |
| :---------------------------------------: | :----------------------------------------: |
| init_op(self) | OP初始化,设置默认名称@DAGExecutor |
| unpack_request_package(self, request) | 处理接收的RPC数据 |
### 2.5 ResponseOp 定义 与 二次开发接口
ResponseOp 用于处理图执行引擎的预测结果,处理后的数据将会作为 Pipeline Server 的RPC 返回值,其函数实现如下,在pack_response_package中做了精简
| 接口 | 说明 |
| :------------------------------------------: | :-----------------------------------------: |
| init_op(self) | Op 初始化,设置默认名称 @DAGExecutor |
| pack_response_package(self, channeldata) | 处理接收的 RPC 数据 |
```python
class ResponseOp(Op):
......@@ -306,267 +468,11 @@ class ResponseOp(Op):
return resp
```
**pack_response_package** 的默认实现是将预测结果的字典转化为 RPC response 中的 key 和 value。当默认的 ResponseOp 无法满足结果返回格式要求时,可通过重写下面2个接口自定义返回包打包方法。
| 接口 | 说明 |
| :------------------------------------------: | :-----------------------------------------: |
| init_op(self) | OP初始化,设置默认名称@DAGExecutor |
| pack_response_package(self, channeldata) | 处理接收的RPC数据 |
***
## 3.典型示例
所有Pipeline示例在[examples/Pipeline/](../../examples/Pipeline) 目录下,目前有7种类型模型示例:
- [PaddleClas](../../examples/Pipeline/PaddleClas)
- [Detection](../../examples/Pipeline/PaddleDetection)
- [bert](../../examples/Pipeline/PaddleNLP/bert)
- [imagenet](../../examples/Pipeline/PaddleClas/imagenet)
- [imdb_model_ensemble](../../examples/Pipeline/imdb_model_ensemble)
- [ocr](../../examples/Pipeline/PaddleOCR/ocr)
- [simple_web_service](../../examples/Pipeline/simple_web_service)
以 imdb_model_ensemble 为例来展示如何使用 Pipeline Serving,相关代码在 `Serving/examples/Pipeline/imdb_model_ensemble` 文件夹下可以找到,例子中的 Server 端结构如下图所示:
<div align=center>
<img src='../images/pipeline_serving-image4.png' height = "200" align="middle"/>
</div>
### 3.1 Pipeline部署需要的文件
需要五类文件,其中模型文件、配置文件、服务端代码是构建Pipeline服务必备的三个文件。测试客户端和测试数据集为测试准备
- 模型文件
- 配置文件(config.yml)
- 服务级别:服务端口、gRPC线程数、服务超时、重试次数等
- DAG级别:资源类型、开启Trace、性能profile
- OP级别:模型路径、并发度、推理方式、计算硬件、推理超时、自动批量等
- 服务端(web_server.py)
- 服务级别:定义服务名称、读取配置文件、启动服务
- DAG级别:指定多OP之间的拓扑关系
- OP级别:重写OP前后处理
- 测试客户端
- 正确性校验
- 压力测试
- 测试数据集
- 图片、文本、语音等
### 3.2 获取模型文件
```shell
cd Serving/examples/Pipeline/imdb_model_ensemble
sh get_data.sh
python -m paddle_serving_server.serve --model imdb_cnn_model --port 9292 &> cnn.log &
python -m paddle_serving_server.serve --model imdb_bow_model --port 9393 &> bow.log &
```
PipelineServing 也支持本地自动启动 PaddleServingService,请参考 `Serving/examples/Pipeline/PaddleOCR/ocr` 下的例子。
<a name="3.5"></a>
### 3.3 创建config.yaml
本示例采用了brpc的client连接类型,还可以选择grpc或local_predictor。
```yaml
#rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1
rpc_port: 18070
#http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port
http_port: 18071
#worker_num, 最大并发数。当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG
#当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num
worker_num: 4
#build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG
build_dag_each_worker: False
dag:
#op资源类型, True, 为线程模型;False,为进程模型
is_thread_op: True
#重试次数
retry: 1
#使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用
use_profile: False
#channel的最大长度,默认为0
channel_size: 0
#tracer, 跟踪框架吞吐,每个OP和channel的工作情况。无tracer时不生成数据
tracer:
#每次trace的时间间隔,单位秒/s
interval_s: 10
op:
bow:
# 并发数,is_thread_op=True时,为线程并发;否则为进程并发
concurrency: 1
# client连接类型,brpc, grpc和local_predictor
client_type: brpc
# Serving交互重试次数,默认不重试
retry: 1
# Serving交互超时时间, 单位ms
timeout: 3000
# Serving IPs
server_endpoints: ["127.0.0.1:9393"]
# bow模型client端配置
client_config: "imdb_bow_client_conf/serving_client_conf.prototxt"
# Fetch结果列表,以client_config中fetch_var的alias_name为准
fetch_list: ["prediction"]
# 批量查询Serving的数量, 默认1。batch_size>1要设置auto_batching_timeout,否则不足batch_size时会阻塞
batch_size: 2
# 批量查询超时,与batch_size配合使用
auto_batching_timeout: 2000
cnn:
# 并发数,is_thread_op=True时,为线程并发;否则为进程并发
concurrency: 1
# client连接类型,brpc
client_type: brpc
# Serving交互重试次数,默认不重试
retry: 1
# 预测超时时间, 单位ms
timeout: 3000
# Serving IPs
server_endpoints: ["127.0.0.1:9292"]
# cnn模型client端配置
client_config: "imdb_cnn_client_conf/serving_client_conf.prototxt"
# Fetch结果列表,以client_config中fetch_var的alias_name为准
fetch_list: ["prediction"]
# 批量查询Serving的数量, 默认1。
batch_size: 2
# 批量查询超时,与batch_size配合使用
auto_batching_timeout: 2000
combine:
# 并发数,is_thread_op=True时,为线程并发;否则为进程并发
concurrency: 1
# Serving交互重试次数,默认不重试
retry: 1
# 预测超时时间, 单位ms
timeout: 3000
# 批量查询Serving的数量, 默认1。
batch_size: 2
# 批量查询超时,与batch_size配合使用
auto_batching_timeout: 2000
```
### 3.4 实现Server并启动服务
代码示例中,重点留意3个自定义Op的preprocess、postprocess处理,以及Combin Op初始化列表input_ops=[bow_op, cnn_op],设置Combin Op的前置OP列表。
**五.自定义业务错误类型**
```python
from paddle_serving_server.pipeline import Op, RequestOp, ResponseOp
from paddle_serving_server.pipeline import PipelineServer
from paddle_serving_server.pipeline.proto import pipeline_service_pb2
from paddle_serving_server.pipeline.channel import ChannelDataEcode
import numpy as np
from paddle_serving_app.reader import IMDBDataset
class ImdbRequestOp(RequestOp):
def init_op(self):
self.imdb_dataset = IMDBDataset()
self.imdb_dataset.load_resource('imdb.vocab')
def unpack_request_package(self, request):
dictdata = {}
for idx, key in enumerate(request.key):
if key != "words":
continue
words = request.value[idx]
word_ids, _ = self.imdb_dataset.get_words_and_label(words)
dictdata[key] = np.array(word_ids)
return dictdata
class CombineOp(Op):
def preprocess(self, input_data):
combined_prediction = 0
for op_name, data in input_data.items():
combined_prediction += data["prediction"]
data = {"prediction": combined_prediction / 2}
return data
read_op = ImdbRequestOp()
bow_op = Op(name="bow",
input_ops=[read_op],
server_endpoints=["127.0.0.1:9393"],
fetch_list=["prediction"],
client_config="imdb_bow_client_conf/serving_client_conf.prototxt",
concurrency=1,
timeout=-1,
retry=1)
cnn_op = Op(name="cnn",
input_ops=[read_op],
server_endpoints=["127.0.0.1:9292"],
fetch_list=["prediction"],
client_config="imdb_cnn_client_conf/serving_client_conf.prototxt",
concurrency=1,
timeout=-1,
retry=1)
combine_op = CombineOp(
name="combine",
input_ops=[bow_op, cnn_op],
concurrency=5,
timeout=-1,
retry=1)
# use default ResponseOp implementation
response_op = ResponseOp(input_ops=[combine_op])
server = PipelineServer()
server.set_response_op(response_op)
server.prepare_server('config.yml')
server.run_server()
```
### 3.5 推理测试
```python
from paddle_serving_client.pipeline import PipelineClient
import numpy as np
client = PipelineClient()
client.connect(['127.0.0.1:18080'])
words = 'i am very sad | 0'
futures = []
for i in range(3):
futures.append(
client.predict(
feed_dict={"words": words},
fetch=["prediction"],
asyn=True))
for f in futures:
res = f.result()
if res["ecode"] != 0:
print(res)
exit(1)
```
***
## 4.高阶用法
### 4.1 业务自定义错误类型
用户可根据业务场景自定义错误码,继承ProductErrCode,在Op的preprocess或postprocess中返回列表中返回,下一阶段处理会根据自定义错误码跳过后置OP处理。
用户可根据业务场景自定义错误码,继承 ProductErrCode,在 Op 的 preprocess 或 postprocess 中返回列表中返回,下一阶段处理会根据自定义错误码跳过后置OP处理。
```python
class ProductErrCode(enum.Enum):
"""
......@@ -576,259 +482,34 @@ class ProductErrCode(enum.Enum):
pass
```
### 4.2 跳过OP process阶段
preprocess返回结果列表的第二个结果是`is_skip_process=True`表示是否跳过当前OP的process阶段,直接进入postprocess处理
```python
def preprocess(self, input_dicts, data_id, log_id):
"""
In preprocess stage, assembling data for process stage. users can
override this function for model feed features.
Args:
input_dicts: input data to be preprocessed
data_id: inner unique id
log_id: global unique id for RTT
Return:
input_dict: data for process stage
is_skip_process: skip process stage or not, False default
prod_errcode: None default, otherwise, product errores occured.
It is handled in the same way as exception.
prod_errinfo: "" default
"""
# multiple previous Op
if len(input_dicts) != 1:
_LOGGER.critical(
self._log(
"Failed to run preprocess: this Op has multiple previous "
"inputs. Please override this func."))
os._exit(-1)
(_, input_dict), = input_dicts.items()
return input_dict, False, None, ""
```
### 4.3 自定义proto Request 和 Response结构
当默认proto结构不满足业务需求时,同时下面2个文件的proto的Request和Response message结构,保持一致。
> pipeline/gateway/proto/gateway.proto
> pipeline/proto/pipeline_service.proto
再重新编译Serving Server。
### 4.4 自定义URL
grpc gateway处理post请求,默认`method``prediction`,例如:127.0.0.1:8080/ocr/prediction。用户可自定义name和method,对于已有url的服务可无缝切换
```proto
service PipelineService {
rpc inference(Request) returns (Response) {
option (google.api.http) = {
post : "/{name=*}/{method=*}"
body : "*"
};
}
};
```
### 4.5 批量推理
Pipeline支持批量推理,通过增大batch size可以提高GPU利用率。Pipeline Pipeline Serving支持3种batch形式以及适用的场景如下:
- 场景1:一个推理请求包含批量数据(batch)
- 单条数据定长,批量变长,数据转成BCHW格式
- 单条数据变长,前处理中将单条数据做padding转成定长
- 场景2:一个推理请求的批量数据拆分成多个小块推理(mini-batch)
- 由于padding会按最长对齐,当一批数据中有个"极大"尺寸数据时会导致推理变慢
- 指定一个块大小,从而缩小"极大"尺寸数据的作用范围
- 场景3:合并多个请求数据批量推理(auto-batching)
- 推理耗时明显长于前后处理,合并多个请求数据推理一次会提高吞吐和GPU利用率
- 要求多个request的数据的shape一致
| 接口 | 说明 |
| :------------------------------------------: | :-----------------------------------------: |
| batch | client发送批量数据,client.predict的batch=True |
| mini-batch | preprocess按list类型返回,参考OCR示例 RecOp的preprocess|
| auto-batching | config.yml中OP级别设置batch_size和auto_batching_timeout |
### 4.6 单机多卡
单机多卡推理,M个OP进程与N个GPU卡绑定,在config.yml中配置3个参数有关系,首先选择进程模式、并发数即进程数,devices是GPU卡ID。绑定方法是进程启动时遍历GPU卡ID,例如启动7个OP进程,config.yml设置devices:0,1,2,那么第1,4,7个启动的进程与0卡绑定,第2,4个启动的进程与1卡绑定,3,6进程与卡2绑定。
- 进程ID: 0 绑定 GPU 卡0
- 进程ID: 1 绑定 GPU 卡1
- 进程ID: 2 绑定 GPU 卡2
- 进程ID: 3 绑定 GPU 卡0
- 进程ID: 4 绑定 GPU 卡1
- 进程ID: 5 绑定 GPU 卡2
- 进程ID: 6 绑定 GPU 卡0
config.yml中硬件配置:
```
#计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡
devices: "0,1,2"
```
### 4.7 异构硬件
Pipeline除了支持CPU、GPU之外,还支持在多种异构硬件部署。在config.yml中由device_type和devices。优先使用device_type指定类型,当空缺时根据devices判断。device_type描述如下:
- CPU(Intel) : 0
- GPU : 1
- TensorRT : 2
- CPU(Arm) : 3
- XPU : 4
config.yml中硬件配置:
```
#计算硬件类型: 空缺时由devices决定(CPU/GPU),0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
device_type: 0
#计算硬件ID,优先由device_type决定硬件类型。devices为""或空缺时为CPU预测;当为"0", "0,1,2"时为GPU预测,表示使用的GPU卡
devices: "" # "0,1"
```
### 4.8 低精度推理
Pipeline Serving支持低精度推理,CPU、GPU和TensoRT支持的精度类型如下图所示:
- CPU
- fp32(default)
- fp16
- bf16(mkldnn)
- GPU
- fp32(default)
- fp16
- int8
- Tensor RT
- fp32(default)
- fp16
- int8
使用int8时,要开启use_calib: True
参考[simple_web_service](../../examples/Pipeline/simple_web_service)示例
***
## 5.日志追踪
Pipeline服务日志在当前目录的PipelineServingLogs目录下,有3种类型日志,分别是pipeline.log日志、pipeline.log.wf日志、pipeline.tracer日志。
- `pipeline.log` : 记录 debug & info日志信息
- `pipeline.log.wf` : 记录 warning & error日志
- `pipeline.tracer` : 统计各个阶段耗时、channel堆积信息
在服务发生异常时,错误信息会记录在pipeline.log.wf日志中。打印tracer日志要求在config.yml的DAG属性中添加tracer配置。
### 5.1 log唯一标识
Pipeline中有2种id用以串联请求,分别时data_id和log_id,二者区别如下:
- data_id : Pipeline框架生成的自增ID,标记请求唯一性标识
- log_id : 上游模块传入的标识,跟踪多个服务间串联关系,由于用户可不传入或不保证唯一性,因此不能作为唯一性标识
通常,Pipeline框架打印的日志会同时带上data_id和log_id。开启auto-batching后,会使用批量中的第一个data_id标记batch整体,同时框架会在一条日志中打印批量中所有data_id。
### 5.2 日志滚动
Pipeline的日志模块在`logger.py`中定义,使用了`logging.handlers.RotatingFileHandler`支持磁盘日志文件的轮换。根据不同文件级别和日质量分别设置了`maxBytes``backupCount`,当即将超出预定大小时,将关闭旧文件并打开一个新文件用于输出。
其使用方法如下所示,定义了一种错误类型 `Product_Error` ,在 `preprocess` 函数返回值中设置错误信息,在 `postprocess` 函数中也可以设置。
```python
"handlers": {
"f_pipeline.log": {
"class": "logging.handlers.RotatingFileHandler",
"level": "INFO",
"formatter": "normal_fmt",
"filename": os.path.join(log_dir, "pipeline.log"),
"maxBytes": 512000000,
"backupCount": 20,
},
"f_pipeline.log.wf": {
"class": "logging.handlers.RotatingFileHandler",
"level": "WARNING",
"formatter": "normal_fmt",
"filename": os.path.join(log_dir, "pipeline.log.wf"),
"maxBytes": 512000000,
"backupCount": 10,
},
"f_tracer.log": {
"class": "logging.handlers.RotatingFileHandler",
"level": "INFO",
"formatter": "tracer_fmt",
"filename": os.path.join(log_dir, "pipeline.tracer"),
"maxBytes": 512000000,
"backupCount": 5,
},
},
```
***
## 6.性能分析与优化
### 6.1 如何通过 Timeline 工具进行优化
为了更好地对性能进行优化,PipelineServing 提供了 Timeline 工具,对整个服务的各个阶段时间进行打点。
### 6.2 在 Server 端输出 Profile 信息
Server 端用 yaml 中的 `use_profile` 字段进行控制:
```yaml
dag:
use_profile: true
```
开启该功能后,Server 端在预测的过程中会将对应的日志信息打印到标准输出,为了更直观地展现各阶段的耗时,提供 Analyst 模块对日志文件做进一步的分析处理。
使用时先将 Server 的输出保存到文件,以 `profile.txt` 为例,脚本将日志中的时间打点信息转换成 json 格式保存到 `trace` 文件,`trace` 文件可以通过 chrome 浏览器的 tracing 功能进行可视化。
```python
from paddle_serving_server.pipeline import Analyst
import json
import sys
if __name__ == "__main__":
log_filename = "profile.txt"
trace_filename = "trace"
analyst = Analyst(log_filename)
analyst.save_trace(trace_filename)
```
具体操作:打开 chrome 浏览器,在地址栏输入 `chrome://tracing/` ,跳转至 tracing 页面,点击 load 按钮,打开保存的 `trace` 文件,即可将预测服务的各阶段时间信息可视化。
### 6.3 在 Client 端输出 Profile 信息
Client 端在 `predict` 接口设置 `profile=True`,即可开启 Profile 功能。
开启该功能后,Client 端在预测的过程中会将该次预测对应的日志信息打印到标准输出,后续分析处理同 Server。
### 6.4 分析方法
根据pipeline.tracer日志中的各个阶段耗时,按以下公式逐步分析出主要耗时在哪个阶段。
```
单OP耗时:
op_cost = process(pre + mid + post)
OP期望并发数:
op_concurrency = 单OP耗时(s) * 期望QPS
服务吞吐量:
service_throughput = 1 / 最慢OP的耗时 * 并发数
服务平响:
service_avg_cost = ∑op_concurrency 【关键路径】
class ProductErrCode(enum.Enum):
"""
ProductErrCode is a base class for recording business error code.
product developers inherit this class and extend more error codes.
"""
Product_Error = 100001,
Channel堆积:
channel_acc_size = QPS(down - up) * time
def preprocess(self, input_dicts, data_id, log_id):
"""
In preprocess stage, assembling data for process stage. users can
override this function for model feed features.
Args:
input_dicts: input data to be preprocessed
data_id: inner unique id
log_id: global unique id for RTT
Return:
input_dict: data for process stage
is_skip_process: skip process stage or not, False default
prod_errcode: None default, otherwise, product errores occured.
It is handled in the same way as exception.
prod_errinfo: "" default
"""
(_, input_dict), = input_dicts.items()
if input_dict.get_key("product_error"):
return input_dict, False, Product_Error, "Product Error Occured"
return input_dict, False, None, ""
批量预测平均耗时:
avg_batch_cost = (N * pre + mid + post) / N
```
### 6.5 优化思路
根据长耗时在不同阶段,采用不同的优化方法.
- OP推理阶段(mid-process):
- 增加OP并发度
- 开启auto-batching(前提是多个请求的shape一致)
- 若批量数据中某条数据的shape很大,padding很大导致推理很慢,可使用mini-batch
- 开启TensorRT/MKL-DNN优化
- 开启低精度推理
- OP前处理阶段(pre-process):
- 增加OP并发度
- 优化前处理逻辑
- in/out耗时长(channel堆积>5)
- 检查channel传递的数据大小和延迟
- 优化传入数据,不传递数据或压缩后再传入
- 增加OP并发度
- 减少上游OP并发度
# Pipeline Serving
([简体中文](Pipeline_Design_CN.md)|English)
- [Architecture Design](Pipeline_Design_EN.md#1architecture-design)
- [Detailed Design](Pipeline_Design_EN.md#2detailed-design)
- [Classic Examples](Pipeline_Design_EN.md#3classic-examples)
- [Advanced Usages](Pipeline_Design_EN.md#4advanced-usages)
- [Log Tracing](Pipeline_Design_EN.md#5log-tracing)
- [Performance Analysis And Optimization](Pipeline_Design_EN.md#6performance-analysis-and-optimization)
In many deep learning frameworks, Serving is usually used for the deployment of single model.but in the context of AI industrial, the end-to-end deep learning model can not solve all the problems at present. Usually, it is necessary to use multiple deep learning models to solve practical problems.However, the design of multi-model applications is complicated. In order to reduce the difficulty of development and maintenance, and to ensure the availability of services, serial or simple parallel methods are usually used. In general, the throughput only reaches the usable state and the GPU utilization rate is low.
Paddle Serving provides a user-friendly programming framework for multi-model composite services, Pipeline Serving, which aims to reduce the threshold of programming, improve resource utilization (especially GPU), and improve the prediction efficiency.
## 1.Architecture Design
The Server side is built based on <b>RPC Service</b> and <b>graph execution engine</b>. The relationship between them is shown in the following figure.
<div align=center>
<img src='../images/pipeline_serving-image1.png' height = "250" align="middle"/>
</div>
### 1.1 RPC Service
In order to meet the needs of different users, the RPC service starts one Web server and one RPC server at the same time, and can process 2 types of requests, RESTful API and gRPC.The gPRC gateway receives RESTful API requests and forwards requests to the gRPC server through the reverse proxy server; gRPC requests are received by the gRPC server, so the two types of requests are processed by the gRPC Service in a unified manner to ensure that the processing logic is consistent.
#### <b>1.1.1 Request and Respose of proto</b>
gRPC service and gRPC gateway service are generated with service.proto.
```proto
message Request {
repeated string key = 1;
repeated string value = 2;
optional string name = 3;
optional string method = 4;
optional int64 logid = 5;
optional string clientip = 6;
};
message Response {
optional int32 err_no = 1;
optional string err_msg = 2;
repeated string key = 3;
repeated string value = 4;
};
```
The `key` and `value` in the Request are paired string arrays. The `name` and `method` correspond to the URL of the RESTful API://{ip}:{port}/{name}/{method}.The `logid` and `clientip` are convenient for users to connect service-level requests and customize strategies.
In Response, `err_no` and `err_msg` express the correctness and error information of the processing result, and `key` and `value` are the returned results.
### 1.2 Graph Execution Engine
The graph execution engine consists of OPs and Channels, and the connected OPs share one Channel.
- Channel can be understood as a buffer queue. Each OP accepts only one Channel input and multiply Channel outputs (each output is the same); a Channel can contain outputs from multiple OPs, and data from the same Channel can be used as input for multiple OPs.
- Users only need to define relationships between OPs. Graph engine will analyze the dependencies of the entire graph and declaring Channels at the compile time.
- After Request data enters the graph execution engine service, the graph engine will generator an Request ID, and Reponse is returned through corresponding Request ID.
- For cases where large data needs to be transferred between OPs, consider RAM DB external memory for global storage and data transfer by passing index keys in Channel.
<div align=center>
<img src='../images/pipeline_serving-image2.png' height = "300" align="middle"/>
</div>
#### <b>1.2.1 OP Design</b>
- The default function of a single OP is to access a single Paddle Serving Service based on the input Channel data and put the result into the output Channel.
- OP supports user customization, including preprocess, process, postprocess functions that can be inherited and implemented by the user.
- OP can set the number of concurrencies to increase the number of concurrencies processed.
- OP can obtain data from multiple different RPC requests for Auto-Batching.
- OP can be started by a thread or process.
#### <b>1.2.2 Channel Design</b>
- Channel is the data structure for sharing data between OPs, responsible for sharing data or sharing data status information.
- Outputs from multiple OPs can be stored in the same Channel, and data from the same Channel can be used by multiple OPs.
- The following illustration shows the design of Channel in the graph execution engine, using input buffer and output buffer to align data between multiple OP inputs and multiple OP outputs, with a queue in the middle to buffer.
<div align=center>
<img src='../images/pipeline_serving-image3.png' height = "500" align="middle"/>
</div>
#### <b>1.2.3 client type design</b>
- Prediction type (client_type) of Op has 3 types, brpc, grpc and local_predictor
- brpc: Using bRPC Client to interact with remote Serving by network, performance is better than grpc.
- grpc: Using gRPC Client to interact with remote Serving by network, cross-platform deployment supported.
- local_predictor: Load the model and predict in the local service without interacting with the network. Support multi-card deployment, and TensorRT prediction.
- Selection:
- Time cost(lower is better): local_predict < brpc <= grpc
- Microservice: Split the brpc or grpc model into independent services, simplify development and deployment complexity, and improve resource utilization
#### <b>1.2.4 Extreme Case Consideration</b>
- `Request timeout`
The entire graph execution engine may time out at every step. The graph execution engine controls the time out by setting `timeout` value. Requests that time out at any step will return a timeout response.
- `Channel stores too much data`
Channels may store too much data, causing copy time to be too high. Graph execution engines can store OP calculation results in external memory, such as high-speed memory KV systems.
- `Whether input buffers and output buffers in Channel will increase indefinitely`
- It will not increase indefinitely. The input to the entire graph execution engine is placed inside a Channel's internal queue, directly acting as a traffic control buffer queue for the entire service.
- For input buffer, adjust the number of concurrencies of OP1 and OP2 according to the amount of computation, so that the number of input buffers from each input OP is relatively balanced. (The length of the input buffer depends on the speed at which each item in the internal queue is ready)
- For output buffer, you can use a similar process as input buffer, which adjusts the concurrency of OP3 and OP4 to control the buffer length of output buffer. (The length of the output buffer depends on the speed at which downstream OPs obtain data from the output buffer)
- The amount of data in the Channel will not exceed `worker_num` of gRPC, that is, it will not exceed the thread pool size.
***
## 2.Detailed Design
For the design and implementation of Pipeline, first introduce PipelineServer, OP, pre- and post-processing of rewriting OP, and finally introduce the secondary development method of specific OP (RequestOp and ResponseOp).
### 2.1 PipelineServer Definition
PipelineServer encapsulates the RPC runtime layer and graph engine execution. All Pipeline services must first instantiate the PipelineServer example, then set up two core steps, set response op and load configuration information, and finally call run_server to start the service. The code example is as follows:
```python
server = PipelineServer()
server.set_response_op(response_op)
server.prepare_server(config_yml_path)
#server.prepare_pipeline_config(config_yml_path)
server.run_server()
```
The core interface of PipelineServer:
- `set_response_op`: setting response_op will initialize the Channel according to the topological relationship of each OP and build a calculation graph.
- `prepare_server`: load configuration information, and start remote Serving service, suitable for calling remote remote reasoning service.
- `prepare_pipeline_config`: only load configuration information, applicable to local_prdict
- `run_server`: start gRPC service, receive request.
### 2.2 OP Definition
As the basic unit of graph execution engine, the general OP constructor is as follows:
```python
def __init__(name=None,
input_ops=[],
server_endpoints=[],
fetch_list=[],
client_config=None,
client_type=None,
concurrency=1,
timeout=-1,
retry=1,
batch_size=1,
auto_batching_timeout=None,
local_service_handler=None)
```
The meaning of each parameter is as follows:
| Parameter | Meaning |
| :-------------------: | :----------------------------------------------------------: |
| name | (str) String used to identify the OP type, which must be globally unique. |
| input_ops | (list) A list of all previous OPs of the current Op. |
| server_endpoints | (list) List of endpoints for remote Paddle Serving Service. If this parameter is not set,it is considered as local_precditor mode, and the configuration is read from local_service_conf |
| fetch_list | (list) List of fetch variable names for remote Paddle Serving Service. |
| client_config | (str) The path of the client configuration file corresponding to the Paddle Serving Service. |
| client_type | (str)brpc, grpc or local_predictor. local_predictor does not start the Serving service, in-process prediction|
| concurrency | (int) The number of concurrent OPs. |
| timeout | (int) The timeout time of the process operation, in ms. If the value is less than zero, no timeout is considered. |
| retry | (int) Timeout number of retries. When the value is 1, no retries are made. |
| batch_size | (int) The expected batch_size of Auto-Batching, since building batches may time out, the actual batch_size may be less than the set value. |
| auto_batching_timeout | (float) Timeout for building batches of Auto-Batching (the unit is ms). When batch_size> 1, auto_batching_timeout should be set, otherwise the waiting will be blocked when the number of requests is insufficient for batch_size|
| local_service_handler | (object) local predictor handler,assigned by Op init() input parameters or created in Op init()|
### 2.3 Rewrite preprocess and postprocess of OP
| Interface or Variable | Explain |
| :----------------------------------------------: | :----------------------------------------------------------: |
| def preprocess(self, input_dicts) | Process the data obtained from the channel, and the processed data will be used as the input of the **process** function. (This function handles a **sample**) |
| def process(self, feed_dict_list, typical_logid) | The RPC prediction process is based on the Paddle Serving Client, and the processed data will be used as the input of the **postprocess** function. (This function handles a **batch**) |
| def postprocess(self, input_dicts, fetch_dict) | After processing the prediction results, the processed data will be put into the subsequent Channel to be obtained by the subsequent OP. (This function handles a **sample**) |
| def init_op(self) | Used to load resources (such as word dictionary). |
| self.concurrency_idx | Concurrency index of current process(not thread) (different kinds of OP are calculated separately). |
In a running cycle, OP will execute three operations: preprocess, process, and postprocess (when the `server_endpoints` parameter is not set, the process operation is not executed). Users can rewrite these three functions. The default implementation is as follows:
```python
def preprocess(self, input_dicts):
# multiple previous Op
if len(input_dicts) != 1:
raise NotImplementedError(
'this Op has multiple previous inputs. Please override this func.'
(_, input_dict), = input_dicts.items()
return input_dict
def process(self, feed_dict_list, typical_logid):
err, err_info = ChannelData.check_batch_npdata(feed_dict_list)
if err != 0:
raise NotImplementedError(
"{} Please override preprocess func.".format(err_info))
call_result = self.client.predict(
feed=feed_dict_list, fetch=self._fetch_names, log_id=typical_logid)
if isinstance(self.client, MultiLangClient):
if call_result is None or call_result["serving_status_code"] != 0:
return None
call_result.pop("serving_status_code")
return call_result
def postprocess(self, input_dicts, fetch_dict):
return fetch_dict
```
The parameter of **preprocess** is the data `input_dicts` in the previous Channel. This variable (as a **sample**) is a dictionary with the name of the previous OP as key and the output of the corresponding OP as value.
The parameter of **process** is the input variable `fetch_dict_list` (a list of the return value of the preprocess function) of the Paddle Serving Client prediction interface. This variable (as a **batch**) is a list of dictionaries with feed_name as the key and the data in the ndarray format as the value. `typical_logid` is used as the logid that penetrates to PaddleServingService.
The parameters of **postprocess** are `input_dicts` and `fetch_dict`. `input_dicts` is consistent with the parameter of preprocess, and `fetch_dict` (as a **sample**) is a sample of the return batch of the process function (if process is not executed, this value is the return value of preprocess).
Users can also rewrite the **init_op** function to load some custom resources (such as word dictionary). The default implementation is as follows:
```python
def init_op(self):
pass
```
It should be **noted** that in the threaded version of OP, each OP will only call this function once, so the loaded resources must be thread safe.
### 2.4 RequestOp Definition and Secondary Development Interface
RequestOp is used to process RPC data received by Pipeline Server, and the processed data will be added to the graph execution engine. Its constructor is as follows:
```python
class RequestOp(Op):
def __init__(self):
# PipelineService.name = "@DAGExecutor"
super(RequestOp, self).__init__(name="@DAGExecutor", input_ops=[])
# init op
try:
self.init_op()
except Exception as e:
_LOGGER.critical("Op(Request) Failed to init: {}".format(e))
os._exit(-1)
def unpack_request_package(self, request):
dict_data = {}
log_id = None
if request is None:
_LOGGER.critical("request is None")
raise ValueError("request is None")
for idx, key in enumerate(request.key):
dict_data[key] = request.value[idx]
log_id = request.logid
_LOGGER.info("RequestOp unpack one request. log_id:{}, clientip:{} \
name:{}, method:{}".format(log_id, request.clientip, request.name,
request.method))
return dict_data, log_id, None, ""
```
The default implementation of **unpack_request_package** is to make the key and value in RPC request into a dictionary.When the default RequestOp cannot meet the parameter parsing requirements, you can customize the request parameter parsing method by rewriting the following two interfaces.The return value is required to be a dictionary type.
| Interface or Variable | Explain |
| :---------------------------------------: | :----------------------------------------------------------: |
| def init_op(self) | It is used to load resources (such as dictionaries), and is consistent with general OP. |
| def unpack_request_package(self, request) | Process received RPC data. |
### 2.5 ResponseOp Definition and Secondary Development Interface
ResponseOp is used to process the prediction results of the graph execution engine. The processed data will be used as the RPC return value of Pipeline Server. Its constructor is as follows:
```python
class RequestOp(Op):
def __init__(self):
# PipelineService.name = "@DAGExecutor"
super(RequestOp, self).__init__(name="@DAGExecutor", input_ops=[])
# init op
try:
self.init_op()
except Exception as e:
_LOGGER.critical("Op(Request) Failed to init: {}".format(e))
os._exit(-1)
def unpack_request_package(self, request):
dict_data = {}
log_id = None
if request is None:
_LOGGER.critical("request is None")
raise ValueError("request is None")
for idx, key in enumerate(request.key):
dict_data[key] = request.value[idx]
log_id = request.logid
_LOGGER.info("RequestOp unpack one request. log_id:{}, clientip:{} \
name:{}, method:{}".format(log_id, request.clientip, request.name,
request.method))
return dict_data, log_id, None, ""
```
The default implementation of **pack_response_package** is to convert the dictionary of prediction results into key and value in RPC response.When the default ResponseOp cannot meet the requirements of the result return format, you can customize the return package packaging method by rewriting the following two interfaces.
| Interface or Variable | Explain |
| :------------------------------------------: | :----------------------------------------------------------: |
| def init_op(self) | It is used to load resources (such as dictionaries), and is consistent with general OP. |
| def pack_response_package(self, channeldata) | Process the prediction results of the graph execution engine as the return of RPC. |
***
## 3.Classic Examples
All examples of pipelines are in [examples/pipeline/](../../examples/Pipeline) directory, There are 7 types of model examples currently:
- [PaddleClas](../../examples/Pipeline/PaddleClas)
- [Detection](../../examples/Pipeline/PaddleDetection)
- [bert](../../examples/Pipeline/PaddleNLP/bert)
- [imagenet](../../examples/Pipeline/PaddleClas/imagenet)
- [imdb_model_ensemble](../../examples/Pipeline/imdb_model_ensemble)
- [ocr](../../examples/Pipeline/PaddleOCR/ocr)
- [simple_web_service](../../examples/Pipeline/simple_web_service)
Here, we build a simple imdb model enable example to show how to use Pipeline Serving. The relevant code can be found in the `Serving/examples/Pipeline/imdb_model_ensemble` folder. The Server-side structure in the example is shown in the following figure:
<div align=center>
<img src='../images/pipeline_serving-image4.png' height = "200" align="middle"/>
</div>
### 3.1 Files required for pipeline deployment
Five types of files are needed, of which model files, configuration files, and server code are the three necessary files for building a Pipeline service. Test client and test data set are prepared for testing.
- model files
- configure files(config.yml)
- service level: Service port, thread number, service timeout, retry, etc.
- DAG level: Resource type, enable Trace, performance profile, etc.
- OP level: Model path, concurrency, client type, device type, automatic batching, etc.
- Server files(web_server.py)
- service level: Define service name, read configuration file, start service, etc.
- DAG level: Topological relationship between OPs.
- OP level: Rewrite preprocess and postprocess of OP.
- Test client files
- Correctness check
- Performance testing
- Test data set
- pictures, texts, voices, etc.
### 3.2 Get model files
```shell
cd Serving/examples/Pipeline/imdb_model_ensemble
sh get_data.sh
python -m paddle_serving_server.serve --model imdb_cnn_model --port 9292 &> cnn.log &
python -m paddle_serving_server.serve --model imdb_bow_model --port 9393 &> bow.log &
```
PipelineServing also supports local automatic startup of PaddleServingService. Please refer to the example `Serving/examples/Pipeline/PaddleOCR/ocr`.
### 3.3 Create config.yaml
This example uses the client connection type of brpc, and you can also choose grpc or local_predictor.
```yaml
#rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1
rpc_port: 18070
#http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port
http_port: 18071
#worker_num, 最大并发数。当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG
#当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num
worker_num: 4
#build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG
build_dag_each_worker: False
dag:
#op资源类型, True, 为线程模型;False,为进程模型
is_thread_op: True
#重试次数
retry: 1
#使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用
use_profile: False
#channel的最大长度,默认为0
channel_size: 0
#tracer, 跟踪框架吞吐,每个OP和channel的工作情况。无tracer时不生成数据
tracer:
#每次trace的时间间隔,单位秒/s
interval_s: 10
op:
bow:
# 并发数,is_thread_op=True时,为线程并发;否则为进程并发
concurrency: 1
# client连接类型,brpc, grpc和local_predictor
client_type: brpc
# Serving交互重试次数,默认不重试
retry: 1
# Serving交互超时时间, 单位ms
timeout: 3000
# Serving IPs
server_endpoints: ["127.0.0.1:9393"]
# bow模型client端配置
client_config: "imdb_bow_client_conf/serving_client_conf.prototxt"
# Fetch结果列表,以client_config中fetch_var的alias_name为准
fetch_list: ["prediction"]
# 批量查询Serving的数量, 默认1。batch_size>1要设置auto_batching_timeout,否则不足batch_size时会阻塞
batch_size: 2
# 批量查询超时,与batch_size配合使用
auto_batching_timeout: 2000
cnn:
# 并发数,is_thread_op=True时,为线程并发;否则为进程并发
concurrency: 1
# client连接类型,brpc
client_type: brpc
# Serving交互重试次数,默认不重试
retry: 1
# 预测超时时间, 单位ms
timeout: 3000
# Serving IPs
server_endpoints: ["127.0.0.1:9292"]
# cnn模型client端配置
client_config: "imdb_cnn_client_conf/serving_client_conf.prototxt"
# Fetch结果列表,以client_config中fetch_var的alias_name为准
fetch_list: ["prediction"]
# 批量查询Serving的数量, 默认1。
batch_size: 2
# 批量查询超时,与batch_size配合使用
auto_batching_timeout: 2000
combine:
# 并发数,is_thread_op=True时,为线程并发;否则为进程并发
concurrency: 1
# Serving交互重试次数,默认不重试
retry: 1
# 预测超时时间, 单位ms
timeout: 3000
# 批量查询Serving的数量, 默认1。
batch_size: 2
# 批量查询超时,与batch_size配合使用
auto_batching_timeout: 2000
```
### 3.4 Start PipelineServer
Run the following code
```python
from paddle_serving_server.pipeline import Op, RequestOp, ResponseOp
from paddle_serving_server.pipeline import PipelineServer
from paddle_serving_server.pipeline.proto import pipeline_service_pb2
from paddle_serving_server.pipeline.channel import ChannelDataEcode
import numpy as np
from paddle_serving_app.reader import IMDBDataset
class ImdbRequestOp(RequestOp):
def init_op(self):
self.imdb_dataset = IMDBDataset()
self.imdb_dataset.load_resource('imdb.vocab')
def unpack_request_package(self, request):
dictdata = {}
for idx, key in enumerate(request.key):
if key != "words":
continue
words = request.value[idx]
word_ids, _ = self.imdb_dataset.get_words_and_label(words)
dictdata[key] = np.array(word_ids)
return dictdata
class CombineOp(Op):
def preprocess(self, input_data):
combined_prediction = 0
for op_name, data in input_data.items():
_LOGGER.info("{}: {}".format(op_name, data["prediction"]))
combined_prediction += data["prediction"]
data = {"prediction": combined_prediction / 2}
return data
read_op = ImdbRequestOp()
bow_op = Op(name="bow",
input_ops=[read_op],
server_endpoints=["127.0.0.1:9393"],
fetch_list=["prediction"],
client_config="imdb_bow_client_conf/serving_client_conf.prototxt",
concurrency=1,
timeout=-1,
retry=1)
cnn_op = Op(name="cnn",
input_ops=[read_op],
server_endpoints=["127.0.0.1:9292"],
fetch_list=["prediction"],
client_config="imdb_cnn_client_conf/serving_client_conf.prototxt",
concurrency=1,
timeout=-1,
retry=1)
combine_op = CombineOp(
name="combine",
input_ops=[bow_op, cnn_op],
concurrency=5,
timeout=-1,
retry=1)
# use default ResponseOp implementation
response_op = ResponseOp(input_ops=[combine_op])
server = PipelineServer()
server.set_response_op(response_op)
server.prepare_server('config.yml')
server.run_server()
```
### 3.5 Perform prediction through PipelineClient
```python
from paddle_serving_client.pipeline import PipelineClient
import numpy as np
client = PipelineClient()
client.connect(['127.0.0.1:18080'])
words = 'i am very sad | 0'
futures = []
for i in range(3):
futures.append(
client.predict(
feed_dict={"words": words},
fetch=["prediction"],
asyn=True))
for f in futures:
res = f.result()
if res["ecode"] != 0:
print(res)
exit(1)
```
***
## 4.Advanced Usages
### 4.1 Business custom error type
Users can customize the error code according to the business, inherit ProductErrCode, and return it in the return list in Op's preprocess or postprocess. The next stage of processing will skip the post OP processing based on the custom error code.
```python
class ProductErrCode(enum.Enum):
"""
ProductErrCode is a base class for recording business error code.
product developers inherit this class and extend more error codes.
"""
pass
```
### 4.2 Skip process stage
The 2rd result of the result list returned by preprocess is `is_skip_process=True`, indicating whether to skip the process stage of the current OP and directly enter the postprocess processing
```python
def preprocess(self, input_dicts, data_id, log_id):
"""
In preprocess stage, assembling data for process stage. users can
override this function for model feed features.
Args:
input_dicts: input data to be preprocessed
data_id: inner unique id
log_id: global unique id for RTT
Return:
input_dict: data for process stage
is_skip_process: skip process stage or not, False default
prod_errcode: None default, otherwise, product errores occured.
It is handled in the same way as exception.
prod_errinfo: "" default
"""
# multiple previous Op
if len(input_dicts) != 1:
_LOGGER.critical(
self._log(
"Failed to run preprocess: this Op has multiple previous "
"inputs. Please override this func."))
os._exit(-1)
(_, input_dict), = input_dicts.items()
return input_dict, False, None, ""
```
### 4.3 Custom proto Request and Response
When the default proto structure does not meet the business requirements, at the same time, the Request and Response message structures of the proto in the following two files remain the same.
> pipeline/gateway/proto/gateway.proto
> pipeline/proto/pipeline_service.proto
Recompile Serving Server again.
### 4.4 Custom URL
The grpc gateway processes post requests. The default `method` is `prediction`, for example: 127.0.0.1:8080/ocr/prediction. Users can customize the name and method, and can seamlessly switch services with existing URLs.
```proto
service PipelineService {
rpc inference(Request) returns (Response) {
option (google.api.http) = {
post : "/{name=*}/{method=*}"
body : "*"
};
}
};
```
### 4.5 Batch predictor
Pipeline supports batch predictor, and GPU utilization can be improved by increasing the batch size. Pipeline Serving supports 3 Pipeline Serving supports 3 batch forms and applicable scenarios are as follows:
- case 1: An inference request contains batch data (batch)
- The data is of fixed length, the batch is variable, and the data is converted into BCHW format
- The data length is variable. In the pre-processing, a single piece of data is padding converted into a fixed length
- case 2: Split the batch data of a inference request into multiple small pieces of data (mini-batch)
- Since padding will be aligned at the longest shape, when there is a "extremely large" shape size data in a batch of data, the padding is very large.
- Specify the size of a block to reduce the scope of the "extremely large" size data
- case 3: Merge multiple requests for one batch(auto-batching)
- Inference time is significantly longer than preprocess and postprocess. Merge multiple request data and inference at one time will increase throughput and GPU utilization.
- The shape of the data of multiple requests is required to be consistent
| Interfaces | Explain |
| :------------------------------------------: | :-----------------------------------------: |
| batch | client send batch data,set batch=True of client.predict |
| mini-batch | the return type of preprocess is list,refer to the preprocess of RecOp in OCR example|
| auto-batching | set batch_size and auto_batching_timeout in config.yml |
### 4.6 Single-machine and multi-card inference
Single-machine multi-card inference can be abstracted into M OP processes bound to N GPU cards. It is related to the configuration of three parameters in config.yml. First, select the process mode, the number of concurrent processes is the number of processes, and devices is the GPU card ID.The binding method is to traverse the GPU card ID when the process starts, for example, start 7 OP processes, set devices:0,1,2 in config.yml, then the first, fourth, and seventh started processes are bound to the 0 card, and the second , 4 started processes are bound to 1 card, 3 and 6 processes are bound to card 2.
- PROCESS ID: 0 binds GPU card 0
- PROCESS ID: 1 binds GPU card 1
- PROCESS ID: 2 binds GPU card 2
- PROCESS ID: 3 binds GPU card 0
- PROCESS ID: 4 binds GPU card 1
- PROCESS ID: 5 binds GPU card 2
- PROCESS ID: 6 binds GPU card 0
Reference config.yml:
```
#计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡
devices: "0,1,2"
```
### 4.7 Heterogeneous Devices
In addition to supporting CPU and GPU, Pipeline also supports the deployment of a variety of heterogeneous hardware. It consists of device_type and devices in config.yml. Use device_type to specify the type first, and judge according to devices when it is vacant. The device_type is described as follows:
- CPU(Intel) : 0
- GPU : 1
- TensorRT : 2
- CPU(Arm) : 3
- XPU : 4
Reference config.yml:
```
#计算硬件类型: 空缺时由devices决定(CPU/GPU),0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
device_type: 0
#计算硬件ID,优先由device_type决定硬件类型。devices为""或空缺时为CPU预测;当为"0", "0,1,2"时为GPU预测,表示使用的GPU卡
devices: "" # "0,1"
```
### 4.8 Low precision inference
Pipeline Serving supports low-precision inference. The precision types supported by CPU, GPU and TensoRT are shown in the figure below:
- CPU
- fp32(default)
- fp16
- bf16(mkldnn)
- GPU
- fp32(default)
- fp16
- int8
- Tensor RT
- fp32(default)
- fp16
- int8
Reference the example [simple_web_service](../../examples/Pipeline/simple_web_service).
***
## 5.Log Tracing
Pipeline service logs are under the `PipelineServingLogs` directory of the current directory. There are 3 types of logs, namely `pipeline.log`, `pipeline.log.wf`, and `pipeline.tracer`.
- pipeline.log : Record debug & info level log
- pipeline.log.wf : Record warning & error level log
- pipeline.tracer : Statistics the time-consuming and channel accumulation information in each stage
When an exception occurs in the service, the error message will be recorded in the file `pipeline.log.wf`. Printing the tracer log requires adding the tracer configuration in the DAG property of `config.yml`.
### 5.1 Log uniquely id
There are two kinds of IDs in the pipeline for concatenating requests, `data_id` and `log_id` respectively. The difference between the two is as follows:
- `data_id`: The self-incrementing ID generated by the pipeline framework, marking the unique identification of the request.
- `log_id`: The identifier passed in by the upstream module tracks the serial relationship between multiple services. Since users may not pass in or guarantee uniqueness, it cannot be used as a unique identifier.
The log printed by the Pipeline framework will carry both data_id and log_id. After auto-batching is turned on, the first `data_id` in the batch will be used to mark the whole batch, and the framework will print all data_ids in the batch in a log.
### 5.2 Log Rotating
Log module of Pipeline Serving is defined in file `logger.py`.`logging.handlers.RotatingFileHandler` is used to support the rotation of disk log files. Set `maxBytes` and `backupCount` according to different file levels and daily quality. When the predetermined size is about to be exceeded, the old file will be closed and a new file will be opened for output.
```python
"handlers": {
"f_pipeline.log": {
"class": "logging.handlers.RotatingFileHandler",
"level": "INFO",
"formatter": "normal_fmt",
"filename": os.path.join(log_dir, "pipeline.log"),
"maxBytes": 512000000,
"backupCount": 20,
},
"f_pipeline.log.wf": {
"class": "logging.handlers.RotatingFileHandler",
"level": "WARNING",
"formatter": "normal_fmt",
"filename": os.path.join(log_dir, "pipeline.log.wf"),
"maxBytes": 512000000,
"backupCount": 10,
},
"f_tracer.log": {
"class": "logging.handlers.RotatingFileHandler",
"level": "INFO",
"formatter": "tracer_fmt",
"filename": os.path.join(log_dir, "pipeline.tracer"),
"maxBytes": 512000000,
"backupCount": 5,
},
},
```
***
## 6.Performance analysis and optimization
### 6.1 How to optimize with the timeline tool
In order to better optimize the performance, PipelineServing provides a timeline tool to monitor the time of each stage of the whole service.
### 6.2 Output profile information on server side
The server is controlled by the `use_profile` field in yaml:
```yaml
dag:
use_profile: true
```
After the function is enabled, the server will print the corresponding log information to the standard output in the process of prediction. In order to show the time consumption of each stage more intuitively, Analyst module is provided for further analysis and processing of log files.
The output of the server is first saved to a file. Taking `profile.txt` as an example, the script converts the time monitoring information in the log into JSON format and saves it to the `trace` file. The `trace` file can be visualized through the tracing function of Chrome browser.
```shell
from paddle_serving_server.pipeline import Analyst
import json
import sys
if __name__ == "__main__":
log_filename = "profile.txt"
trace_filename = "trace"
analyst = Analyst(log_filename)
analyst.save_trace(trace_filename)
```
Specific operation: open Chrome browser, input in the address bar `chrome://tracing/` , jump to the tracing page, click the load button, open the saved `trace` file, and then visualize the time information of each stage of the prediction service.
### 6.3 Output profile information on client side
The profile function can be enabled by setting `profile=True` in the `predict` interface on the client side.
After the function is enabled, the client will print the log information corresponding to the prediction to the standard output during the prediction process, and the subsequent analysis and processing are the same as that of the server.
### 6.4 Analytical methods
According to the time consumption of each stage in the pipeline.tracer log, the following formula is used to gradually analyze which stage is the main time consumption.
```
cost of one single OP:
op_cost = process(pre + mid + post)
OP Concurrency:
op_concurrency = op_cost(s) * qps_expected
Service throughput:
service_throughput = 1 / slowest_op_cost * op_concurrency
Service average cost:
service_avg_cost = ∑op_concurrency in critical Path
Channel accumulations:
channel_acc_size = QPS(down - up) * time
Average cost of batch predictor:
avg_batch_cost = (N * pre + mid + post) / N
```
### 6.5 Optimization ideas
According to the long time consuming in stages below, different optimization methods are adopted.
- OP Inference stage(mid-process):
- Increase `concurrency`
- Turn on `auto-batching`(Ensure that the shapes of multiple requests are consistent)
- Use `mini-batch`, If the shape of data is very large.
- Turn on TensorRT for GPU
- Turn on MKLDNN for CPU
- Turn on low precison inference
- OP preprocess or postprocess stage:
- Increase `concurrency`
- Optimize processing logic
- In/Out stage(channel accumulation > 5):
- Check the size and delay of the data passed by the channel
- Optimize the channel to transmit data, do not transmit data or compress it before passing it in
- Increase `concurrency`
- Decrease `concurrency` upstreams.
# Python Pipeline 核心功能
从设计上,Python Pipeline 框架实现轻量级的服务化部署,提供了丰富的核心功能,既能满足服务基本使用,又能满足特性需求。
- [安装与环境检查](#1)
- [服务启动与关闭](#2)
- [本地与远程推理](#3)
- [批量推理](#4)
- [4.1 客户端打包批量数据](#4.1)
- [4.2 服务端合并多个请求动态合并批量](#4.2)
- [4.3 Mini-Batch](#4.3)
- [单机多卡推理](#5)
- [多种计算芯片上推理](#6)
- [TensorRT 推理加速](#7)
- [MKLDNN 推理加速](#8)
- [低精度推理](#9)
- [9.1 CPU 低精度推理](#9.1)
- [9.2 GPU 和 TensorRT 低精度推理](#9.2)
- [9.3 性能测试](#9.3)
- [复杂图结构 DAG 跳过某个 Op 运行](#10)
<a name="1"></a>
## 安装与环境检查
在运行 Python Pipeline 服务前,确保当前环境下可部署且通过[安装指南](../Install_CN.md)已完成安装。其次,`v0.8.0`及以上版本提供了环境检查功能,检验环境是否安装正确。
输入以下命令,进入环境检查程序。
```python
python3 -m paddle_serving_server.serve check
```
在环境检验程序中输入多条指令来检查,例如 `check_pipeline``check_all`等,完整指令列表如下。
| 指令 | 描述|
|---------|----|
|check_all | 检查 Paddle Inference、Pipeline Serving、C++ Serving。只打印检测结果,不记录日志|
|check_pipeline | 检查 Pipeline Serving,只打印检测结果,不记录日志|
|check_cpp | 检查 C++ Serving,只打印检测结果,不记录日志|
|check_inference | 检查 Paddle Inference 是否安装正确,只打印检测结果,不记录日志|
|debug | 发生报错后,该命令将打印提示日志到屏幕,并记录详细日志文件|
|exit | 退出|
程序会分别运行 cpu 和 gpu 示例。运行成功则打印 `Pipeline cpu environment running success
``Pipeline gpu environment running success`
```
/usr/local/lib/python3.7/runpy.py:125: RuntimeWarning: 'paddle_serving_server.serve' found in sys.modules after import of package 'paddle_serving_server', but prior to execution of 'paddle_serving_server.serve'; this may result in unpredictable behaviour
warn(RuntimeWarning(msg))
Welcome to the check env shell.Type help to list commands.
(Cmd) check_pipeline
Pipeline cpu environment running success
Pipeline gpu environment running success
```
运行失败时,错误信息会记录到当前目录下 `stderr.log` 文件 和 `Pipeline_test_cpu/PipelineServingLogs` 目录下。用户可根据错误信息调试。
```
(Cmd) check_all
PaddlePaddle inference environment running success
C++ cpu environment running success
C++ gpu environment running failure, if you need this environment, please refer to https://github.com/PaddlePaddle/Serving/blob/develop/doc/Install_CN.md
Traceback (most recent call last):
File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.7/site-packages/paddle_serving_server/serve.py", line 541, in <module>
Check_Env_Shell().cmdloop()
File "/usr/local/lib/python3.7/cmd.py", line 138, in cmdloop
stop = self.onecmd(line)
File "/usr/local/lib/python3.7/cmd.py", line 217, in onecmd
return func(arg)
File "/usr/local/lib/python3.7/site-packages/paddle_serving_server/serve.py", line 501, in do_check_all
check_env("all")
File "/usr/local/lib/python3.7/site-packages/paddle_serving_server/env_check/run.py", line 94, in check_env
run_test_cases(pipeline_test_cases, "Pipeline", is_open_std)
File "/usr/local/lib/python3.7/site-packages/paddle_serving_server/env_check/run.py", line 66, in run_test_cases
mv_log_to_new_dir(new_dir_path)
File "/usr/local/lib/python3.7/site-packages/paddle_serving_server/env_check/run.py", line 48, in mv_log_to_new_dir
shutil.move(file_path, dir_path)
File "/usr/local/lib/python3.7/shutil.py", line 555, in move
raise Error("Destination path '%s' already exists" % real_dst)
shutil.Error: Destination path '/home/work/Pipeline_test_cpu/PipelineServingLogs' already exists
```
<a name="2"></a>
## 服务启动与关闭
服务启动需要三类文件,PYTHON 程序、模型文件和配置文件。以[Python Pipeline 快速部署案例](../Quick_Start_CN.md)为例,
```
.
├── config.yml
├── imgs
│   └── ggg.png
├── ocr_det_client
│   ├── serving_client_conf.prototxt
│   └── serving_client_conf.stream.prototxt
├── ocr_det_model
│   ├── inference.pdiparams
│   ├── inference.pdmodel
│   ├── serving_server_conf.prototxt
│   └── serving_server_conf.stream.prototxt
├── ocr_det.tar.gz
├── ocr_rec_client
│   ├── serving_client_conf.prototxt
│   └── serving_client_conf.stream.prototxt
├── ocr_rec_model
│   ├── inference.pdiparams
│   ├── inference.pdmodel
│   ├── serving_server_conf.prototxt
│   └── serving_server_conf.stream.prototxt
├── pipeline_http_client.py
├── pipeline_rpc_client.py
├── ppocr_keys_v1.txt
└── web_service.py
```
启动服务端程序运行 `web_service.py`,启动客户端程序运行 `pipeline_http_client.py``pipeline_rpc_client.py`。服务端启动的日志信息在 `PipelineServingLogs` 目录下可用于调试。
```
├── PipelineServingLogs
│   ├── pipeline.log
│   ├── pipeline.log.wf
│   └── pipeline.tracer
```
关闭程序可使用2种方式,
- 前台关闭程序:`Ctrl+C` 关停服务
- 后台关闭程序:
```python
python3 -m paddle_serving_server.serve stop # 触发 SIGINT 信号
python3 -m paddle_serving_server.serve kill # 触发 SIGKILL 信号,强制关闭
```
<a name="3"></a>
## 本地与远程推理
本地推理是指在服务所在机器环境下开启多进程推理,而远程推理是指本地服务请求远程 C++ Serving 推理服务。
本地推理的优势是实现简单,一般本地处理相比于远程推理耗时更低。而远程推理的优势是可实现 Python Pipeline 较难实现的功能,如部署加密模型,大模型推理。
Python Pipeline 的本地推理可参考如下配置,在 `uci` op 中 增加 `local_service_conf` 配置,并设置 `client_type: local_predictor`
```
op:
uci:
#并发数,is_thread_op=True时,为线程并发;否则为进程并发
concurrency: 10
#当op配置没有server_endpoints时,从local_service_conf读取本地服务配置
local_service_conf:
#uci模型路径
model_config: uci_housing_model
#计算硬件类型: 空缺时由devices决定(CPU/GPU),0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
device_type: 0
#计算硬件ID,优先由device_type决定硬件类型。devices为""或空缺时为CPU预测;当为"0", "0,1,2"时为GPU预测,表示使用的GPU卡
devices: "" # "0,1"
#client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测
client_type: local_predictor
#Fetch结果列表,以client_config中fetch_var的alias_name为准
fetch_list: ["price"]
```
Python Pipeline 的远程推理可参考如下配置,设置 `client_type: brpc``server_endpoints``timeout` 和本地 `client_config`
```
op:
bow:
#并发数,is_thread_op=True时,为线程并发;否则为进程并发
concurrency: 1
#client连接类型,brpc
client_type: brpc
#Serving交互重试次数,默认不重试
retry: 1
#Serving交互超时时间, 单位ms
timeout: 3000
#Serving IPs
server_endpoints: ["127.0.0.1:9393"]
#bow模型client端配置
client_config: "imdb_bow_client_conf/serving_client_conf.prototxt"
#Fetch结果列表,以client_config中fetch_var的alias_name为准
fetch_list: ["prediction"]
```
<a name="4"></a>
## 批量推理
Pipeline 支持批量推理,通过增大 batch size 可以提高 GPU 利用率。Python Pipeline 支持3种 batch 形式以及适用的场景如下:
- 场景1:客户端打包批量数据(Client Batch)
- 场景2:服务端合并多个请求动态合并批量(Server auto-batching)
- 场景3:拆分一个大批量的推理请求为多个小批量推理请求(Server mini-batch)
<a name="4.1"></a>
**一.客户端打包批量数据**
当输入数据是 numpy 类型,如shape 为[4, 3, 512, 512]的 numpy 数据,即4张图片,可直接作为输入数据。
当输入数据的 shape 不同时,需要按最大的shape的尺寸 Padding 对齐后发送给服务端
<a name="4.2"></a>
**二.服务端合并多个请求动态合并批量**
有助于提升吞吐和计算资源的利用率,当多个请求的 shape 尺寸不相同时,不支持合并。当前有2种合并策略,分别是:
- 等待时间与最大批量结合(推荐):结合`batch_size``auto_batching_timeout`配合使用,实际请求的批量条数超过`batch_size`时会立即执行,不超过时会等待`auto_batching_timeout`时间再执行
```
op:
bow:
# 并发数,is_thread_op=True时,为线程并发;否则为进程并发
concurrency: 1
# client连接类型,brpc, grpc和local_predictor
client_type: brpc
# Serving IPs
server_endpoints: ["127.0.0.1:9393"]
# bow模型client端配置
client_config: "imdb_bow_client_conf/serving_client_conf.prototxt"
# 批量查询Serving的数量, 默认1。batch_size>1要设置auto_batching_timeout,否则不足batch_size时会阻塞
batch_size: 2
# 批量查询超时,与batch_size配合使用
auto_batching_timeout: 2000
```
- 阻塞式等待:仅设置`batch_size`,不设置`auto_batching_timeout``auto_batching_timeout=0`,会一直等待接受 `batch_size` 个请求后再推理。
```
op:
bow:
# 并发数,is_thread_op=True时,为线程并发;否则为进程并发
concurrency: 1
# client连接类型,brpc, grpc和local_predictor
client_type: brpc
# Serving IPs
server_endpoints: ["127.0.0.1:9393"]
# bow模型client端配置
client_config: "imdb_bow_client_conf/serving_client_conf.prototxt"
# 批量查询Serving的数量, 默认1。batch_size>1要设置auto_batching_timeout,否则不足batch_size时会阻塞
batch_size: 2
# 批量查询超时,与batch_size配合使用
auto_batching_timeout: 2000
```
<a name="4.3"></a>
**三.Mini-Batch**
拆分一个批量数据推理请求成为多个小块推理:会降低批量数据 Padding 对齐的大小,从而提升速度。可参考 [OCR 示例](),核心思路是拆分数据成多个小批量,放入 list 对象 feed_list 并返回
```
def preprocess(self, input_dicts, data_id, log_id):
(_, input_dict), = input_dicts.items()
raw_im = input_dict["image"]
data = np.frombuffer(raw_im, np.uint8)
im = cv2.imdecode(data, cv2.IMREAD_COLOR)
dt_boxes = input_dict["dt_boxes"]
dt_boxes = self.sorted_boxes(dt_boxes)
feed_list = []
img_list = []
max_wh_ratio = 0
## Many mini-batchs, the type of feed_data is list.
max_batch_size = len(dt_boxes)
# If max_batch_size is 0, skipping predict stage
if max_batch_size == 0:
return {}, True, None, ""
boxes_size = len(dt_boxes)
batch_size = boxes_size // max_batch_size
rem = boxes_size % max_batch_size
for bt_idx in range(0, batch_size + 1):
imgs = None
boxes_num_in_one_batch = 0
if bt_idx == batch_size:
if rem == 0:
continue
else:
boxes_num_in_one_batch = rem
elif bt_idx < batch_size:
boxes_num_in_one_batch = max_batch_size
else:
_LOGGER.error("batch_size error, bt_idx={}, batch_size={}".
format(bt_idx, batch_size))
break
start = bt_idx * max_batch_size
end = start + boxes_num_in_one_batch
img_list = []
for box_idx in range(start, end):
boximg = self.get_rotate_crop_image(im, dt_boxes[box_idx])
img_list.append(boximg)
h, w = boximg.shape[0:2]
wh_ratio = w * 1.0 / h
max_wh_ratio = max(max_wh_ratio, wh_ratio)
_, w, h = self.ocr_reader.resize_norm_img(img_list[0],
max_wh_ratio).shape
imgs = np.zeros((boxes_num_in_one_batch, 3, w, h)).astype('float32')
for id, img in enumerate(img_list):
norm_img = self.ocr_reader.resize_norm_img(img, max_wh_ratio)
imgs[id] = norm_img
feed = {"x": imgs.copy()}
feed_list.append(feed)
return feed_list, False, None, ""
```
<a name="5"></a>
## 单机多卡推理
单机多卡推理与 `config.yml` 中配置4个参数关系紧密,`is_thread_op``concurrency``device_type``devices`,必须在进程模型和 GPU 模式,每张卡上可分配多个进程,即 M 个 Op 进程与 N 个 GPU 卡绑定。
```
dag:
#op资源类型, True, 为线程模型;False,为进程模型
is_thread_op: False
op:
det:
#并发数,is_thread_op=True时,为线程并发;否则为进程并发
concurrency: 6
#当op配置没有server_endpoints时,从local_service_conf读取本地服务配置
local_service_conf:
client_type: local_predictor
# device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
device_type: 0
# 计算硬件 ID,当 devices 为""或不写时为 CPU 预测;当 devices 为"0", "0,1,2"时为 GPU 预测,表示使用的 GPU 卡
devices: "0,1,2"
```
以上述案例为例,`concurrency:6`,即启动6个进程,`devices:0,1,2`,根据轮询分配机制,得到如下绑定关系:
- 进程ID: 0 绑定 GPU 卡0
- 进程ID: 1 绑定 GPU 卡1
- 进程ID: 2 绑定 GPU 卡2
- 进程ID: 3 绑定 GPU 卡0
- 进程ID: 4 绑定 GPU 卡1
- 进程ID: 5 绑定 GPU 卡2
- 进程ID: 6 绑定 GPU 卡0
对于更灵活的进程与 GPU 卡绑定方式,会持续开发。
<a name="6"></a>
## 多种计算芯片上推理
除了支持 CPU、GPU 芯片推理之外,Python Pipeline 还支持在多种计算硬件上推理。根据 `config.yml` 中的 `device_type``devices`来设置推理硬件和加速库如下:
- CPU(Intel) : 0
- GPU(GPU / Jetson / 海光 DCU) : 1
- TensorRT : 2
- CPU(Arm) : 3
- XPU : 4
- Ascend310 : 5
- ascend910 : 6
当不设置`device_type`时,根据 `devices` 来设置,即当 `device_type` 为 "" 或空缺时为 CPU 推理;当有设定如"0,1,2"时,为 GPU 推理,并指定 GPU 卡。
以使用 XPU 的编号为0卡为例,配合 `ir_optim` 一同开启,`config.yml`详细配置如下:
```
# 计算硬件类型
device_type: 4
# 计算硬件ID,优先由device_type决定硬件类型
devices: "0"
# 开启ir优化
ir_optim: True
```
<a name="7"></a>
## TensorRT 推理加速
TensorRT 是一个高性能的深度学习推理优化器,在 Nvdia 的 GPU 硬件平台运行的推理框架,为深度学习应用提供低延迟、高吞吐率的部署推理。
通过设置`device_type``devices``ir_optim` 字段即可实现 TensorRT 高性能推理。必须同时设置 `ir_optim: True` 才能开启 TensorRT。
```
op:
imagenet:
#并发数,is_thread_op=True时,为线程并发;否则为进程并发
concurrency: 1
#当op配置没有server_endpoints时,从local_service_conf读取本地服务配置
local_service_conf:
#uci模型路径
model_config: serving_server/
#计算硬件类型: 空缺时由devices决定(CPU/GPU),0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
device_type: 2
#计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡
devices: "1" # "0,1"
#client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测
client_type: local_predictor
#Fetch结果列表,以client_config中fetch_var的alias_name为准
fetch_list: ["score"]
#开启 ir_optim
ir_optim: True
```
<a name="8"></a>
## MKL-DNN 推理加速
MKL-DNN 针对 Intel CPU 和 GPU 的数学核心库,对深度学习网络进行算子和指令集的性能优化,从而提升执行速度。Paddle 框架已集成了 MKL-DNN。
目前仅支持 Intel CPU 推理加速,通过设置`device_type``devices``use_mkldnn` 字段使用 MKL-DNN。
```
op:
imagenet:
#并发数,is_thread_op=True时,为线程并发;否则为进程并发
concurrency: 1
#当op配置没有server_endpoints时,从local_service_conf读取本地服务配置
local_service_conf:
#uci模型路径
model_config: serving_server/
#计算硬件类型: 空缺时由devices决定(CPU/GPU),0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
device_type: 0
#计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡
devices: ""
#client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测
client_type: local_predictor
#Fetch结果列表,以client_config中fetch_var的alias_name为准
fetch_list: ["score"]
#开启 MKLDNN
use_mkldnn: True
```
<a name="9"></a>
## 低精度推理
Pipeline Serving支持低精度推理,CPU、GPU和TensoRT支持的精度类型如下图所示:
低精度推理需要有量化模型,配合`config.yml`配置一起使用,以[低精度示例](../Low_Precision_CN.md) 为例
<a name="9.1"></a>
**一.CPU 低精度推理**
通过设置,`device_type``devices` 字段使用 CPU 推理,通过调整`precision``thread_num``use_mkldnn`参数选择低精度和性能调优。
```
op:
imagenet:
#并发数,is_thread_op=True时,为线程并发;否则为进程并发
concurrency: 1
#当op配置没有server_endpoints时,从local_service_conf读取本地服务配置
local_service_conf:
#uci模型路径
model_config: serving_server/
#计算硬件类型: 空缺时由devices决定(CPU/GPU),0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
device_type: 0
#计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡
devices: ""
#client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测
client_type: local_predictor
#Fetch结果列表,以client_config中fetch_var的alias_name为准
fetch_list: ["score"]
#精度,CPU 支持: "fp32"(default), "bf16"(mkldnn); 不支持: "int8"
precision: "bf16"
#CPU 算数计算线程数,默认4线程
thread_num: 10
#开启 MKLDNN
use_mkldnn: True
```
<a name="9.2"></a>
**二.GPU 和 TensorRT 低精度推理**
通过设置`device_type``devices` 字段使用原生 GPU 或 TensorRT 推理,通过调整`precision``ir_optim``use_calib`参数选择低精度和性能调优,如开启 TensorRT,必须一同开启`ir_optim``use_calib`仅配合 int8 使用。
```
op:
imagenet:
#并发数,is_thread_op=True时,为线程并发;否则为进程并发
concurrency: 1
#当op配置没有server_endpoints时,从local_service_conf读取本地服务配置
local_service_conf:
#uci模型路径
model_config: serving_server/
#计算硬件类型: 空缺时由devices决定(CPU/GPU),0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
device_type: 2
#计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡
devices: "1" # "0,1"
#client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测
client_type: local_predictor
#Fetch结果列表,以client_config中fetch_var的alias_name为准
fetch_list: ["score"]
#精度,GPU 支持: "fp32"(default), "fp16", "int8"
precision: "int8"
#开启 TensorRT int8 calibration
use_calib: True
#开启 ir_optim
ir_optim: True
```
<a name="9.3"></a>
**三.性能测试**
测试环境如下:
- GPU 型号: A100-40GB
- CPU 型号: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz * 160
- CUDA: CUDA Version: 11.2
- CuDNN: 8.0
测试方法:
- 模型: Resnet50 量化模型
- 部署方法: Python Pipeline 部署
- 计时方法: 刨除第一次运行初始化,运行100次计算平均值
在此环境下测试不同精度推理结果,GPU 推理性能较好的配置是
- GPU + int8 + ir_optim + TensorRT + use_calib : 15.1 ms
- GPU + fp16 + ir_optim + TensorRT : 17.2 ms
CPU 推理性能较好的配置是
- CPU + bf16 + MKLDNN : 18.2 ms
- CPU + fp32 + thread_num=10 : 18.4 ms
完整性能指标如下:
<div align=center>
<img src='../images/low_precision_profile.png' height = "600" align="middle"/>
</div
<a name="10"></a>
## 复杂图结构 DAG 跳过某个 Op 运行
此应用场景一般在 Op 前后处理中有 if 条件判断时,不满足条件时,跳过后面处理。实际做法是在跳过此 Op 的 process 阶段,只要在 preprocess 做好判断,跳过 process 阶段,在和 postprocess 后直接返回即可。
preprocess 返回结果列表的第二个结果是 `is_skip_process=True` 表示是否跳过当前 Op 的 process 阶段,直接进入 postprocess 处理。
```python
## Op::preprocess() 函数实现
def preprocess(self, input_dicts, data_id, log_id):
"""
In preprocess stage, assembling data for process stage. users can
override this function for model feed features.
Args:
input_dicts: input data to be preprocessed
data_id: inner unique id
log_id: global unique id for RTT
Return:
input_dict: data for process stage
is_skip_process: skip process stage or not, False default
prod_errcode: None default, otherwise, product errores occured.
It is handled in the same way as exception.
prod_errinfo: "" default
"""
# multiple previous Op
if len(input_dicts) != 1:
_LOGGER.critical(
self._log(
"Failed to run preprocess: this Op has multiple previous "
"inputs. Please override this func."))
os._exit(-1)
(_, input_dict), = input_dicts.items()
return input_dict, False, None, ""
```
以下示例 Jump::preprocess() 重载了原函数,返回了 True 字段
```python
class JumpOp(Op):
## Overload func JumpOp::preprocess
def preprocess(self, input_dicts, data_id, log_id):
(_, input_dict), = input_dicts.items()
if input_dict.has_key("jump"):
return input_dict, True, None, ""
else
return input_dict, False, None, ""
```
# Python Pipeline 框架
在许多深度学习框架中,模型服务化部署通常用于单模型的一键部署。但在 AI 工业大生产的背景下,端到端的单一深度学习模型不能解决复杂问题,多个深度学习模型组合使用是解决现实复杂问题的常规手段,如文字识别 OCR 服务至少需要检测和识别2种模型;视频理解服务一般需要视频抽帧、切词、音频处理、分类等多种模型组合实现。当前,通用多模型组合服务的设计和实现是非常复杂的,既要能实现复杂的模型拓扑关系,又要保证服务的高并发、高可用和易于开发和维护等。
Paddle Serving 实现了一套通用的多模型组合服务编程框架 Python Pipeline,不仅解决上述痛点,同时还能大幅提高 GPU 利用率,并易于开发和维护。
Python Pipeline 使用案例请阅读[Python Pipeline 快速部署案例](../Quick_Start_CN.md)
通过阅读以下内容掌握 Python Pipeline 核心功能和使用方法、高阶功能用法和性能优化指南等。
- [Python Pipeline 框架设计](./Pipeline_Design_CN.md)
- [Python Pipeline 核心功能](./Pipeline_Features_CN.md)
- [Python Pipeline 优化指南](./Pipeline_Optimize_CN.md)
- [Python Pipeline 性能指标](./Pipeline_Benchmark_CN.md)
# Python Pipeline 优化指南
- [优化响应时长](#1)
- [1.1 分析响应时长](#1.1)
- [Pipeline Trace Tool](#1.1.1)
- [Pipeline Profile Tool](#1.1.2)
- [1.2 优化思路](#1.2)
- [优化服务吞吐](#2)
- [2.1 分析吞吐瓶颈](#2.1)
- [2.2 优化思路](#2.2)
- [增加 Op 并发](#2.2.1)
- [动态批量](#2.2.2)
- [CPU 与 GPU 处理分离](#2.2.3)
通常,服务的性能优化是基于耗时分析,首先要掌握服务运行的各阶段耗时信息,从中找到耗时最长的性能瓶颈再做针对性优化。对于模型推理服务化不仅要关注耗时,由于 GPU 芯片昂贵,更要关注服务吞吐,从而提升 GPU 利用率实现降本增效。因此,模型推理服务化可总结为:
- 优化响应时长
- 优化服务吞吐
经过分析和调优后,各个阶段实现整体服务的性能最优。
<a name="1"></a>
## 优化响应时长
首先,优化响应时长的主要思路首先要掌握各阶段耗时,并分析出性能瓶颈或者耗时占比较高的阶段,再针对性能瓶颈做专项优化。
Paddle Serving 提供2种耗时分析工具,`Pipeline Trace Tool``Pipeline Profile Tool`。2个工具的特点如下:
- Pipeline Trace Tool : 统计服务端所有进程各个阶段的平均耗时,包括每个 `Op``Channel`,用于定量分析。
- Pipeline Profile Tool : 是可视化 Trace View 工具,生成多进程并发效果图,用定性和定量分析执行和并发效果。
<a name="1.1"></a>
**一.耗时分析**
<a name="1.1.1"></a>
1.Pipeline Trace Tool
`Pipeline Trace Tool` 统计每个 `Op``Channel` 中各阶段的处理耗时,
开启方法在配置文件 `config.yml``dag` 区段内添加 `tracer` 字段,框架会每隔 `interval_s` 时间生成 Trace 信息。
```
dag:
#op资源类型, True, 为线程模型;False,为进程模型
is_thread_op: True
#tracer, 跟踪框架吞吐,每个OP和channel的工作情况。无tracer时不生成数据
tracer:
#每次trace的时间间隔,单位秒/s
interval_s: 10
```
生成的 Trace 信息保存在 `./PipelineServingLogs/pipeline.tracer` 日志中。如下图所示
```
==================== TRACER ======================
Op(uci):
in[8473.507333333333 ms]: # 等待前置 Channel 中数据放入 Op 的耗时,如长时间无请求,此值会变大
prep[0.6753333333333333 ms] # 推理前处理 preprocess 阶段耗时
midp[26.476333333333333 ms] # 推理 process 阶段耗时
postp[1.8616666666666666 ms] # 推理后处理 postprocess 阶段耗时
out[1.3236666666666668 ms] # 后处理结果放入后置 channel 耗时
idle[0.9965882097324374] # 框架自循环耗时,间隔 1 ms,如此值很大说明系统负载高,调度变慢
DAGExecutor:
Query count[30] # interval_s 间隔时间内请求数量
QPS[27.35 q/s] # interval_s 间隔时间内服务 QPS
Succ[1.0] # interval_s 间隔时间内请求成功率
Error req[] # 异常请求信息
Latency:
ave[36.55233333333334 ms] # 平均延时
.50[8.702 ms] # 50分位延时
.60[8.702 ms] # 60分位延时
.70[92.346 ms] # 70分位延时
.80[92.346 ms] # 70分位延时
.90[92.346 ms] # 90分位延时
.95[92.346 ms] # 95分位延时
.99[92.346 ms] # 99分位延时
Channel (server worker num[1]):
chl0(In: ['@DAGExecutor'], Out: ['uci']) size[0/0] # 框架 RequestOp 与 uci Op 之间 Channel 中堆积请求数。此值较大,说明下游 uci Op 消费能力不足。
chl1(In: ['uci'], Out: ['@DAGExecutor']) size[0/0] # uci Op 与 框架 ResponseOp 之间 Channel 中堆积的请求数。此值较大,说明下游 ReponseOp 消费能力不足。
==================== TRACER ======================
```
<a name="1.1.2"></a>
2.Pipeline Profile Tool
```
dag:
#op资源类型, True, 为线程模型;False,为进程模型
is_thread_op: True
#使用性能分析, 默认为 False,imeline性能数据,对性能有一定影响
use_profile: True,
```
开启后,Server 端在预测的过程中会将对应的日志信息打印到`标准输出`,为了更直观地展现各阶段的耗时,因此服务启动要使用如下命令:
```
python3.7 web_service.py > profile.txt 2>&1
```
服务接收请求后,输出 Profile 信息到 `profile.txt` 文件中。再粘贴如下代码到 `trace.py`, 使用框架提供 Analyst 模块对日志文件做进一步的分析处理。
```
from paddle_serving_server.pipeline import Analyst
import json
import sys
if __name__ == "__main__":
log_filename = "profile.txt"
trace_filename = "trace"
analyst = Analyst(log_filename)
analyst.save_trace(trace_filename)
```
运行命令,脚本将日志中的时间打点信息转换成 json 格式保存到 `trace` 文件。
```
python3.7 trace.py
```
`trace` 文件可以通过 `chrome` 浏览器的 `tracing` 功能进行可视化。
```
打开 chrome 浏览器,在地址栏输入 chrome://tracing/ ,跳转至 tracing 页面,点击 load 按钮,打开保存的 trace 文件,即可将预测服务的各阶段时间信息可视化。
```
通过图示中并发请求的处理流程可观测到推理阶段的流水线状态,以及多个请求在推理阶段的`间隔`信息,进行优化。
<a name="1.2"></a>
**二.降低响应时长优化思路**
根据 `Pipeline Trace Tool` 输出结果在不同阶段耗时长的问题,常见场景的优化方法如下:
- Op 推理阶段(midp) 耗时长:
- 增加 Op 并发度
- 开启 auto-batching (前提是多个请求的 shape 一致)
- 若批量数据中某条数据的 shape 很大,padding 很大导致推理很慢,可参考 OCR 示例中 mini-batch 方法。
- 开启 TensorRT/MKL-DNN 优化
- 开启低精度推理
- Op 前处理阶段(prep) 或 后处理阶段耗时长:
- 增加 OP 并发度
- 优化前后处理逻辑
- in/out 耗时长(channel 堆积>5)
- 检查 channel 传递的数据大小,可能为传输的数据大导致延迟大。
- 优化传入数据,不传递数据或压缩后再传入
- 增加 Op 并发度
- 减少上游 Op 并发度
根据 `Pipeline Profile Tool` 输出结果优化流水行并发的效果
- 增加 Op 并发度,或调整不同 Op 的并发度
- 开启 auto-batching
此外,还有一些优化思路,如将 CPU 处理较慢的过程转换到 GPU 上处理等,客户端与服务端传输较大数据时,可使用共享内存方式传递内存或显存地址等。
<a name="2"></a>
## 优化服务吞吐
<a name="2.1"></a>
**一.分析吞吐瓶颈**
服务的吞吐量受到多种多因素条件制约,如 Op 处理时长、传输数据耗时、并发数和 DAG 图结构等,可以将这些因素进一步拆解,当传输数据不是极端庞大的时候,最重要因素是流水线中`最慢 Op 的处理时长和并发数`
```
Op 处理时长:
op_cost = process(pre + mid + post)
服务吞吐量:
service_throughput = 1 / 最慢 op_cost * 并发数
服务平响:
service_avg_cost = ∑op_concurrency 【关键路径】
批量预测平均耗时:
avg_batch_cost = (N * pre + mid + post) / N
```
<a name="2.2"></a>
**二.优化思路**
优化吞吐的主要方法是 `增大 Op 并发数``自动批量``CPU 与 GPU 处理分离`
<a name="2.2.1"></a>
1.增加 Op 并发**
调整 Op 的并发数量通过设置 `is_thread_op: False` 进程类型 Op 和 `uci` Op 的 `concurrency` 字段
```
dag:
#op资源类型, True, 为线程模型;False,为进程模型
is_thread_op: False
op:
uci:
#并发数,is_thread_op=True时,为线程并发;否则为进程并发
concurrency: 10
```
Op 的进程数量不是越大越好,受到机器 CPU 核数、内存和显存大小的限制,推荐设置 Op 的并发数不超过系统 CPU 核数。
<a name="2.2.2"></a>
2.动态批量
动态批量是增加吞吐的有一种方法,开启方式可参考[Python Pipeline 核心功能](./Pipeline_Features_CN.md#批量推理)
<a name="2.2.3"></a>
3.CPU 与 GPU 处理分离
`CV` 模型中,对图片或视频的前后处理成为主要瓶颈时,可考虑此方案,即将前后处理过程独立成一个 Op 并独立设置并发度。
将 CPU 前后处理和 GPU 推理过程比例调整到服务最佳配比。以 OCR 为例,原有流水线设计为 `RequestOp -> DetOp -> RecOp -> ResponseOp`
根据耗时分析,`DetOp``RecOp` 的前处理耗时很长,因此,将2个模型前处理分离成独立 Op,最新的流水线设计为:
`RequestOp -> PreDetOp -> DetOp -> PreRecOp -> RecOp -> ResponseOp`,并调大 `PreDetOp``PreRecOp`的并发度,从而获得 20% 的性能提升。
由于增加了2次数据传递,单条请求的处理延时会增加。
## 在Kubenetes集群上部署Paddle Serving
# Kubernetes 集群部署
Paddle Serving在0.6.0版本开始支持在Kubenetes集群上部署,并提供反向代理和安全网关支持。与Paddle Serving在Docker镜像中开发类似,Paddle Serving 模型在Kubenetes集群部署需要制作轻量化的运行镜像,并使用kubectl工具在集群上部署
Kubernetes 是一个基于容器技术的分布式架构的解决方案,是云原生容器集群管理系统,提供服务发现与负载均衡、存储编排、自动部署和回滚、资源管理、自动恢复以及密钥和配置管理。Paddle Serving 支持 Kubenetes 集群部署方案,为企业级用户提供集群部署示例
### 1.集群准备
## 部署方案
如果您还没有Kubenetes集群,我们推荐[购买并使用百度智能云CCE集群](https://cloud.baidu.com/doc/CCE/index.html). 如果是其他云服务商提供的集群,或者自行安装Kubenetes集群,请遵照对应的教程。
为了解决 Pod 迁移、Node Pod 端口、域名动态分配等问题,选择使用 Ingress 解决方案,对外提供可访问的 URL、负载均衡、SSL、基于名称的虚拟主机等功能。在众多 Ingress 插件中选用 Kong 作为微服务的 API 网关,因其具备以下优势:
- 拥有丰富的微服务功能,如 API认证、鉴权、DDos保护和灰度部署等
- 提供一些 API、服务的定义,可抽象成 Kubernetes 的 CRD,通过 Kubernetes Ingress 配置实现同步状态到 Kong 集群
- 集群配置信息存储在 postgres 数据库,配置信息实现全局节点共享和实时同步
- 有成熟的第三方管理 UI,实现可视化管理 Kong 配置
您还需要准备一个用于Kubenetes集群部署使用的镜像仓库,通常与云服务提供商绑定,如果您使用的是百度智能云的CCE集群,可以参照[百度智能云CCR镜像仓库使用方式](https://cloud.baidu.com/doc/CCR/index.html)。当然Docker Hub也可以作为镜像仓库,但是可能在部署时会出现下载速度慢的情况
Paddle Serving 的 Kubernetes 集群部署方案设计如下图所示,用户流量通过 Kong Ingress 转发到 Kubernetes 集群。Kubernetes 集群负责管理 Service 和 Pod 实例
### 2.环境准备
<p align="center">
<img src="../images/kubernetes_design.png" width="800">
</p>
需要在Kubenetes集群上安装网关工具KONG。
## 部署步骤
```
kubectl apply -f https://bit.ly/kong-ingress-dbless
```
### 选择Serving开发镜像 (可选)
您可以直接选择已生成的Serving [DOCKER开发镜像列表](./Docker_Images_CN.md)作为Kubernetes部署的首选,携带了开发工具,可用于调试和编译代码。
**一. 准备环境**
### 制作Serving运行镜像(可选)
推荐[购买并使用百度智能云 CCE 集群](https://cloud.baidu.com/doc/CCE/index.html),提供完整的部署环境。如自行安装 Kubenetes 集群,请参考[教程](https://kubernetes.io/zh/docs/setup/)
[DOCKER开发镜像列表](./Docker_Images_CN.md)文档相比,开发镜像用于调试、编译代码,携带了大量的开发工具,因此镜像体积较大。运行镜像通常容器体积更小的轻量级容器,可在边缘端设备上部署。如您不需要轻量级运行容器,请直接跳过这一部分
此外,还需要准备一个用于 Kubenetes 集群部署的镜像仓库,通常与云服务提供商绑定,如果使用百度智能云的CCE集群,可以参照[百度智能云 CCR 镜像仓库使用方式](https://cloud.baidu.com/doc/CCR/index.html)。当然 Docker Hub 也可以作为镜像仓库,但下载速度慢,集群扩容时间较长
我们提供了运行镜像的生成脚本在Serving代码库下`tools/generate_runtime_docker.sh`文件,通过以下命令可生成代码。
在 Kubenetes 集群中运行下面命令,安装网关工具 Kong
```bash
bash tools/generate_runtime_docker.sh --env cuda10.1 --python 3.7 --image_name serving_runtime:cuda10.1-py37 --paddle 2.2.0 --serving 0.8.0
```
kubectl apply -f https://bit.ly/kong-ingress-dbless
```
会生成 cuda10.1,python 3.7,serving版本0.8.0 还有 paddle版本2.2.2的运行镜像。如果有其他疑问,可以执行下列语句得到帮助信息。强烈建议您使用最新的paddle和serving的版本(2个版本是对应的如paddle 2.2.0 与serving 0.7.x对应,paddle 2.2.2 与 serving 0.8.x对应),因为更早的版本上出现的错误只在最新版本修复,无法在历史版本中修复。
**二. 安装 Kubernetes **
kubernetes 集群环境安装和启动步骤如下,并使用 kubectl 命令与通过它与 Kubernetes 进行交互和管理。
```
bash tools/generate_runtime_docker.sh --help
// close OS firewall
systemctl disable firewarlld
systemctl stop firewarlld
// install etcd & kubernetes
yum install -y etcd kubernetes
// start etcd & kubernetes
systemctl start etcd
systemctl start docker
systemctl start kube-apiserver
systemctl start kube-controller-manager
systemctl start kube-scheduler
systemctl start kubelet
systemctl start kube-proxy
```
运行镜像会携带以下组建在运行镜像中
- paddle-serving-server, paddle-serving-client,paddle-serving-app,paddlepaddle,具体版本可以在tools/runtime.dockerfile当中查看,同时,如果有定制化的需求,也可以在该文件中进行定制化。
- paddle-serving-server 二进制可执行程序
**二. 制作镜像**
也就是说,运行镜像在生成之后,我们只需要将我们运行的代码(如果有)和模型搬运到镜像中就可以。生成后的镜像名为`paddle_serving:cuda10.2-py37`
首先,可直接使用 Paddle Serving 提供的镜像作为 Base 制作业务镜像,或者重新制作镜像。Paddle Serving 提供以下3种镜像,区别如下:
- 开发镜像:安装多种开发工具,可用于调试和编译代码,镜像体积较大。
- 运行镜像:安装运行 Serving 的必备工具,经过裁剪后镜像体积较小,适合在存储受限场景使用
- Java 镜像:为 Java SDK 提供基础环境,包括 JRE、JDK 和 Maven
- XPU 镜像:为 Arm 或 异构硬件(百度昆仑、海光DCU)环境部署
### 添加您的代码和模型
完整镜像列表,请参考 [DOCKER 开发镜像列表](./Docker_Images_CN.md)
在刚才镜像的基础上,我们需要先收集好运行文件。这取决于您是如何使用PaddleServing的
制作镜像的整体步骤如下,这里选定 Serving 运行镜像,相比于开发镜像体积更小,镜像内已安装相关的依赖和 Serving wheel 包。
1.选定运行镜像:registry.baidubce.com/paddlepaddle/serving:0.8.3-cuda10.1-cudnn7-runtime
2.运行镜像并拷贝模型和服务代码到镜像中,当你需要部署外部其他模型时,更换模型和代码即可。
3.制作并上传新镜像
#### Pipeline模式:
假定已完成上述3个前置运行镜像并拷贝模型到镜像中,看具体操作。
```bash
# Run docker
nvidia-docker run --rm -dit --name pipeline_serving_demo registry.baidubce.com/paddlepaddle/serving:0.8.0-cuda10.1-cudnn7-runtime bash
对于pipeline模式,我们需要确保模型和程序文件、配置文件等各种依赖都能够在镜像中运行。因此可以在`/home/project`下存放我们的执行文件时,我们以`Serving/examples/Pipeline/PaddleOCR/ocr`为例,这是OCR文字识别任务。
# Enter your serving repo, and download OCR models
cd /home/work/Serving/examples/Pipeline/PaddleOCR/ocr
```bash
#假设您已经拥有Serving运行镜像,假设镜像名为paddle_serving:cuda10.2-py36
docker run --rm -dit --name pipeline_serving_demo paddle_serving:cuda10.2-py36 bash
cd Serving/examples/Pipeline/PaddleOCR/ocr
# get models
python -m paddle_serving_app.package --get_model ocr_rec
python3 -m paddle_serving_app.package --get_model ocr_rec
tar -xzvf ocr_rec.tar.gz
python -m paddle_serving_app.package --get_model ocr_det
python3 -m paddle_serving_app.package --get_model ocr_det
tar -xzvf ocr_det.tar.gz
cd ..
# Copy OCR directory to your docker
docker cp ocr pipeline_serving_demo:/home/
docker commit pipeline_serving_demo ocr_serving:latest
```
其中容器名`paddle_serving_demo`和最终的镜像名`ocr_serving:latest`都可以自行定义,最终通过`docker push`来推到云端的镜像仓库。至此,部署前的最后一步工作已完成。
**提示:如果您对runtime镜像是否可运行需要验证,可以执行**
# Commit and push it
docker commit pipeline_serving_demo registry.baidubce.com/paddlepaddle/serving:k8s_ocr_pipeline_0.8.3_post101
docker push registry.baidubce.com/paddlepaddle/serving:k8s_ocr_pipeline_0.8.3_post101
```
最终,你完成了业务镜像制作环节。通过拉取制作的镜像,创建Docker示例后,在`/home`路径下验证模型目录,通过以下命令验证 Wheel 包安装。
```
docker exec -it pipeline_serving_demo bash
cd /home/ocr
python3.6 web_service.py
pip3.7 list | grep paddle
```
进入容器到工程目录之后,剩下的操作和调试代码的工作是类似的。
**为了方便您对照,我们也提供了示例镜像registry.baidubce.com/paddlepaddle/serving:k8s-pipeline-demo**
#### WebService模式:
web service模式本质上和pipeline模式类似,因此我们以`Serving/examples/C++/PaddleNLP/bert`为例
```bash
#假设您已经拥有Serving运行镜像,假设镜像名为registry.baidubce.com/paddlepaddle/serving:0.8.0-cpu-py36
docker run --rm -dit --name webservice_serving_demo registry.baidubce.com/paddlepaddle/serving:0.8.0-cpu-py36 bash
cd Serving/examples/C++/PaddleNLP/bert
### download model
wget https://paddle-serving.bj.bcebos.com/paddle_hub_models/text/SemanticModel/bert_chinese_L-12_H-768_A-12.tar.gz
tar -xzf bert_chinese_L-12_H-768_A-12.tar.gz
mv bert_chinese_L-12_H-768_A-12_model bert_seq128_model
mv bert_chinese_L-12_H-768_A-12_client bert_seq128_client
sh get_data.sh
cd ..
docker cp bert webservice_serving_demo:/home/
docker commit webservice_serving_demo bert_serving:latest
输出显示已安装3个 Serving Wheel 包和1个 Paddle Wheel 包。
```
**提示:如果您对runtime镜像是否可运行需要验证,可以执行**
```bash
docker exec -it webservice_serving_demo bash
cd /home/bert
python3.6 bert_web_service.py bert_seq128_model 9292
paddle-serving-app 0.8.3
paddle-serving-client 0.8.3
paddle-serving-server-gpu 0.8.3.post101
paddlepaddle-gpu 2.2.2.post101
```
进入容器到工程目录之后,剩下的操作和调试代码的工作是类似的。
**为了方便您对照,我们也提供了示例镜像registry.baidubce.com/paddlepaddle/serving:k8s-web-demo**
**三. 集群部署**
Serving/tools/generate_k8s_yamls.sh 会生成 Kubernetes 部署配置。以 OCR 为例,运行以下命令生成 Kubernetes 集群配置。
```
sh tools/generate_k8s_yamls.sh --app_name ocr --image_name registry.baidubce.com/paddlepaddle/serving:k8s_ocr_pipeline_0.8.3_post101 --workdir /home/ocr --command "python3.7 web_service.py" --port 9999
```
生成信息如下:
```
named arg: app_name: ocr
named arg: image_name: registry.baidubce.com/paddlepaddle/serving:k8s_ocr_pipeline_0.8.3_post101
named arg: workdir: /home/ocr
named arg: command: python3.7 web_service.py
named arg: port: 9999
check k8s_serving.yaml and k8s_ingress.yaml please.
```
运行命令后,生成2个 yaml 文件,分别是 k8s_serving.yaml 和 k8s_ingress.yaml。执行以下命令启动 Kubernetes 集群 和 Ingress 网关。
### 在Kubenetes集群上部署
```
kubectl create -f k8s_serving.yaml
kubectl create -f k8s_ingress.yaml
```
kubenetes集群操作需要`kubectl`去操纵yaml文件。我们这里给出了三个部署的例子,他们分别是
Kubernetes 下常用命令
| 命令 | 说明 |
| --- | --- |
| kubectl create -f xxx.yaml | 使用 xxx.yml 创建资源对象 |
| kubectl apply -f xxx.yaml | 使用 xxx.yml 更新资源对象 |
| kubectl delete po mysql| 删除名为 mysql 的 pods |
| kubectl get all --all-namespace | 查询所有资源信息 |
| kubectl get po | 查询所有 pods |
| kubectl get namespace | 查询所有命名空间 |
| kubectl get rc | 查询所有|
| kubectl get services | 查询所有 services |
| kubectl get node | 查询所有 node 节点 |
| kubectl get deploy | 查询集群部署状态 |
- pipeline ocr示例
按下面4个步骤查询集群状态并进入 Pod 容器:
```bash
sh tools/generate_k8s_yamls.sh --app_name ocr --image_name registry.baidubce.com/paddlepaddle/serving:k8s-pipeline-demo --workdir /home/ocr --command "python3.6 web_service.py" --port 9999
1. 最终通过输入以下命令检验集群部署状态:
```
kubectl get deploy
- web service bert示例
```bash
sh tools/generate_k8s_yamls.sh --app_name bert --image_name registry.baidubce.com/paddlepaddle/serving:k8s-web-demo --workdir /home/bert --command "python3.6 bert_web_service.py bert_seq128_model 9292" --port 9292
```
**需要注意的是,app_name需要同URL的函数名相同。例如示例中bert的访问URL是`https://127.0.0.1:9292/bert/prediction`,那么app_name应为bert。**
接下来我们会看到有两个yaml文件,分别是`k8s_serving.yaml`和 k8s_ingress.yaml`.
为减少大家的阅读时间,我们只选择以pipeline为例。
```yaml
#k8s_serving.yaml
apiVersion: v1
kind: Service
metadata:
labels:
app: ocr
name: ocr
spec:
ports:
- port: 18080
name: http
protocol: TCP
targetPort: 18080
selector:
app: ocr
---
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: ocr
name: ocr
spec:
replicas: 1
selector:
matchLabels:
app: ocr
strategy: {}
template:
metadata:
creationTimestamp: null
labels:
app: ocr
spec:
containers:
- image: registry.baidubce.com/paddlepaddle/serving:k8s-pipeline-demo
name: ocr
ports:
- containerPort: 18080
workingDir: /home/ocr
name: ocr
command: ['/bin/bash', '-c']
args: ["python3.6 bert_web_service.py bert_seq128_model 9292"]
env:
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
resources: {}
```
```yaml
#kong_api.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
name: ocr
annotations:
kubernetes.io/ingress.class: kong
spec:
rules:
- http:
paths:
- path: /ocr
backend:
serviceName: ocr
servicePort: 18080
```
最终我们执行就可以启动相关容器和API网关。
```
部署状态如下:
```
kubectl apply -f k8s_serving.yaml
kubectl apply -f k8s_ingress.yaml
NAME READY UP-TO-DATE AVAILABLE AGE
ocr 1/1 1 1 10m
```
输入
2. 查询全部 Pod 信息 运行命令:
```
kubectl get deploy
kubectl get pods
```
可见
查询 Pod 信息如下:
```
NAME READY UP-TO-DATE AVAILABLE AGE
ocr 1/1 1 1 2d20h
NAME READY STATUS RESTARTS AGE
ocr-c5bd77d49-mfh72 1/1 Running 0 10m
uci-5bc7d545f5-zfn65 1/1 Running 0 52d
```
我们使用
3. 进入 Pod container 运行命令:
```
kubectl exec -ti ocr-c5bd77d49-mfh72 -n bash
```
4. 查询集群服务状态:
```
kubectl get service --all-namespaces
```
可以看到
集群部署状态如下:
```
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default bert ClusterIP 172.16.86.12 <none> 9292/TCP 20m
......@@ -255,16 +184,22 @@ kube-system kube-dns ClusterIP 172.16.0.10 <none>
kube-system metrics-server ClusterIP 172.16.34.157 <none> 443/TCP 28d
```
访问的方式就在
根据 kong-proxy 的 CLUSTER-IP 和 端口信息,访问 URL: http://172.16.88.132:80/ocr/prediction 查询 OCR 服务。
```:
http://${KONG_IP}:80/${APP_NAME}/prediction
```
例如Bert
**四.更新镜像**
假定更新了文件或数据,重新生成 k8s_serving.yaml 和 k8s_ingress.yaml。
```
sh tools/generate_k8s_yamls.sh --app_name ocr --image_name registry.baidubce.com/paddlepaddle/serving:k8s_ocr_pipeline_0.8.3_post101 --workdir /home/ocr --command "python3.7 web_service.py" --port 9999
```
curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"words": "hello"}], "fetch":["pooled_output"]}' http://172.16.88.132:80/bert/prediction
更新配置,并重启Pod
```
kubectl apply -f k8s_serving.yaml
kubectl apply -f k8s_ingress.yaml
# 查找 ocr 的 pod name
kubectl get pods
就会从KONG的网关转发给bert服务。同理,OCR服务也可以把对应的IP地址换成`http://172.16.88.132:80/ocr/prediction`
# 更新 pod
kubectl exec -it ocr-c5bd77d49-s8jwh -n default -- /bin/sh
```
# 怎样保存用于Paddle Serving的模型?
# 保存用于 Serving 部署的模型参数
(简体中文|[English](./Save_EN.md))
- [背景介绍](#1)
- [功能设计](#2)
- [功能使用](#3)
- [PYTHON 命令执行](#3.1)
- [代码引入执行](#3.2)
- [Serving 部署](#4)
- [服务端部署示例](#4.1)
- [客户端部署示例](#4.2)
## 保存用于 Serving 部署模型的意义
<a name="1"></a>
## 背景介绍
模型参数信息保存在模型文件中,为什么还要保存用于 Paddle Serving 部署的模型参数呢,原因有3个:
1. 服务化场景分为客户端和服务端,服务端加载模型,而在客户端没有模型信息,但需要在客户端需实现数据拼装和类型转换。
2. 模型升级过程中 `feed vars``fetch vars` 的名称变化会导致代码升级,通过增加一个 `alias_name` 字段映射名称,代码无需升级。
3. 部署 `Web` 服务,并使用 `URL` 方式访问时,请求信息中缺少类型和维度信息,在服务端推理前需要进行转换。
<a name="2"></a>
## 从已保存的模型文件中导出
如果已使用Paddle 的`save_inference_model`接口保存出预测要使用的模型,你可以使用Paddle Serving提供的名为`paddle_serving_client.convert`的内置模块进行转换。
```python
python -m paddle_serving_client.convert --dirname ./your_inference_model_dir
## 功能设计
飞桨训推一体框架中,从动态图模型训练到静态图推理部署,一体化流程如下所示
```
①动态图训练 → ②模型动转静 → ③静态模型 → ④模型保存 → ⑤Serving 部署
```
在飞桨框架2.1对模型与参数的保存与载入相关接口进行了梳理,完整文档参考[模型保存与载入](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/02_paddle2.0_develop/08_model_save_load_cn.html)
- 对于训练调优场景,我们推荐使用 `paddle.save/load` 保存和载入模型;
- 对于推理部署场景,我们推荐使用 `paddle.jit.save/load`(动态图)和 `paddle.static.save/load_inference_model` (静态图)保存载入模型;
也可以通过Paddle Serving的`inference_model_to_serving`接口转换成可用于Paddle Serving的模型文件。
```python
import paddle_serving_client.io as serving_io
serving_io.inference_model_to_serving(dirname, serving_server="serving_server", serving_client="serving_client", model_filename=None, params_filename=None)
Paddle Serving 模型参数保存接口定位是在 `②模型动转静` 导出 `③静态模型`后,使用 `paddle.static.load_inference_model` 接口加载模型,和 `paddle.static.save_vars` 接口保存模型参数。
生成的模型参数信息保存在 `paddle_serving_server/client.prototxt` 文件中,其格式如下
```
feed_var {
name: "x"
alias_name: "image"
is_lod_tensor: false
feed_type: 1
shape: 3
shape: 960
shape: 960
}
fetch_var {
name: "save_infer_model/scale_0.tmp_1"
alias_name: "save_infer_model/scale_0.tmp_1"
is_lod_tensor: false
fetch_type: 1
shape: 1
shape: 960
shape: 960
}
```
模块参数与`inference_model_to_serving`接口参数相同。
| 参数 | 描述 |
|------|---------|
| name | 实际变量名 |
| alias_name | 变量别名,与 name 的关联业务场景中变量名 |
| is_lod_tensor | 是否为 LOD Tensor |
| feed_type | feed 变量类型|
| fetch_type | fetch 变量类型|
| shape 数组 | 变量的 Shape 信息 |
feed 与 fetch 变量的类型列表如下:
| 类型 | 类型值 |
|------|------|
| int64 | 0 |
| float32 |1 |
| int32 | 2 |
| float64 | 3 |
| int16 | 4 |
| float16 | 5 |
| bfloat16 | 6 |
| uint8 | 7 |
| int8 | 8 |
| bool | 9 |
| complex64 | 10
| complex128 | 11 |
<a name="3"></a>
## 功能使用
Paddle 推理模型有3种形式,每种形式的读模型的方式都不同,散列方式必须以路径方式加载,其余2种采用目录或文件方式均可。
1) Paddle 2.0前版本:`__model__`, `__params__`
2) Paddle 2.0后版本:`*.pdmodel`, `*.pdiparams`
3) 散列:`__model__`, `conv2d_1.w_0`, `conv2d_2.w_0`, `fc_1.w_0`, `conv2d_1.b_0`, ...
`paddle_serving_client.convert` 接口既支持 PYTHON 命令方式执行,又支持 代码中引入运行。
| 参数 | 类型 | 默认值 | 描述 |
|--------------|------|-----------|--------------------------------|
| `dirname` | str | - | 需要转换的模型文件存储路径,Program结构文件和参数文件均保存在此目录。|
......@@ -29,24 +100,73 @@ serving_io.inference_model_to_serving(dirname, serving_server="serving_server",
| `model_filename` | str | None | 存储需要转换的模型Inference Program结构的文件名称。如果设置为None,则使用 `__model__` 作为默认的文件名 |
| `params_filename` | str | None | 存储需要转换的模型所有参数的文件名称。当且仅当所有模型参数被保>存在一个单独的二进制文件中,它才需要被指定。如果模型参数是存储在各自分离的文件中,设置它的值为None |
### 从动态图模型中导出
<a name="3.1"></a>
**一.PYTHON 命令执行**
首先需要安装 `paddle_serivng_client` 包,以目录方式加载模型。
示例一,是以模型路径方式加载模型,适用于全部3种类型。
```python
python3 -m paddle_serving_client.convert --dirname ./your_inference_model_dir
```
示例二,以指定加载 `当前路径` 下模型 `dygraph_model.pdmodel``dygraph_model.pdiparams`,保存结果在 `serving_server``serving_client` 目录。
```python
python3 -m paddle_serving_client.convert --dirname . --model_filename dygraph_model.pdmodel --params_filename dygraph_model.pdiparams --serving_server serving_server --serving_client serving_client
```
<a name="3.2"></a>
**二.代码引入执行**
代码引入执行方式,通过 `import io` 包并调用 `inference_model_to_serving` 实现模型参数保存。
```python
import paddle_serving_client.io as serving_io
serving_io.inference_model_to_serving(dirname, serving_server="serving_server", serving_client="serving_client", model_filename=None, params_filename=None)
```
<a name="4"></a>
PaddlePaddle 2.0提供了全新的动态图模式,因此我们这里以imagenet ResNet50动态图为示例教学如何从已保存模型导出,并用于真实的在线预测场景。
## Serving 部署
生成完的模型可直接用于服务化推理,服务端使用和客户端使用。
<a name="4.1"></a>
**一.服务端部署示例**
示例一:C++ Serving 启动服务
```
wget https://paddle-serving.bj.bcebos.com/others/dygraph_res50.tar #模型
wget https://paddle-serving.bj.bcebos.com/imagenet-example/daisy.jpg #示例输入(向日葵)
tar xf dygraph_res50.tar
python -m paddle_serving_client.convert --dirname . --model_filename dygraph_model.pdmodel --params_filename dygraph_model.pdiparams --serving_server serving_server --serving_client serving_client
python3 -m paddle_serving_server.serve --model serving_server --port 9393 --gpu_id 0
```
我们可以看到`serving_server``serving_client`文件夹分别保存着模型的服务端和客户端配置
启动服务端(GPU)
示例二:Python Pipeline 启动服务,在 `config.yml` 中指定模型路径
```
python -m paddle_serving_server.serve --model serving_server --port 9393 --gpu_id 0
op:
det:
#并发数,is_thread_op=True时,为线程并发;否则为进程并发
concurrency: 6
#当op配置没有server_endpoints时,从local_service_conf读取本地服务配置
local_service_conf:
#client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测
client_type: local_predictor
#det模型路径
model_config: ocr_det_model
#Fetch结果列表,以client_config中fetch_var的alias_name为准
fetch_list: ["save_infer_model/scale_0.tmp_1"]
# device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
device_type: 0
```
客户端写法,保存为`test_client.py`
<a name="4.2"></a>
**二.客户端部署示例**
通过 `client` 对象的 `load_client_config` 接口加载模型配置信息
```
from paddle_serving_client import Client
from paddle_serving_app.reader import Sequential, File2Image, Resize, CenterCrop
......@@ -65,41 +185,4 @@ seq = Sequential([
image_file = "daisy.jpg"
img = seq(image_file)
fetch_map = client.predict(feed={"inputs": img}, fetch=["save_infer_model/scale_0.tmp_0"])
print(fetch_map["save_infer_model/scale_0.tmp_0"].reshape(-1))
```
执行
```
python test_client.py
```
即可看到成功的执行了预测,以上就是动态图ResNet50模型在Serving上预测的内容,其他动态图模型使用方式与之类似。
## 从训练或预测脚本中保存(静态图)
目前,Paddle Serving提供了一个save_model接口供用户访问,该接口与Paddle的`save_inference_model`类似。
``` python
import paddle_serving_client.io as serving_io
serving_io.save_model("imdb_model", "imdb_client_conf",
{"words": data}, {"prediction": prediction},
paddle.static.default_main_program())
```
imdb_model是具有服务配置的服务器端模型。 imdb_client_conf是客户端rpc配置。
Serving有一个提供给用户存放Feed和Fetch变量信息的字典。 在示例中,`{"words":data}` 是用于指定已保存推理模型输入的提要字典。`{"prediction":projection}`是指定保存的推理模型输出的字典。可以为feed和fetch变量定义一个别名。 如何使用别名的例子 示例如下:
``` python
from paddle_serving_client import Client
import sys
client = Client()
client.load_client_config(sys.argv[1])
client.connect(["127.0.0.1:9393"])
for line in sys.stdin:
group = line.strip().split()
words = [int(x) for x in group[1:int(group[0]) + 1]]
label = [int(group[-1])]
feed = {"words": words, "label": label}
fetch = ["acc", "cost", "prediction"]
fetch_map = client.predict(feed=feed, fetch=fetch)
print("{} {}".format(fetch_map["prediction"][1], label[0]))
```
......@@ -8,13 +8,13 @@
- 这个服务接口不够安全,需要做相应的鉴权。
- 这个服务接口不能够控制流量,无法合理利用资源。
本文档的作用,就以 Uci 房价预测服务为例,来介绍如何强化预测服务API接口安全。API网关作为流量入口,对接口进行统一管理。但API网关可以提供流量加密和鉴权等安全功能。
本文档的作用,就以 Uci 房价预测服务为例,来介绍如何强化预测服务 API 接口安全。API 网关作为流量入口,对接口进行统一管理。但 API 网关可以提供流量加密和鉴权等安全功能。
## Docker部署
可以使用docker-compose来部署安全网关。这个示例的步骤就是 [部署本地Serving容器] - [部署本地安全网关] - [通过安全网关访问Serving]
可以使用 docker-compose 来部署安全网关。这个示例的步骤就是 [部署本地Serving容器] - [部署本地安全网关] - [通过安全网关访问Serving]
**注明:** docker-compose与docker不一样,它依赖于docker,一次可以部署多个docker容器,可以类比于本地版的kubenetes,docker-compose的教程请参考[docker-compose安装](https://docs.docker.com/compose/install/)
**注明:** docker-compose 与 docker 不一样,它依赖于 docker,一次可以部署多个 docker 容器,可以类比于本地版的 kubenetes,docker-compose 的教程请参考[docker-compose安装](https://docs.docker.com/compose/install/)
```shell
docker-compose -f tools/auth/auth-serving-docker.yaml up -d
......@@ -30,50 +30,49 @@ ee59a3dd4806 registry.baidubce.com/serving_dev/serving-runtime:cpu-py36
665fd8a34e15 redis:latest "docker-entrypoint.s…" About an hour ago Up About an hour 0.0.0.0:6379->6379/tcp anquan_redis_1
```
其中我们之前serving容器 以 9393端口暴露,KONG网关的端口是8443, KONG的Web控制台的端口是8001。接下来我们在浏览器访问 `https://$IP_ADDR:8005`, 其中 IP_ADDR就是宿主机的IP。
>> **注意**: 第一次登录的时候可能需要输入 Name : admin 以及 Kong Admin URL : http://kong:8001
<img src="images/kong-dashboard.png">
可以看到在注册结束后,登陆,看到了 DASHBOARD,我们先看SERVICES,可以看到`serving_service`,这意味着我们端口在9393的Serving服务已经在KONG当中被注册。
其中我们之前 serving 容器 以 9393 端口暴露,KONG 网关的端口是 8443, KONG 的 Web 控制台的端口是 8001。接下来我们在浏览器访问 `https://$IP_ADDR:8001`, 其中 IP_ADDR 就是宿主机的 IP 。
<img src="images/kong-services.png">
<img src="images/kong-routes.png">
<img src="../images/kong-dashboard.png">
可以看到在注册结束后,登陆,看到了 DASHBOARD,我们先看 SERVICES,可以看到 `serving_service`,这意味着我们端口在 9393 的 Serving 服务已经在 KONG 当中被注册。
然后在ROUTES中,我们可以看到 serving 被链接到了 `/serving-uci`
<img src="../images/kong-services.png">
<img src="../images/kong-routes.png">
最后我们点击 CONSUMERS - default_user - Credentials - API KEYS ,我们可以看到 `Api Keys` 下看到很多key
然后在 ROUTES 中,我们可以看到 serving 被链接到了 `/serving-uci`
<img src="images/kong-api_keys.png">
最后我们点击 CONSUMERS - default_user - Credentials - API KEYS ,我们可以看到 `Api Keys` 下看到很多 key
接下来可以通过curl访问
<img src="../images/kong-api_keys.png">
接下来可以通过 curl 访问
```shell
curl -H "Content-Type:application/json" -H "X-INSTANCE-ID:kong_ins" -H "apikey:hP6v25BQVS5CcS1nqKpxdrFkUxze9JWD" -X POST -d '{"feed":[{"x": [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]}], "fetch":["price"]}' https://127.0.0.1:8443/serving-uci/uci/prediction -k
```
与之前的Serving HTTP服务相比,有以下区别。
与之前的 Serving HTTP 服务相比,有以下区别。
- 使用https加密访问,而不是http
- 使用serving_uci的路径映射到网关
-header处增加了 `X-INSTANCE-ID``apikey`
- 使用 https 加密访问,而不是 http
- 使用 serving_uci 的路径映射到网关
- header 处增加了 `X-INSTANCE-ID``apikey`
## K8S部署
## K8S 部署
同样,我们也提供了K8S集群部署Serving安全网关的方式。
同样,我们也提供了 K8S 集群部署 Serving 安全网关的方式。
### Step 1:启动Serving服务
**一. 启动 Serving 服务**
我们仍然以 [Uci房价预测](../examples/C++/fit_a_line/)服务作为例子,这里省略了镜像制作的过程,详情可以参考 [Kubernetes集群上部署Paddle Serving](./Run_On_Kubernetes_CN.md)
我们仍然以 [Uci房价预测](../examples/C++/fit_a_line/)服务作为例子,这里省略了镜像制作的过程,详情可以参考 [ Kubernetes 集群上部署Paddle Serving](./Run_On_Kubernetes_CN.md)
在这里我们直接执行
```
kubectl apply -f tools/auth/serving-demo-k8s.yaml
```
可以看到
### Step 2: 安装 KONG (一个集群只需要执行一次就可以)
接下来我们执行KONG Ingress的安装
**二. 安装 KONG (一个集群只需要执行一次就可以)**
接下来我们执行 KONG Ingress 的安装
```
kubectl apply -f tools/auth/kong-install.yaml
```
......@@ -106,15 +105,15 @@ kong kong-validation-webhook ClusterIP 172.16.114.93 <none>
```
### Step 3: 创建Ingress资源
**三. 创建 Ingress 资源**
接下来需要做Serving服务和KONG的链接
接下来需要做 Serving 服务和 KONG 的链接
```
kubectl apply -f tools/auth/kong-ingress-k8s.yaml
```
我们也给出yaml文件内容
我们也给出 yaml 文件内容
```
apiVersion: extensions/v1beta1
kind: Ingress
......@@ -132,22 +131,22 @@ spec:
serviceName: {{SERVING_SERVICE_NAME}}
servicePort: {{SERVICE_PORT}}
```
其中serviceName就是uci,servicePort就是9393,如果是别的服务就需要改这两个字段,最终会映射到`/foo`下。
其中 serviceName 就是 uci,servicePort 就是 9393,如果是别的服务就需要改这两个字段,最终会映射到`/foo`下。
在这一步之后,我们就可以
```
curl -H "Content-Type:application/json" -X POST -d '{"feed":[{"x": [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]}], "fetch":["price"]}' http://$IP:$PORT/foo/uci/prediction
```
### Step 4: 增加安全网关限制
**四. 增加安全网关限制**
之前的接口没有鉴权功能,无法验证用户身份合法性,现在我们添加一个key-auth插件
之前的接口没有鉴权功能,无法验证用户身份合法性,现在我们添加一个 key-auth 插件
执行
```
kubectl apply -f key-auth-k8s.yaml
```
其中,yaml文内容为
其中,yaml 文内容为
```
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
......@@ -156,7 +155,7 @@ metadata:
plugin: key-auth
```
现在,需要创建secret,key值为用户指定,需要在请求时携带Header中apikey字段
现在,需要创建 secret,key 值为用户指定,需要在请求时携带 Header 中 apikey 字段
执行
```
kubectl create secret generic default-apikey \
......@@ -164,14 +163,14 @@ kubectl create secret generic default-apikey \
--from-literal=key=ZGVmYXVsdC1hcGlrZXkK
```
在这里,我们的key是随意制定了一串 `ZGVmYXVsdC1hcGlrZXkK`,实际情况也可以
创建一个用户(consumer)标识访问者身份,并未该用户绑定apikey。
在这里,我们的 key 是随意制定了一串 `ZGVmYXVsdC1hcGlrZXkK`,实际情况也可以
创建一个用户(consumer)标识访问者身份,并未该用户绑定 apikey。
执行
```
kubectl apply -f kong-consumer-k8s.yaml
```
其中,yaml文内容为
其中,yaml 文内容为
```
apiVersion: configuration.konghq.com/v1
kind: KongConsumer
......@@ -184,13 +183,13 @@ credentials:
- default-apikey
```
如果我们这时还想再像上一步一样的做curl访问,会发现已经无法访问,此时已经具备了安全能力,我们需要对应的key。
如果我们这时还想再像上一步一样的做 curl 访问,会发现已经无法访问,此时已经具备了安全能力,我们需要对应的 key。
### Step 5: 通过API Key访问服务
**五. 通过 API Key 访问服务**
执行
```
curl -H "Content-Type:application/json" -H "apikey:ZGVmYXVsdC1hcGlrZXkK" -X POST -d '{"feed":[{"x": [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0727, -0.1583, -0.0584, 0.6283, 0.4919, 0.1856, 0.0795, -0.0332]}], "fetch":["price"]}' https://$IP:$PORT/foo/uci/prediction -k
```
我们可以看到 apikey 已经加入到了curl请求的header当中。
我们可以看到 apikey 已经加入到了 curl 请求的 header 当中。
......@@ -84,23 +84,37 @@ workdir_9393
更多启动参数详见下表:
| Argument | Type | Default | Description |
| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
| `thread` | int | `2` | Number of brpc service thread |
| `runtime_thread_num` | int[]| `0` | Thread Number for each model in asynchronous mode |
| `batch_infer_size` | int[]| `32` | Batch Number for each model in asynchronous mode |
| `gpu_ids` | str[]| `"-1"` | Gpu card id for each model |
| `port` | int | `9292` | Exposed port of current service to users |
| `model` | str[]| `""` | Path of paddle model directory to be served |
| `mem_optim_off` | - | - | Disable memory / graphic memory optimization |
| `ir_optim` | bool | False | Enable analysis and optimization of calculation graph |
| `use_mkl` (Only for cpu version) | - | - | Run inference with MKL. Need open with ir_optim. |
| `use_trt` (Only for trt version) | - | - | Run inference with TensorRT. Need open with ir_optim. |
| `use_lite` (Only for Intel x86 CPU or ARM CPU) | - | - | Run PaddleLite inference. Need open with ir_optim. |
| `use_xpu` | - | - | Run PaddleLite inference with Baidu Kunlun XPU. Need open with ir_optim. |
| `precision` | str | FP32 | Precision Mode, support FP32, FP16, INT8 |
| `use_calib` | bool | False | Use TRT int8 calibration |
| `gpu_multi_stream` | bool | False | EnableGpuMultiStream to get larger QPS |
| `use_ascend_cl` | bool | False | Enable for ascend910; Use with use_lite for ascend310 |
| `request_cache_size` | int | `0` | Bytes size of request cache. By default, the cache is disabled |
| `--thread` | int | `2` | Number of brpc service thread |
| `--runtime_thread_num` | int[]| `0` | Thread Number for each model in asynchronous mode |
| `--batch_infer_size` | int[]| `32` | Batch Number for each model in asynchronous mode |
| `--gpu_ids` | str[]| `"-1"` | Gpu card id for each model |
| `--port` | int | `9292` | Exposed port of current service to users |
| `--model` | str[]| `""` | Path of paddle model directory to be served |
| `--mem_optim_off` | - | - | Disable memory / graphic memory optimization |
| `--ir_optim` | bool | False | Enable analysis and optimization of calculation graph |
| `--use_mkl` (Only for cpu version) | - | - | Run inference with MKL. Need open with ir_optim. |
| `--use_trt` (Only for trt version) | - | - | Run inference with TensorRT. Need open with ir_optim. |
| `--use_lite` (Only for Intel x86 CPU or ARM CPU) | - | - | Run PaddleLite inference. Need open with ir_optim. |
| `--use_xpu` | - | - | Run PaddleLite inference with Baidu Kunlun XPU. Need open with ir_optim. |
| `--precision` | str | FP32 | Precision Mode, support FP32, FP16, INT8 |
| `--use_calib` | bool | False | Use TRT int8 calibration |
| `--gpu_multi_stream` | bool | False | EnableGpuMultiStream to get larger QPS |
| `--use_ascend_cl` | bool | False | Enable for ascend910; Use with use_lite for ascend310 |
| `--request_cache_size` | int | `0` | Bytes size of request cache. By default, the cache is disabled |
| `--enable_prometheus` | bool | False | Use Prometheus |
| `--prometheus_port` | int | 19393 | Port of the Prometheus |
| `--use_dist_model | bool | False | Use distributed model or not |
| `--dist_carrier_id` | str | "" | Carrier id of distributed model |
| `--dist_cfg_file` | str | "" | Config file of distributed model |
| `--dist_endpoints` | str | "" | Endpoints of distributed model. splited by comma |
| `--dist_nranks` | int | 0 | The number of rank in the distributed model|
| `--dist_subgraph_index` | int | -1 | The subgraph index of distributed model|
| `--dist_master_serving` | bool | False | The master serving of distributed inference |
| `--min_subgraph_size` | str | "" | The min size of subgraph |
| `--gpu_memory_mb` | int | 50 | Initially allocate GPU storage size, 50 MB default.|
| `--cpu_math_thread_num` | int | 1 | Initialize the number of CPU computing threads|
| `--trt_workspace_size` | int | 33554432| Initialize allocation 1 << 25 GPU storage size for tensorRT|
| `--trt_use_static` | bool | False | Initialize TRT with static data|
#### 当您的某个模型想使用多张GPU卡部署时.
```BASH
......
......@@ -83,23 +83,38 @@ workdir_9393
More flags:
| Argument | Type | Default | Description |
| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
| `thread` | int | `2` | Number of brpc service thread |
| `runtime_thread_num` | int[]| `0` | Thread Number for each model in asynchronous mode |
| `batch_infer_size` | int[]| `32` | Batch Number for each model in asynchronous mode |
| `gpu_ids` | str[]| `"-1"` | Gpu card id for each model |
| `port` | int | `9292` | Exposed port of current service to users |
| `model` | str[]| `""` | Path of paddle model directory to be served |
| `mem_optim_off` | - | - | Disable memory / graphic memory optimization |
| `ir_optim` | bool | False | Enable analysis and optimization of calculation graph |
| `use_mkl` (Only for cpu version) | - | - | Run inference with MKL. Need open with ir_optim. |
| `use_trt` (Only for trt version) | - | - | Run inference with TensorRT. Need open with ir_optim. |
| `use_lite` (Only for Intel x86 CPU or ARM CPU) | - | - | Run PaddleLite inference. Need open with ir_optim. |
| `use_xpu` | - | - | Run PaddleLite inference with Baidu Kunlun XPU. Need open with ir_optim. |
| `precision` | str | FP32 | Precision Mode, support FP32, FP16, INT8 |
| `use_calib` | bool | False | Use TRT int8 calibration |
| `gpu_multi_stream` | bool | False | EnableGpuMultiStream to get larger QPS |
| `use_ascend_cl` | bool | False | Enable for ascend910; Use with use_lite for ascend310 |
| `request_cache_size` | int | `0` | Bytes size of request cache. By default, the cache is disabled |
| `--thread` | int | `2` | Number of brpc service thread |
| `--runtime_thread_num` | int[]| `0` | Thread Number for each model in asynchronous mode |
| `--batch_infer_size` | int[]| `32` | Batch Number for each model in asynchronous mode |
| `--gpu_ids` | str[]| `"-1"` | Gpu card id for each model |
| `--port` | int | `9292` | Exposed port of current service to users |
| `--model` | str[]| `""` | Path of paddle model directory to be served |
| `--mem_optim_off` | - | - | Disable memory / graphic memory optimization |
| `--ir_optim` | bool | False | Enable analysis and optimization of calculation graph |
| `--use_mkl` (Only for cpu version) | - | - | Run inference with MKL. Need open with ir_optim. |
| `--use_trt` (Only for trt version) | - | - | Run inference with TensorRT. Need open with ir_optim. |
| `--use_lite` (Only for Intel x86 CPU or ARM CPU) | - | - | Run PaddleLite inference. Need open with ir_optim. |
| `--use_xpu` | - | - | Run PaddleLite inference with Baidu Kunlun XPU. Need open with ir_optim. |
| `--precision` | str | FP32 | Precision Mode, support FP32, FP16, INT8 |
| `--use_calib` | bool | False | Use TRT int8 calibration |
| `--gpu_multi_stream` | bool | False | EnableGpuMultiStream to get larger QPS |
| `--use_ascend_cl` | bool | False | Enable for ascend910; Use with use_lite for ascend310 |
| `--request_cache_size` | int | `0` | Bytes size of request cache. By default, the cache is disabled |
| `--enable_prometheus` | bool | False | Use Prometheus |
| `--prometheus_port` | int | 19393 | Port of the Prometheus |
| `--use_dist_model | bool | False | Use distributed model or not |
| `--dist_carrier_id` | str | "" | Carrier id of distributed model |
| `--dist_cfg_file` | str | "" | Config file of distributed model |
| `--dist_endpoints` | str | "" | Endpoints of distributed model. splited by comma |
| `--dist_nranks` | int | 0 | The number of rank in the distributed model|
| `--dist_subgraph_index` | int | -1 | The subgraph index of distributed model|
| `--dist_master_serving` | bool | False | The master serving of distributed inference |
| `--min_subgraph_size` | str | "" | The min size of subgraph |
| `--gpu_memory_mb` | int | 50 | Initially allocate GPU storage size, 50 MB default.|
| `--cpu_math_thread_num` | int | 1 | Initialize the number of CPU computing threads|
| `--trt_workspace_size` | int | 33554432| Initialize allocation 1 << 25 GPU storage size for tensorRT|
| `--trt_use_static` | bool | False | Initialize TRT with static data|
#### Serving model with multiple gpus.
```BASH
......
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册