Unverified · Commit 39eb8b22 authored by huangjianhui, committed by GitHub

Merge pull request #10 from PaddlePaddle/develop

Develop
......@@ -30,13 +30,14 @@ The goal of Paddle Serving is to provide high-performance, flexible and easy-to-
- Integrate the high-performance server-side inference engine Paddle Inference and the mobile-side engine Paddle Lite. Models from other machine learning platforms (Caffe/TensorFlow/ONNX/PyTorch) can be migrated to Paddle through [x2paddle](https://github.com/PaddlePaddle/X2Paddle).
- Offer two frameworks: the high-performance C++ Serving and the easy-to-use Python Pipeline. C++ Serving is based on the bRPC network framework and delivers high-throughput, low-latency inference services, with performance ahead of competing products. Python Pipeline is based on the gRPC/gRPC-Gateway network framework and the Python language, and provides a highly usable, high-throughput inference service. For guidance on choosing between them, see [Technical Selection](doc/Serving_Design_EN.md#21-design-selection).
- Support multiple [protocols](doc/C++_Serving/Inference_Protocols_CN.md) such as HTTP, gRPC, and bRPC, and provide C++, Python, and Java SDKs.
- Design and implement a high-performance asynchronous pipeline inference service framework based on directed acyclic graphs (DAG), with features such as multi-model composition, asynchronous scheduling, concurrent inference, dynamic batching, multi-card multi-stream inference, etc.
- Design and implement a high-performance asynchronous pipeline inference service framework based on directed acyclic graphs (DAG), with features such as multi-model composition, asynchronous scheduling, concurrent inference, dynamic batching, multi-card multi-stream inference, request caching, etc.
- Adapt to a variety of commonly used computing hardware, such as x86 (Intel) CPU, ARM CPU, Nvidia GPU, Kunlun XPU, HUAWEI Ascend 310/910, HYGON DCU, Nvidia Jetson, etc.
- Integrate the Intel MKLDNN and Nvidia TensorRT acceleration libraries, and support low-precision and quantized inference.
- Provide a secure model deployment solution, including encrypted model deployment, an authentication mechanism, and an HTTPS security gateway, all used in real projects.
- Support cloud deployment, with a deployment case for Baidu Intelligent Cloud Kubernetes clusters.
- Provide more than 40 classic pre-trained model deployment examples from suites such as PaddleOCR, PaddleClas, PaddleDetection, PaddleSeg, PaddleNLP, and PaddleRec, with more models continuously being added.
- Support distributed deployment of large-scale sparse parameter index models, with features such as multiple tables, multiple shards, multiple replicas, local high-frequency cache, etc.; deployable on a single machine or in the cloud.
- Support service monitoring, providing Prometheus-based performance statistics and a metrics access port
<h2 align="center">Tutorial</h2>
......@@ -62,7 +63,6 @@ This chapter guides you through the installation and deployment steps. It is str
- [Deploy Paddle Serving on Kubernetes(Chinese)](doc/Run_On_Kubernetes_CN.md)
- [Deploy Paddle Serving with Security gateway(Chinese)](doc/Serving_Auth_Docker_CN.md)
- Deploy on more hardware [[Baidu Kunlun](doc/Run_On_XPU_CN.md)、[Huawei Ascend](doc/Run_On_NPU_CN.md)、[Hygon DCU](doc/Run_On_DCU_CN.md)、[Jetson](doc/Run_On_JETSON_CN.md)]
- [Docker Images(Chinese)](doc/Docker_Images_CN.md)
- [Docker Images](doc/Docker_Images_EN.md)
- [Latest Wheel packages](doc/Latest_Packages_CN.md)
......@@ -76,6 +76,7 @@ The first step is to call the model save interface to generate a model parameter
- [Guide for RESTful/gRPC/bRPC APIs(Chinese)](doc/C++_Serving/Introduction_CN.md#42-多语言多协议Client)
- [Inference on quantized models](doc/Low_Precision_EN.md)
- [Data format of classic models(Chinese)](doc/Process_data_CN.md)
- [Prometheus(Chinese)](doc/Prometheus_CN.md)
- [C++ Serving(Chinese)](doc/C++_Serving/Introduction_CN.md)
- [Protocols(Chinese)](doc/C++_Serving/Inference_Protocols_CN.md)
- [Hot loading models](doc/C++_Serving/Hot_Loading_EN.md)
......@@ -84,8 +85,10 @@ The first step is to call the model save interface to generate a model parameter
- [Analyze and optimize performance(Chinese)](doc/C++_Serving/Performance_Tuning_CN.md)
- [Benchmark(Chinese)](doc/C++_Serving/Benchmark_CN.md)
- [Multiple models in series(Chinese)](doc/C++_Serving/2+_model.md)
- [Request Cache(Chinese)](doc/C++_Serving/Request_Cache_CN.md)
- [Python Pipeline](doc/Python_Pipeline/Pipeline_Design_EN.md)
- [Analyze and optimize performance](doc/Python_Pipeline/Performance_Tuning_EN.md)
- [TensorRT dynamic Shape](doc/TensorRT_Dynamic_Shape_EN.md)
- [Benchmark(Chinese)](doc/Python_Pipeline/Benchmark_CN.md)
- Client SDK
- [Python SDK(Chinese)](doc/C++_Serving/Introduction_CN.md#42-多语言多协议Client)
......@@ -105,13 +108,13 @@ For Paddle Serving developers, we provide extended documents such as custom OP,
<h2 align="center">Model Zoo</h2>
Paddle Serving works closely with the Paddle model suites and implements a large number of service deployment examples, including image classification, object detection, language and text recognition, Chinese part-of-speech tagging, sentiment analysis, content recommendation, and other types of examples, for a total of 42 models.
Paddle Serving works closely with the Paddle model suites and implements a large number of service deployment examples, including image classification, object detection, language and text recognition, Chinese part-of-speech tagging, sentiment analysis, content recommendation, and other types of examples, for a total of 45 models.
<p align="center">
| PaddleOCR | PaddleDetection | PaddleClas | PaddleSeg | PaddleRec | Paddle NLP |
| :----: | :----: | :----: | :----: | :----: | :----: |
| 8 | 12 | 13 | 2 | 3 | 4 |
| 8 | 12 | 14 | 2 | 3 | 6 |
</p>
......@@ -151,6 +154,8 @@ If you want to contribute code to Paddle Serving, please reference [Contribution
- Thanks to [@mcl-stone](https://github.com/mcl-stone) for updating the faster_rcnn benchmark
- Thanks to [@cg82616424](https://github.com/cg82616424) for updating the unet benchmark and fixing the resize comment error
- Thanks to [@cuicheng01](https://github.com/cuicheng01) for providing 11 PaddleClas models
- Thanks to [@Jiaqi Liu](https://github.com/LiuChiachi) for supporting prediction for string list input
- Thanks to [@Bin Lu](https://github.com/Intsigstephon) for adding pp-shitu example
> Feedback
......
......@@ -29,13 +29,14 @@ Relying on the deep learning framework PaddlePaddle, Paddle Serving aims to help deep learning developers
- Integrate the high-performance server-side inference engine Paddle Inference and the mobile-side engine Paddle Lite; models from other machine learning platforms (Caffe/TensorFlow/ONNX/PyTorch) can be migrated with the [x2paddle](https://github.com/PaddlePaddle/X2Paddle) tool
- Offer two frameworks: high-performance C++ and easy-to-use Python. The C++ framework is built on the high-performance bRPC network framework to deliver high-throughput, low-latency inference services, with performance ahead of competing products. The Python framework is built on the gRPC/gRPC-Gateway network framework and the Python language to provide a highly usable, high-throughput inference service framework. See [Technical Selection](doc/Serving_Design_CN.md#21-设计选型) for guidance
- Support multiple [protocols](doc/C++_Serving/Inference_Protocols_CN.md) such as HTTP, gRPC, and bRPC; provide C++, Python, and Java SDKs
- Design and implement a high-performance asynchronous pipeline inference framework based on directed acyclic graphs (DAG), with features such as multi-model composition, asynchronous scheduling, concurrent inference, dynamic batching, and multi-card multi-stream inference
- Design and implement a high-performance asynchronous pipeline inference framework based on directed acyclic graphs (DAG), with features such as multi-model composition, asynchronous scheduling, concurrent inference, dynamic batching, multi-card multi-stream inference, and request caching
- Adapt to a variety of hardware such as x86 (Intel) CPU, ARM CPU, Nvidia GPU, Kunlun XPU, Huawei Ascend 310/910, Hygon DCU, and Nvidia Jetson
- Integrate the Intel MKLDNN and Nvidia TensorRT acceleration libraries, as well as low-precision and quantized inference
- Provide a secure model deployment solution, including encrypted model deployment, authentication, and an HTTPS security gateway, applied in real projects
- Support cloud deployment, with a Paddle Serving deployment case on Baidu Intelligent Cloud Kubernetes clusters
- Provide a rich set of classic pre-trained model deployment examples from suites such as PaddleOCR, PaddleClas, PaddleDetection, PaddleSeg, PaddleNLP, and PaddleRec, for a total of more than 40 high-quality pre-trained models
- Support distributed deployment of large-scale sparse parameter index models, with features such as multiple tables, multiple shards, multiple replicas, and local high-frequency cache; deployable on a single machine or in the cloud
- Support service monitoring, providing Prometheus-based performance statistics and a metrics access port
<h2 align="center">教程</h2>
......@@ -70,6 +71,7 @@ Relying on the deep learning framework PaddlePaddle, Paddle Serving aims to help deep learning developers
- [Guide for RESTful/gRPC/bRPC APIs](doc/C++_Serving/Introduction_CN.md#42-多语言多协议Client)
- [Low-precision inference](doc/Low_Precision_CN.md)
- [Data processing for common models](doc/Process_data_CN.md)
- [Prometheus](doc/Prometheus_CN.md)
- [Introduction to C++ Serving](doc/C++_Serving/Introduction_CN.md)
- [Protocols](doc/C++_Serving/Inference_Protocols_CN.md)
- [Hot loading models](doc/C++_Serving/Hot_Loading_CN.md)
......@@ -78,8 +80,10 @@ Relying on the deep learning framework PaddlePaddle, Paddle Serving aims to help deep learning developers
- [Performance tuning guide](doc/C++_Serving/Performance_Tuning_CN.md)
- [Benchmarks](doc/C++_Serving/Benchmark_CN.md)
- [Multiple models in series](doc/C++_Serving/2+_model.md)
- [Request cache](doc/C++_Serving/Request_Cache_CN.md)
- [Python Pipeline design](doc/Python_Pipeline/Pipeline_Design_CN.md)
- [Performance tuning guide](doc/Python_Pipeline/Performance_Tuning_CN.md)
- [TensorRT dynamic shape](doc/TensorRT_Dynamic_Shape_CN.md)
- [Benchmarks](doc/Python_Pipeline/Benchmark_CN.md)
- Client SDK
- [Python SDK](doc/C++_Serving/Introduction_CN.md#42-多语言多协议Client)
......@@ -96,13 +100,13 @@ Relying on the deep learning framework PaddlePaddle, Paddle Serving aims to help deep learning developers
<h2 align="center">模型库</h2>
Paddle Serving works closely with the Paddle model suites and implements a large number of service deployments, including image classification, object detection, language and text recognition, Chinese part-of-speech tagging, sentiment analysis, content recommendation, and other types of examples, as well as end-to-end Paddle pipeline projects, for a total of 42 models.
Paddle Serving works closely with the Paddle model suites and implements a large number of service deployments, including image classification, object detection, language and text recognition, Chinese part-of-speech tagging, sentiment analysis, content recommendation, and other types of examples, as well as end-to-end Paddle pipeline projects, for a total of 45 models.
<p align="center">
| PaddleOCR | PaddleDetection | PaddleClas | PaddleSeg | PaddleRec | Paddle NLP |
| :----: | :----: | :----: | :----: | :----: | :----: |
| 8 | 12 | 13 | 2 | 3 | 4 |
| 8 | 12 | 14 | 2 | 3 | 6 |
</p>
......@@ -142,6 +146,8 @@ Paddle Serving works closely with the Paddle model suites to implement a large number of service deployments,
- Thanks to [@mcl-stone](https://github.com/mcl-stone) for providing the faster rcnn benchmark script
- Thanks to [@cg82616424](https://github.com/cg82616424) for providing the unet benchmark script and fixing some comment errors
- Thanks to [@cuicheng01](https://github.com/cuicheng01) for providing 11 PaddleClas models
- Thanks to [@Jiaqi Liu](https://github.com/LiuChiachi) for adding prediction support for list[str] inputs
- Thanks to [@Bin Lu](https://github.com/Intsigstephon) for providing the PP-Shitu C++ model example
> Feedback
......
......@@ -220,4 +220,4 @@ python3 自定义.py ocr_det_client ocr_rec_client
#ocr_det_client is the relative path to the client-side proto folder of the first model
#ocr_rec_client is the relative path to the client-side proto folder of the second model
```
At this point, for the server side, the input data format is the one defined by the `first model's client-side proto`, and the output data format matches the `last model's client-side proto` file. In general you do not need to worry about this; for the detailed [proto definitions, please refer here](./Serving_Configure_CN.md)
At this point, for the server side, the input data format is the one defined by the `first model's client-side proto`, and the output data format matches the `last model's client-side proto` file. In general you do not need to worry about this; for the detailed [proto definitions, please refer here](../Serving_Configure_CN.md)
......@@ -40,7 +40,7 @@ op_seq_maker.add_op(general_infer_op)
op_seq_maker.add_op(general_response_op)
```
If you start the C++ Server using `the command line + configuration file approach`, you only need to [modify the configuration file]((./Serving_Configure_CN.md)); there is no need to modify the code above.
If you start the C++ Server using `the command line + configuration file approach`, you only need to [modify the configuration file](../Serving_Configure_CN.md); there is no need to modify the code above.
For simple serial logic, we simplify it as a `Sequence` and build it with `OpSeqMaker`. Users do not need to specify the predecessor of each node; by default, predecessors are determined by the order in which nodes are added to `OpSeqMaker`.
......
......@@ -39,7 +39,7 @@ op_seq_maker.add_op(general_infer_op)
op_seq_maker.add_op(general_response_op)
```
If you start the C++ server using `the command line + configuration file method`, you only need to modify [the configuration file](./Serving_Configure_CN.md); there is no need to change any line of the code above.
If you start the C++ server using `the command line + configuration file method`, you only need to modify [the configuration file](../Serving_Configure_CN.md); there is no need to change any line of the code above.
For simple serial logic, we simplify it as a `Sequence` and build it with `OpSeqMaker`. You can leave the predecessor of each node unspecified; by default, predecessors are determined by the order in which nodes are added to `OpSeqMaker`.
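For orientation, here is a minimal sketch of how such a `Sequence` server can be assembled with `OpSeqMaker` in Python; the op names, the model directory, and the port below follow the standard `paddle_serving_server` examples and are assumptions, not taken from this page.

```python
from paddle_serving_server import OpMaker, OpSeqMaker, Server

# create the three standard ops; predecessors are implied by insertion order
op_maker = OpMaker()
read_op = op_maker.create('general_reader')
general_infer_op = op_maker.create('general_infer')
general_response_op = op_maker.create('general_response')

op_seq_maker = OpSeqMaker()
op_seq_maker.add_op(read_op)
op_seq_maker.add_op(general_infer_op)
op_seq_maker.add_op(general_response_op)

# build and start the server with the sequence defined above
server = Server()
server.set_op_sequence(op_seq_maker.get_op_sequence())
server.load_model_config("serving_server")          # assumed model config directory
server.prepare_server(workdir="workdir", port=9292, device="cpu")
server.run_server()
```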
......
......@@ -80,4 +80,9 @@ registry.baidubce.com/paddlepaddle/serving:xpu-x86 # for x86 xpu user
Runtime images are lighter than development images: they provide the Serving whl and bin, but, to keep the runtime image size small, they do not include development tools such as cmake. For more information, please check the document [Paddle Serving on Kubernetes](./Run_On_Kubernetes_CN.md)
| Env | Version | Docker images tag | OS | Gcc Version | Size |
|----------|---------|------------------------------|-----------|-------------|------|
| CPU | 0.8.0 | 0.8.0-runtime | Ubuntu 16 | 8.2.0 | 3.9 GB |
| Cuda10.1 | 0.8.0 | 0.8.0-cuda10.1-cudnn7-runtime | Ubuntu 16 | 8.2.0 | 10 GB |
| Cuda10.2 | 0.8.0 | 0.8.0-cuda10.2-cudnn8-runtime | Ubuntu 16 | 8.2.0 | 10.1 GB |
| Cuda11.2 | 0.8.0 | 0.8.0-cuda11.2-cudnn8-runtime | Ubuntu 16 | 8.2.0 | 14.2 GB |
......@@ -83,4 +83,9 @@ Running Images:
Runtime images are lighter than development images: they contain the Serving whl and bin, but not development tools such as cmake, in order to keep the image size down. If you want to know more about this, please check the document [Paddle Serving on Kubernetes](./Run_On_Kubernetes_CN.md). A usage sketch follows the table below.
| Env | Version | Docker images tag | OS | Gcc Version | Size |
|----------|---------|------------------------------|-----------|-------------|------|
| CPU | 0.8.0 | 0.8.0-runtime | Ubuntu 16 | 8.2.0 | 3.9 GB |
| Cuda10.1 | 0.8.0 | 0.8.0-cuda10.1-cudnn7-runtime | Ubuntu 16 | 8.2.0 | 10 GB |
| Cuda10.2 | 0.8.0 | 0.8.0-cuda10.2-cudnn8-runtime | Ubuntu 16 | 8.2.0 | 10.1 GB |
| Cuda11.2 | 0.8.0 | 0.8.0-cuda11.2-cudnn8-runtime | Ubuntu 16 | 8.2.0 | 14.2 GB |
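As a rough usage sketch only: it assumes the `registry.baidubce.com/paddlepaddle/serving` registry path shown earlier on this page, a tag taken from the table above, and an installed NVIDIA container toolkit for the `--gpus` flag.

``` shell
# pull a runtime image; pick the tag matching your CUDA version from the table above
docker pull registry.baidubce.com/paddlepaddle/serving:0.8.0-cuda10.2-cudnn8-runtime

# start an interactive container (assumes nvidia-container-toolkit for --gpus)
docker run --gpus all -it --rm \
    registry.baidubce.com/paddlepaddle/serving:0.8.0-cuda10.2-cudnn8-runtime bash
```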
......@@ -7,7 +7,7 @@ curl http://localhost:19393/metrics
## Configuration and usage
### C+ Server
### C++ Server
For the C++ Server, add the following parameters when starting the service
......
......@@ -23,25 +23,8 @@ Paddle Serving provides users with HTTP- and RPC-based services
``` shell
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292
```
<center>
| Argument | Type | Default | Description |
| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
| `thread` | int | `2` | Number of brpc service threads |
| `op_num` | int[]| `0` | Thread Number for each model in asynchronous mode |
| `op_max_batch` | int[]| `32` | Batch Number for each model in asynchronous mode |
| `gpu_ids` | str[]| `"-1"` | Gpu card id for each model |
| `port` | int | `9292` | Exposed port of current service to users |
| `model` | str[]| `""` | Path of paddle model directory to be served |
| `mem_optim_off` | - | - | Disable memory / graphic memory optimization |
| `ir_optim` | bool | False | Enable analysis and optimization of calculation graph |
| `use_mkl` (Only for cpu version) | - | - | Run inference with MKL |
| `use_trt` (Only for trt version) | - | - | Run inference with TensorRT |
| `use_lite` (Only for Intel x86 CPU or ARM CPU) | - | - | Run PaddleLite inference |
| `use_xpu` | - | - | Run PaddleLite inference with Baidu Kunlun XPU |
| `precision` | str | FP32 | Precision Mode, support FP32, FP16, INT8 |
| `use_calib` | bool | False | Use TRT int8 calibration |
| `gpu_multi_stream` | bool | False | Enable GPU multi-stream to get larger QPS |
For the complete list of parameters, see the document [Serving Configuration](Serving_Configure_EN.md#c-serving)
#### Notes on asynchronous mode
Asynchronous mode is suitable for (1) cases where the number of requests is very large, and (2) multi-model pipelines where you want to specify the concurrency of each model separately.
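As an illustration only (not taken from this document), an asynchronous single-model deployment built from the parameters in the table above might look like the following; the model directory and values are placeholders.

``` shell
# hypothetical asynchronous deployment: 8 worker threads for the model,
# batches of up to 32 requests, running on GPU card 0
python3 -m paddle_serving_server.serve \
    --model uci_housing_model \
    --op_num 8 \
    --op_max_batch 32 \
    --gpu_ids 0 \
    --port 9292
```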
......
......@@ -20,23 +20,8 @@ A user can also start an RPC service with `paddle_serving_server.serve`. RPC serv
``` shell
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292
```
<center>
| Argument | Type | Default | Description |
| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
| `thread` | int | `4` | Concurrency of current service |
| `port` | int | `9292` | Exposed port of current service to users |
| `model` | str | `""` | Path of paddle model directory to be served |
| `mem_optim_off` | - | - | Disable memory / graphic memory optimization |
| `ir_optim` | bool | False | Enable analysis and optimization of calculation graph |
| `use_mkl` (Only for cpu version) | - | - | Run inference with MKL |
| `use_trt` (Only for trt version) | - | - | Run inference with TensorRT |
| `use_lite` (Only for Intel x86 CPU or ARM CPU) | - | - | Run PaddleLite inference |
| `use_xpu` | - | - | Run PaddleLite inference with Baidu Kunlun XPU |
| `precision` | str | FP32 | Precision Mode, support FP32, FP16, INT8 |
| `use_calib` | bool | False | Only for deployment with TensorRT |
</center>
For a complete list of parameters, see the document [Serving Configuration](Serving_Configure_CN.md#c-serving)
```python
# A user can visit rpc service through paddle_serving_client API
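# The lines below are a minimal illustrative sketch, not part of the original
# document: they assume the uci_housing quickstart model served above, and the
# feed/fetch names ("x", "price") follow that example and may differ for other models.
from paddle_serving_client import Client

client = Client()
client.load_client_config("uci_housing_client/serving_client_conf.prototxt")
client.connect(["127.0.0.1:9292"])

# one sample with the 13 normalized features of the Boston housing dataset
data = [0.0137, -0.1136, 0.2553, -0.0692, 0.0582, -0.0501, -0.0684,
        -0.0663, -0.0664, -0.0657, -0.0573, 0.0401, -0.0203]
fetch_map = client.predict(feed={"x": data}, fetch=["price"])
print(fetch_map)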
......
......@@ -4,7 +4,7 @@ Paddle Serving supports inference deployment on JETSON. Currently only the Pipeline mode
### Install PaddlePaddle
You can refer to the [NV Jetson deployment example](https://paddleinference.paddlepaddle.org.cn/demo_tutorial/cuda_jetson_demo.html)to install the Python version of paddlepaddle
You can refer to the [NV Jetson deployment example](https://paddleinference.paddlepaddle.org.cn/demo_tutorial/cuda_jetson_demo.html) to install the Python version of paddlepaddle
### Install PaddleServing
......
......@@ -23,7 +23,7 @@ serving_io.inference_model_to_serving(dirname, serving_server="serving_server",
| `model_filename` | str | None | Name of the file that stores the Inference Program structure of the model to be converted. If set to None, `__model__` is used as the default filename |
| `params_filename` | str | None | Name of the file that stores all parameters of the model to be converted. It needs to be specified if and only if all model parameters are saved in a single binary file; if the parameters are stored in separate files, set it to None |
**Example: export from a dynamic graph model**
### Export from a dynamic graph model
PaddlePaddle 2.0 provides a brand-new dynamic graph mode, so here we use the ImageNet ResNet50 dynamic graph model as an example to show how to export from a saved model and use it in real online inference scenarios.
......
......@@ -23,7 +23,7 @@ Arguments are the same as `inference_model_to_serving` API.
| `model_filename` | str | None | The name of file to load the inference program. If it is None, the default filename `__model__` will be used. |
| `params_filename` | str | None | The name of file to load all parameters. It is only used for the case that all parameters were saved in a single binary file. If parameters were saved in separate files, set it as None. |
**Demo: Convert From Dynamic Graph**
### Convert From Dynamic Graph
PaddlePaddle 2.0 provides a new dynamic graph mode, so here we use the ImageNet ResNet50 dynamic graph model as an example to show how to export a saved model and use it in real online inference scenarios.
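For reference, a minimal sketch of the conversion call documented above; the directory names are placeholders, and the `serving_client` keyword is assumed rather than taken from this page.

```python
import paddle_serving_client.io as serving_io

# convert a saved inference model into server- and client-side configurations;
# "resnet50_infer" is a placeholder for the directory of the saved model
serving_io.inference_model_to_serving(
    dirname="resnet50_infer",
    serving_server="serving_server",   # output dir: server-side model and config
    serving_client="serving_client",   # output dir: client-side config (assumed keyword)
    model_filename=None,               # defaults to __model__
    params_filename=None)              # set only when all params are in one binary file
```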
......
# Serving Configuration
# Serving配置
(简体中文|[English](./Serving_Configure_EN.md))
......