Merge branch 'develop' into develop

b7319f0c · TeslaZhao · c6fb227c
7 changed files
--- a/doc/C++_Serving/Performance_Tuning_CN.md
+++ b/doc/C++_Serving/Performance_Tuning_CN.md
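@@ -58,3 +58,17 @@ The server-side <mark>**thread count N**</mark> should be set by weighing three factors together
 ## 4.3 Example
 See the `C++ OCR Service` section of [examples/C++/PaddleOCR/ocr/README_CN.md](../../examples/C++/PaddleOCR/ocr/README_CN.md) and the examples in [Ensemble Prediction in Paddle Serving](./Model_Ensemble_CN.md).
+# 5. Request Cache
+When <mark>**your workload contains many repeated requests**</mark>, consider using the C++ Serving [Request Cache](./Request_Cache_CN.md) to improve service performance.
+## 5.1 Advantages
+The service can cache request results, storing the request data and its result as a key-value pair. When a repeated request arrives, the result is looked up by the request data and returned directly from the cache, without running model inference or other processing (the time taken depends on the size of the request data and is on the order of milliseconds).
+## 5.2 Disadvantages
+1) Extra system memory is needed to hold the cached results; the cache size can be configured through a startup parameter.
+2) For requests that miss the cache, extra time is spent looking up the cache by request data (latency increases by about 1%).
+## 5.3 Example
+See the usage instructions in [Request Cache](./Request_Cache_CN.md).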
\ No newline at end of file
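The key-value caching idea described in section 5 of the Performance_Tuning_CN.md patch above can be illustrated with a minimal Python sketch. This is not the C++ Serving implementation: the class name `RequestCache`, the SHA-256 key, and the LRU eviction policy are all assumptions made for illustration, and `fake_predict` merely stands in for model inference.

```python
import hashlib
from collections import OrderedDict


class RequestCache:
    """Hypothetical LRU cache mapping request bytes to a stored result."""

    def __init__(self, capacity):
        self.capacity = capacity      # maximum number of cached results
        self._store = OrderedDict()   # key -> result, least recently used first
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(request_bytes):
        # Hash the payload so that large requests still yield small keys.
        return hashlib.sha256(request_bytes).hexdigest()

    def get_or_compute(self, request_bytes, predict):
        key = self._key(request_bytes)
        if key in self._store:
            self.hits += 1
            self._store.move_to_end(key)     # mark as most recently used
            return self._store[key]
        self.misses += 1
        result = predict(request_bytes)      # the expensive model call
        self._store[key] = result
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used entry
        return result


calls = []


def fake_predict(data):
    # Stand-in for model inference; records each real invocation.
    calls.append(data)
    return len(data)


cache = RequestCache(capacity=2)
first = cache.get_or_compute(b"image-A", fake_predict)
second = cache.get_or_compute(b"image-A", fake_predict)  # served from the cache
```

The second call never reaches `fake_predict`, which is the saving the patch describes; the hashing step on every request is the extra lookup cost paid on a miss.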
--- a/doc/Prometheus_CN.md
+++ b/doc/Prometheus_CN.md
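+## Monitoring Paddle Serving with Prometheus
+Paddle Serving supports monitoring performance data with Prometheus. The default endpoint is `http://localhost:19393/metrics`. The data is served as plain text, which you can inspect directly with the following command: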
+```
+curl http://localhost:19393/metrics
+```
+## Configuration
+### C++ Server
+For the C++ Server, add the following parameters when starting the service:
+| Parameter | Description | Notes |
+| :------- | :-------------------------- | :--------------------------------------------------------------- |
+| enable_prometheus | Enable Prometheus | Turns on the Prometheus feature |
+| prometheus_port | Prometheus data port | Defaults to 19393 |
+### Python Pipeline
+For the Python Pipeline, add the following parameters to the configuration file config.yml when starting the service:
+```
+dag:
+    #Enable Prometheus
+    enable_prometheus: True
+    #Configure the Prometheus data port
+    prometheus_port: 19393
+```
+### Metric Types
+The monitored metrics are listed in the table below:
+| Metric                                         | Frequency   | Description                                           |
+| ---------------------------------------------- | ----------- | ----------------------------------------------------- |
+| `pd_query_request_success_total`               | Per request | Number of successful query requests                         |
+| `pd_query_request_failure_total`               | Per request | Number of failed query requests     |
+| `pd_inference_count_total`                     | Per request | Number of inferences performed      |
+| `pd_query_request_duration_us_total`           | Per request | Cumulative end-to-end query request handling time                            |
+| `pd_inference_duration_us_total`               | Per request | Cumulative time requests spend executing the inference model               |
+## Monitoring Example
+A simple example of monitoring a service with Prometheus is given below.
+**1. Pull the images**
+```
+docker pull prom/node-exporter
+docker pull prom/prometheus
+```
+**2. Run the image**
+```
+docker run -d -p 9100:9100 \
+  -v "/proc:/host/proc:ro" \
+  -v "/sys:/host/sys:ro" \
+  -v "/:/rootfs:ro" \
+  --net="host" \
+  prom/node-exporter
+```
+**3. Configure**
+Edit the monitoring service's configuration file /opt/prometheus/prometheus.yml and add the information for the nodes to be monitored:
+```
+global:
+  scrape_interval:     60s
+  evaluation_interval: 60s
+scrape_configs:
+  - job_name: prometheus
+    static_configs:
+      - targets: ['localhost:9090']
+        labels:
+          instance: prometheus
+  - job_name: linux
+    static_configs:
+      - targets: ['$IP:9100']
+        labels:
+          instance: localhost
+```
+**4. Start the monitoring service**
+```
+docker run  -d \
+  -p 9090:9090 \
+  -v /opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml  \
+  prom/prometheus
+```
+Then visit `http://serverip:9090/graph`.
\ No newline at end of file
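The cumulative counters listed in the Prometheus_CN.md patch above become more useful once turned into averages. The sketch below parses text in the Prometheus exposition format and derives mean latencies. It rests on several assumptions: the sample values are invented, the lines carry no timestamps, and dividing the request duration by the success count presumes the duration counter covers successful requests only, which the patch does not state. The helper names `parse_metrics` and `average_us` are hypothetical.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition lines into {metric_name: value}.

    Simplifications: comment lines are skipped, labels are stripped, and
    samples that share a name are summed. Timestamps are not supported.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_part, _, value_part = line.rpartition(" ")
        name = name_part.split("{", 1)[0]
        values[name] = values.get(name, 0.0) + float(value_part)
    return values


def average_us(metrics, duration_key, count_key):
    """Return cumulative microseconds divided by the event count, or None."""
    count = metrics.get(count_key, 0.0)
    return metrics[duration_key] / count if count else None


# Invented sample resembling the output of `curl http://localhost:19393/metrics`.
SAMPLE = """\
# HELP pd_query_request_success_total Number of successful query requests
# TYPE pd_query_request_success_total counter
pd_query_request_success_total 200
pd_query_request_failure_total 4
pd_inference_count_total 200
pd_query_request_duration_us_total 500000
pd_inference_duration_us_total 300000
"""

metrics = parse_metrics(SAMPLE)
avg_request_us = average_us(
    metrics, "pd_query_request_duration_us_total", "pd_query_request_success_total"
)
avg_inference_us = average_us(
    metrics, "pd_inference_duration_us_total", "pd_inference_count_total"
)
```

In a real deployment the same ratios would normally be computed inside Prometheus itself from the scraped counters; the script only shows how the raw numbers relate.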
--- a/doc/Run_On_DCU_CN.md
+++ b/doc/Run_On_DCU_CN.md
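+## Deploying Paddle Serving on Hygon Chips
+Paddle Serving supports inference deployment on Hygon DCUs. The currently supported ROCm version is 4.0.1.
+## Installing the Docker Image
+We recommend deploying the Serving service with Docker. You can pull a Docker image with ROCm 4.0.1 preinstalled directly from Paddle's official image registry.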
+```
+# Pull the image
+docker pull paddlepaddle/paddle:latest-dev-rocm4.0-miopen2.11
+# Start the container; note that parameters such as shm-size and device must be configured
+docker run -it --name paddle-rocm-dev --shm-size=128G \
+     --device=/dev/kfd --device=/dev/dri --group-add video \
+     --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
+     paddlepaddle/paddle:latest-dev-rocm4.0-miopen2.11 /bin/bash
+# Check that the container correctly recognizes the Hygon DCU devices
+rocm-smi
+# Expected output:
+======================= ROCm System Management Interface =======================
+================================= Concise Info =================================
+GPU  Temp   AvgPwr  SCLK     MCLK    Fan   Perf  PwrCap  VRAM%  GPU%  
+0    50.0c  23.0W   1319Mhz  800Mhz  0.0%  auto  300.0W    0%   0%  
+1    48.0c  25.0W   1319Mhz  800Mhz  0.0%  auto  300.0W    0%   0%  
+2    48.0c  24.0W   1319Mhz  800Mhz  0.0%  auto  300.0W    0%   0%  
+3    49.0c  27.0W   1319Mhz  800Mhz  0.0%  auto  300.0W    0%   0%  
+================================================================================
+============================= End of ROCm SMI Log ==============================
+```
+## Compilation and Installation
+For basic environment setup, refer to [this document](Compile_CN.md).
+### Compilation
+* Build the server
+```
+cd Serving
+mkdir -p server-build-dcu && cd server-build-dcu
+cmake -DPYTHON_INCLUDE_DIR=/opt/conda/include/python3.7m/ \
+    -DPYTHON_LIBRARIES=/opt/conda/lib/libpython3.7m.so \
+    -DPYTHON_EXECUTABLE=/opt/conda/bin/python \
+    -DWITH_MKL=ON \
+    -DWITH_ROCM=ON \
+    -DSERVER=ON ..
+make -j10
+```
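+### Installing the wheel package
+After the build finishes, a whl package is generated under $build_dir/python/dist in each build directory; install each one. For the server step, for example, the whl package is generated under server-build-dcu/python/dist and can be installed with ```pip install -U xxx.whl```.
+## Deployment Example
+Taking [resnet50](../examples/C++/PaddleClas/resnet_v2_50/README_CN.md) as an example:
+### Starting the RPC service
+Start the RPC service, deployed on one card: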
+```
+python3 -m paddle_serving_server.serve --model resnet_v2_50_imagenet_model --port 9393 --gpu_ids 1
+```
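+## Additional Notes
+### Model Examples and Notes
+For the list of models supported on Hygon chips, see [this link](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/09_hardware_support/rocm_docs/paddle_rocm_cn.html). Model adaptation varies, and some models may not be supported. If you run into problems during deployment, please open a [GitHub issue](https://github.com/PaddlePaddle/Serving/issues) and we will follow up promptly.
+### Hygon Chip Support References
+* [Running PaddlePaddle on Hygon chips](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/09_hardware_support/rocm_docs/paddle_install_cn.html)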
\ No newline at end of file
--- a/doc/Run_On_JETSON_CN.md
+++ b/doc/Run_On_JETSON_CN.md
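+## Deploying Paddle Serving on Jetson
+Paddle Serving supports inference deployment on Jetson. Currently only Pipeline mode is supported.
+### Installing PaddlePaddle
+Follow the [NV Jetson deployment example](https://paddleinference.paddlepaddle.org.cn/demo_tutorial/cuda_jetson_demo.html) to install the Python version of PaddlePaddle.
+### Installing PaddleServing
+Install the ARM whl packages: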
+```
+# paddle-serving-server
+https://paddle-serving.bj.bcebos.com/whl/xpu/arm/paddle_serving_server_xpu-0.0.0.post2-py3-none-any.whl
+# paddle-serving-client
+https://paddle-serving.bj.bcebos.com/whl/xpu/arm/paddle_serving_client-0.0.0-cp36-none-any.whl
+# paddle-serving-app
+https://paddle-serving.bj.bcebos.com/whl/xpu/arm/paddle_serving_app-0.0.0-py3-none-any.whl
+```
+### Deployment
+Taking [Uci](../examples/Pipeline/simple_web_service/README_CN.md) as an example:
+Start the service:
+```
+python3 web_service.py &>log.txt &
+```
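+Modify the corresponding items in config.yml:
+```
+            #Compute hardware type: when omitted, it is determined by devices (CPU/GPU); 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
+            device_type: 1
+            #Compute hardware ID; the hardware type is determined by device_type first. CPU inference when devices is "" or omitted; GPU inference when it is "0" or "0,1,2", which lists the GPU cards to use
+            devices: "0,1"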
+```
+## Additional Notes
+### Jetson Support References
+* [Running PaddlePaddle on Jetson](https://paddleinference.paddlepaddle.org.cn/demo_tutorial/cuda_jetson_demo.html)
\ No newline at end of file
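The precedence between `device_type` and `devices` stated in the config.yml comments of the Run_On_JETSON_CN.md patch above can be written out as a small Python function. This is only a reading of those two comments, not Paddle Serving's own parsing code; the names `resolve_device` and `DEVICE_TYPES` are hypothetical.

```python
# Mapping taken from the config.yml comment: 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
DEVICE_TYPES = {0: "cpu", 1: "gpu", 2: "tensorRT", 3: "arm cpu", 4: "kunlun xpu"}


def resolve_device(device_type=None, devices=""):
    """Return (hardware_name, card_ids) as the config comments describe."""
    card_ids = [int(card) for card in devices.split(",") if card.strip()]
    if device_type is not None:
        # device_type takes priority whenever it is present.
        return DEVICE_TYPES[device_type], card_ids
    # Without device_type, an empty `devices` means CPU and anything else means GPU.
    return ("gpu", card_ids) if card_ids else ("cpu", [])


jetson_choice = resolve_device(device_type=1, devices="0,1")
fallback_choice = resolve_device(devices="")
```

With the values from the patch (`device_type: 1`, `devices: "0,1"`) the function selects GPU inference on cards 0 and 1.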
--- a/doc/Run_On_NPU_CN.md
+++ b/doc/Run_On_NPU_CN.md
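+## Deploying Paddle Serving on Ascend NPU Chips
+Paddle Serving supports inference deployment on Ascend NPU chips. Deployment is currently supported on Ascend chips (910/310) with ARM servers; support for other heterogeneous hardware servers will be added later.
+## Ascend 910
+### Installing the Docker Image
+We recommend deploying the Serving service with Docker. You can pull a Docker image with CANN Community Edition 5.0.2.alpha005 preinstalled directly from Paddle's official image registry.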
+```
+# Pull the image
+docker pull paddlepaddle/paddle:latest-dev-cann5.0.2.alpha005-gcc82-aarch64
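+# Start the container. Note the --device parameters: the container maps only the 4 NPU cards with device IDs 4 to 7; to map other cards, change the device IDs accordingly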
+docker run -it --name paddle-npu-dev -v /home/<user_name>:/workspace  \
+            --pids-limit 409600 --network=host --shm-size=128G \
+            --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
+            --device=/dev/davinci4 --device=/dev/davinci5 \
+            --device=/dev/davinci6 --device=/dev/davinci7 \
+            --device=/dev/davinci_manager \
+            --device=/dev/devmm_svm \
+            --device=/dev/hisi_hdc \
+            -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
+            -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+            -v /usr/local/dcmi:/usr/local/dcmi \
+            paddlepaddle/paddle:latest-dev-cann5.0.2.alpha005-gcc82-aarch64 /bin/bash
+# Check that the container correctly recognizes the mapped Ascend NPU devices
+npu-smi info
+# Expected output is similar to the following
+------------------------------------------------------------------------------------+
+| npu-smi 1.9.3                    Version: 21.0.rc1                                 |
+----------------------+---------------+---------------------------------------------+
+| NPU   Name           | Health        | Power(W)   Temp(C)                          |
+| Chip                 | Bus-Id        | AICore(%)  Memory-Usage(MB)  HBM-Usage(MB)  |
+======================+===============+=============================================+
+| 4     910A           | OK            | 67.2       30                               |
+| 0                    | 0000:C2:00.0  | 0          303  / 15171      0    / 32768   |
+======================+===============+=============================================+
+| 5     910A           | OK            | 63.8       25                               |
+| 0                    | 0000:82:00.0  | 0          2123 / 15171      0    / 32768   |
+======================+===============+=============================================+
+| 6     910A           | OK            | 67.1       27                               |
+| 0                    | 0000:42:00.0  | 0          1061 / 15171      0    / 32768   |
+======================+===============+=============================================+
+| 7     910A           | OK            | 65.5       30                               |
+| 0                    | 0000:02:00.0  | 0          2563 / 15078      0    / 32768   |
+======================+===============+=============================================+
+```
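+### Compilation and Installation
+For basic environment setup, refer to [this document](Compile_CN.md).
+***1. Install dependencies***
+Install the libraries required for the build, including patchelf and libcurl: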
+```
+apt-get install patchelf libcurl4-openssl-dev libbz2-dev libgeos-dev
+```
+***2. Set up the Go environment***
+Download and configure the ARM build of Go 1.17.2:
+```
+wget https://golang.org/dl/go1.17.2.linux-arm64.tar.gz
+tar zxvf go1.17.2.linux-arm64.tar.gz -C /usr/local/
+mkdir /root/go /root/go/bin /root/go/src
+echo "export GOROOT=/usr/local/go" >> /root/.bashrc
+echo "export GOPATH=/root/go" >> /root/.bashrc
+echo 'export PATH=/usr/local/go/bin:/root/go/bin:$PATH' >> /root/.bashrc
+source /root/.bashrc
+go env -w GO111MODULE=on
+go env -w GOPROXY=https://goproxy.cn,direct
+go install github.com/grpc-ecosystem/grpc-gateway/protoc-gen-grpc-gateway@v1.15.2
+go install github.com/grpc-ecosystem/grpc-gateway/protoc-gen-swagger@v1.15.2
+go install github.com/golang/protobuf/protoc-gen-go@v1.4.3
+go install google.golang.org/grpc@v1.33.0
+go env -w GO111MODULE=auto
+```
+***3. Set up the Python environment***
+Install the Python dependencies and configure the environment:
+```
+pip3.7 install -r python/requirements.txt -i https://mirror.baidu.com/pypi/simple
+export PYTHONROOT=/opt/conda
+export PYTHON_INCLUDE_DIR=$PYTHONROOT/include/python3.7m
+export PYTHON_LIBRARIES=$PYTHONROOT/lib/libpython3.7m.so
+export PYTHON_EXECUTABLE=$PYTHONROOT/bin/python3.7
+```
+***4. Build the server***
+```
+mkdir build-server-npu && cd build-server-npu
+cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR/ \
+    -DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
+    -DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
+    -DCMAKE_INSTALL_PREFIX=./output \
+    -DWITH_ASCEND_CL=ON \
+    -DSERVER=ON ..
+make TARGET=ARMV8 -j16
+```
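+***5. Install the built package***
+After the build finishes, a whl package is generated under $build_dir/python/dist in each build directory; install each one. For the server step, for example, the whl package is generated under build-server-npu/python/dist and can be installed with ```pip install -U xxx.whl```.
+### Deployment
+To deploy a service on ARM + Ascend 910, start the service with the following parameter.
+| Parameter | Description | Notes |
+| :------- | :-------------------------- | :--------------------------------------------------------------- |
+| use_ascend_cl | Use Ascend CL for inference | Enables Ascend inference |
+Taking [Bert](../examples/C++/PaddleNLP/bert/README_CN.md) as an example:
+Start the RPC service with Ascend NPU acceleration: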
+```
+python3 -m paddle_serving_server.serve --model bert_seq128_model --thread 6 --port 9292 --use_ascend_cl
+```
+## Ascend 310
+### Installing the Docker Image
+We recommend deploying the Serving service with Docker. You can pull a Docker image with CANN 3.3.0 installed.
+```
+# Pull the image
+docker pull registry.baidubce.com/paddlepaddle/serving:ascend-aarch64-cann3.3.0-paddlelite-devel
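+# Start the container. Note the --device parameters: the container maps only the 4 NPU cards with device IDs 4 to 7; to map other cards, change the device IDs accordingly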
+docker run -it --name paddle-npu-dev -v /home/<user_name>:/workspace  \
+            --pids-limit 409600 --network=host --shm-size=128G \
+            --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
+            --device=/dev/davinci4 --device=/dev/davinci5 \
+            --device=/dev/davinci6 --device=/dev/davinci7 \
+            --device=/dev/davinci_manager \
+            --device=/dev/devmm_svm \
+            --device=/dev/hisi_hdc \
+            -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
+            -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+            -v /usr/local/dcmi:/usr/local/dcmi \
+            registry.baidubce.com/paddlepaddle/serving:ascend-aarch64-cann3.3.0-paddlelite-devel /bin/bash
+```
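+### Compilation and Installation
+For basic environment setup, refer to [this document](Compile_CN.md).
+***1. Set up the Python environment***
+Install the Python dependencies and configure the environment: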
+```
+pip3.7 install -r python/requirements.txt -i https://mirror.baidu.com/pypi/simple
+export PYTHONROOT=/usr/local/python3.7.5
+export PYTHON_INCLUDE_DIR=$PYTHONROOT/include/python3.7m
+export PYTHON_LIBRARIES=$PYTHONROOT/lib/libpython3.7m.so
+export PYTHON_EXECUTABLE=$PYTHONROOT/bin/python3.7
+```
+***2. Build the server***
+```
+mkdir build-server-npu && cd build-server-npu
+cmake -DPYTHON_INCLUDE_DIR=$PYTHON_INCLUDE_DIR/ \
+    -DPYTHON_LIBRARIES=$PYTHON_LIBRARIES \
+    -DPYTHON_EXECUTABLE=$PYTHON_EXECUTABLE \
+    -DCMAKE_INSTALL_PREFIX=./output \
+    -DWITH_ASCEND_CL=ON \
+    -DWITH_LITE=ON \
+    -DSERVER=ON ..
+make TARGET=ARMV8 -j16
+```
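+***3. Install the built package***
+After the build finishes, a whl package is generated under $build_dir/python/dist in each build directory; install each one. For the server step, for example, the whl package is generated under build-server-npu/python/dist and can be installed with ```pip install -U xxx.whl```.
+### Deployment
+To deploy a service on ARM + Ascend 310, start the service with the following parameters.
+| Parameter | Description | Notes |
+| :------- | :-------------------------- | :--------------------------------------------------------------- |
+| use_ascend_cl | Use Ascend CL for inference | Enables Ascend inference |
+| use_lite | Use the Paddle-Lite Engine | Enables Paddle-Lite CPU inference |
+Taking [resnet50](../examples/C++/PaddleClas/resnet_v2_50/README_CN.md) as an example:
+Start the RPC service with Paddle-Lite NPU acceleration: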
+```
+python3 -m paddle_serving_server.serve --model resnet_v2_50_imagenet_model --thread 6 --port 9292 --use_ascend_cl --use_lite
+```
+## Additional Notes
+### NPU Chip Support References
+* [Running PaddlePaddle on Ascend NPU chips](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/09_hardware_support/npu_docs/paddle_install_cn.html)
\ No newline at end of file
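The Run_On_NPU_CN.md patch above notes that mapping different NPU cards into the container only requires changing the device IDs. A small Python helper can generate the matching `--device` flags. The function name `npu_device_flags` is hypothetical, and the three management device paths are copied from the `docker run` commands in the patch.

```python
def npu_device_flags(card_ids):
    """Build the `--device` flags that map the given NPU cards into a container."""
    flags = [f"--device=/dev/davinci{card}" for card in card_ids]
    # Management devices that the `docker run` commands in the patch always pass.
    flags += [
        "--device=/dev/davinci_manager",
        "--device=/dev/devmm_svm",
        "--device=/dev/hisi_hdc",
    ]
    return flags


# Cards 4 to 7, matching the example in the patch.
flags = npu_device_flags(range(4, 8))
command_fragment = " ".join(flags)
```

Calling `npu_device_flags([0, 1])` instead would map cards 0 and 1 while keeping the management devices unchanged.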
--- a/doc/Serving_Configure_CN.md
+++ b/doc/Serving_Configure_CN.md
@@ -364,11 +364,41 @@ dag:
    tracer:
        interval_s: 10
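+    #client type: brpc, grpc, or local_predictor. local_predictor does not start a Serving service and runs inference in-process
+    #client_type: local_predictor
+    #maximum channel length, default 0
+    #channel_size: 0
+    #For tensor parallelism in distributed large-model scenarios: keep the first result received and discard the rest to improve speed
+    #channel_recv_frist_arrive: False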
 op:
    det:
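        #Concurrency: thread concurrency when is_thread_op=True; otherwise process concurrency
        concurrency: 6
+        #Serving IPs
+        #server_endpoints: ["127.0.0.1:9393"]
+        #Fetch result list; names follow the alias_name of fetch_var in client_config
+        #fetch_list: ["concat_1.tmp_0"]
+        #client-side config for the det model
+        #client_config: serving_client_conf.prototxt
+        #Timeout for Serving calls, in ms
+        #timeout: 3000
+        #Number of retries for Serving calls; no retry by default
+        #retry: 1
+        # Number of requests sent to Serving per batch, default 1. When batch_size>1, set auto_batching_timeout; otherwise the op blocks until batch_size requests arrive
+        #batch_size: 2
+        # Batching timeout, used together with batch_size
+        #auto_batching_timeout: 2000
        #When the op config has no server_endpoints, the local service config is read from local_service_conf
        local_service_conf:
            #client type: brpc, grpc, or local_predictor. local_predictor does not start a Serving service and runs inference in-process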
@@ -399,6 +429,27 @@ op:
            #GPU supports: "fp32"(default), "fp16", "int8";
            #CPU supports: "fp32"(default), "fp16", "bf16"(mkldnn); not supported: "int8"
            precision: "fp32"
+            #mem_optim, memory / graphic memory optimization
+            #mem_optim: True
+            #use_calib, Use TRT int8 calibration
+            #use_calib: False
+            #use_mkldnn, Use mkldnn for cpu
+            #use_mkldnn: False
+            #The cache capacity of different input shapes for mkldnn
+            #mkldnn_cache_capacity: 0
+            #mkldnn_op_list, op list accelerated using MKLDNN, None default
+            #mkldnn_op_list: []
+            #mkldnn_bf16_op_list,op list accelerated using MKLDNN bf16, None default.
+            #mkldnn_bf16_op_list: []
+            #min_subgraph_size,the minimal subgraph size for opening tensorrt to optimize, 3 default
+            #min_subgraph_size: 3
    rec:
        #Concurrency: thread concurrency when is_thread_op=True; otherwise process concurrency
        concurrency: 3

--- a/doc/Serving_Configure_EN.md
+++ b/doc/Serving_Configure_EN.md
@@ -369,11 +369,41 @@ dag:
    tracer:
        interval_s: 10
+    #client type: brpc, grpc, or local_predictor.
+    #client_type: local_predictor
+    # max channel size, default 0
+    #channel_size: 0
+    #For tensor parallelism in distributed large-model scenarios: keep the first result received and discard the rest to improve speed
+    #channel_recv_frist_arrive: False
 op:
    det:
        #Concurrency: thread concurrency when is_thread_op=True; otherwise process concurrency
        concurrency: 6
+        #Serving IPs
+        #server_endpoints: ["127.0.0.1:9393"]
+        #Fetch data list
+        #fetch_list: ["concat_1.tmp_0"]
+        #det client config
+        #client_config: serving_client_conf.prototxt
+        #Serving timeout, ms
+        #timeout: 3000
+        #Serving retry times
+        #retry: 1
+        #Number of requests sent to Serving per batch, default 1. When batch_size>1, set auto_batching_timeout
+        #batch_size: 2
+        #Batching timeout, used with batch_size
+        #auto_batching_timeout: 2000
        #Loading local server configuration without server_endpoints.
        local_service_conf:
            #client type: brpc, grpc, or local_predictor.
@@ -404,6 +434,27 @@ op:
            #GPU supports: "fp32"(default), "fp16", "int8";
            #CPU supports: "fp32"(default), "fp16", "bf16"(mkldnn); not supported: "int8"
            precision: "fp32"
+            #mem_optim, memory / graphic memory optimization
+            #mem_optim: True
+            #use_calib, Use TRT int8 calibration
+            #use_calib: False
+            #use_mkldnn, Use mkldnn for cpu
+            #use_mkldnn: False
+            #The cache capacity of different input shapes for mkldnn
+            #mkldnn_cache_capacity: 0
+            #mkldnn_op_list, op list accelerated using MKLDNN, None default
+            #mkldnn_op_list: []
+            #mkldnn_bf16_op_list,op list accelerated using MKLDNN bf16, None default.
+            #mkldnn_bf16_op_list: []
+            #min_subgraph_size,the minimal subgraph size for opening tensorrt to optimize, 3 default
+            #min_subgraph_size: 3
    rec:
        #Concurrency: thread concurrency when is_thread_op=True; otherwise process concurrency
        concurrency: 3
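The `batch_size` and `auto_batching_timeout` comments in the two Serving_Configure patches above warn that a batch larger than 1 blocks when too few requests arrive unless a timeout is set. The Python sketch below reproduces that behaviour with a standard queue. It is an illustration only, not the Pipeline's actual batching code, and the function name `collect_batch` is hypothetical.

```python
import queue
import time


def collect_batch(q, batch_size, auto_batching_timeout_ms=None):
    """Gather up to batch_size items from q.

    With no timeout this blocks until batch_size items have arrived.
    With a timeout it returns whatever has arrived once the deadline passes.
    """
    batch = []
    deadline = None
    if auto_batching_timeout_ms is not None:
        deadline = time.monotonic() + auto_batching_timeout_ms / 1000.0
    while len(batch) < batch_size:
        if deadline is None:
            batch.append(q.get())  # may block indefinitely
            continue
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch


requests = queue.Queue()
requests.put("req-1")
# Only one request is waiting, so the timeout releases a partial batch.
partial = collect_batch(requests, batch_size=2, auto_batching_timeout_ms=50)
```

Calling `collect_batch(requests, batch_size=2)` on the same single-item queue would never return, which is the blocking case the comment warns about.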