diff --git a/doc/SERVING_CONFIGURE.md b/doc/SERVING_CONFIGURE.md index 60828bba51c54f5b7f49a0977ccef45caa0d1218..fd7a572625870e8c2799e8708bd47a2dd1f58db9 100644 --- a/doc/SERVING_CONFIGURE.md +++ b/doc/SERVING_CONFIGURE.md @@ -1,17 +1,20 @@ # Serving Configuration -## 简介 +([简体中文](SERVING_CONFIGURATION_CN.md)|English) -本文主要介绍C++ Server以及Python Server的各项配置: +## Overview -- [模型配置文件](#模型配置文件): 转换模型时自动生成,描述模型输入输出信息 -- [C++ Server](#c-server): 用于高性能场景,介绍了快速启动以及自定义配置方法 -- [Python Server](#python-server): 用于单算子多模型组合场景 +This guide focuses on Paddle C++ Server and Python Server configuration: -## 模型配置文件 +- [Model Configuration](#model-configuration): Auto generated when converting model. Specify model input/output. +- [C++ Server](#c-server): High-performance scenarios. Specify how to start quickly and start with user-defined configuration. +- [Python Server](#python-server): Multiple model combined scenarios. -在开始介绍Server配置之前,先来介绍一下模型配置文件。我们在将模型转换为PaddleServing模型时,会生成对应的serving_client_conf.prototxt以及serving_server_conf.prototxt,两者内容一致,为模型输入输出的参数信息,方便用户拼装参数。该配置文件用于Server以及Client,并不需要用户自行修改。转换方法参考文档《[怎样保存用于Paddle Serving的模型](SAVE_CN.md)》。protobuf格式可参考`core/configure/proto/general_model_config.proto`。 -样例如下: +## Model Configuration + +The model configuration is generated by converting PaddleServing model and named serving_client_conf.prototxt/serving_server_conf.prototxt. It specifies the info of input/output so that users can fill parameters easily. The model configuration file should not be modified. See the [Saving guide](SAVE.md) for model converting. The model configuration file provided must be a `core/configure/proto/general_model_config.proto`. + +Example: ``` feed_var { @@ -22,30 +25,49 @@ feed_var { shape: 13 } fetch_var { - name: "fc_0.tmp_1" - alias_name: "price" + name: "concat_1.tmp_0" + alias_name: "concat_1.tmp_0" is_lod_tensor: false fetch_type: 1 - shape: 1 + shape: 3 + shape: 640 + shape: 640 } ``` -其中 - -- feed_var为输入变量,fetch_var为输出变量 -- 具体配置:name(名称)、alias_name(别名,与名称对应,用于输出)、is_lod_tensor(是否为lod)、feed_type/fetch_type(数据类型)、shape(数据维度) +- feed_var:model input +- fetch_var:model output +- name:node name +- alias_name:alias name +- is_lod_tensor:lod tensor, ref to [Lod Introduction](LOD.md) +- feed_type/fetch_type:data type + +|feed_type|类型| +|---------|----| +|0|INT64| +|1|FLOAT32| +|2|INT32| +|3|FP64| +|4|INT16| +|5|FP16| +|6|BF16| +|7|UINT8| +|8|INT8| + +- shape:tensor shape ## C++ Server -### 1.快速启动 +### 1. Quick start -可以通过配置模型及端口号快速启动服务,启动命令如下: +The easiest way to start c++ server is to provide the `--model` and `--port` flags. +Example starting c++ server: ```BASH python3 -m paddle_serving_server.serve --model serving_model --port 9393 ``` -该命令会自动生成配置文件,并使用生成的配置文件启动C++ Server。例如上述启动命令会自动生成workdir_9393目录,其结构如下 +This command will generate the server configuration files as `workdir_9393`: ``` workdir_9393 @@ -58,7 +80,7 @@ workdir_9393 └── workflow.prototxt ``` -更多启动参数详见下表: +More flags: | Argument | Type | Default | Description | | ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- | | `thread` | int | `2` | Number of brpc service thread | @@ -77,25 +99,28 @@ workdir_9393 | `use_calib` | bool | False | Use TRT int8 calibration | | `gpu_multi_stream` | bool | False | EnableGpuMultiStream to get larger QPS | -#### 当您的某个模型想使用多张GPU卡部署时. +#### Serving model with multiple gpus. ```BASH python3 -m paddle_serving_server.serve --model serving_model --thread 10 --port 9292 --gpu_ids 0,1,2 ``` -#### 当您的一个服务包含两个模型部署时. +#### Serving two models. ```BASH python3 -m paddle_serving_server.serve --model serving_model_1 serving_model_2 --thread 10 --port 9292 ``` -### 2.自定义配置启动 +### 2. Starting with user-defined Configuration + +Mostly, the flags can meet the demand. However, the model configuration files can be modified by user that include service.prototxt、workflow.prototxt、resource.prototxt、model_toolkit.prototxt、proj.conf. + +Example starting with self-defined config: -Serving支持自定义配置方式启动,可以使用自行定义的配置文件启动服务。这些配置文件包括service.prototxt、workflow.prototxt、resource.prototxt、model_toolkit.prototxt、proj.conf。启动命令如下: ```BASH /bin/serving --flagfile=proj.conf ``` #### 2.1 proj.conf -proj.conf用于传入服务参数,并指定了其他相关配置文件的路径 +You can provide proj.conf with lots of flags: ``` # for paddle inference --precision=fp32 @@ -114,7 +139,8 @@ proj.conf用于传入服务参数,并指定了其他相关配置文件的路 --workflow_path=conf --workflow_file=workflow.prototxt ``` -各项参数的描述及默认值详见下表: + +The table below sets out the detailed description: | name | Default | Description | |------|--------|------| |precision|"fp32"|Precision Mode, support FP32, FP16, INT8| @@ -133,7 +159,7 @@ proj.conf用于传入服务参数,并指定了其他相关配置文件的路 #### 2.2 service.prototxt -service.prototxt用于配置Paddle Serving实例挂载的service列表。通过`--inferservice_path`和`--inferservice_file`指定加载路径。protobuf格式可参考`core/configure/server_configure.protobuf`的`InferServiceConf`。示例如下: +To set listening port, modify service.prototxt. You can set the `--inferservice_path` and `--inferservice_file` to instruct the server to check for service.prototxt. The `service.prototxt` file provided must be a `core/configure/server_configure.protobuf:InferServiceConf`. ``` port: 8010 @@ -143,13 +169,14 @@ services { } ``` -其中: -- port: 用于配置Serving实例监听的端口号。 -- services: 使用默认配置即可,不可修改。name指定service名称,workflow1的具体定义在workflow.prototxt +- port: Listening port. +- services: No need to modify. The workflow1 is defined in workflow.prototxt. #### 2.3 workflow.prototxt -workflow.prototxt用来描述具体的workflow。通过`--workflow_path`和`--workflow_file`指定加载路径。protobuf格式可参考`configure/server_configure.protobuf`的`Workflow`类型。示例如下: +To server user-defined OP, modify workflow.prototxt. You can set the `--workflow_path` and `--inferservice_file` to instruct the server to check for workflow.prototxt. The `workflow.prototxt` provided must be a `core/configure/server_configure.protobuf:Workflow`. + +In the blow example, you are serving model with 3 OPs. The GeneralReaderOp converts the input data to tensor. The GeneralInferOp which depends the output of GeneralReaderOp predicts the tensor. The GeneralResponseOp return the output data. ``` workflows { @@ -177,20 +204,20 @@ workflows { } } ``` -其中: - -- name: workflow名称,用于从service.prototxt索引到具体的workflow -- workflow_type: 只支持"Sequence" -- nodes: 用于串联成workflow的所有节点,可配置多个nodes。nodes间通过配置dependencies串联起来 -- node.name: 与node.type一一对应,具体可参考`python/paddle_serving_server/dag.py` -- node.type: 当前node所执行OP的类名称,与serving/op/下每个具体的OP类的名称对应 -- node.dependencies: 依赖的上游node列表 -- node.dependencies.name: 与workflow内节点的name保持一致 + +- name: The name of workflow. +- workflow_type: "Sequence" +- nodes: A workflow consists of nodes. +- node.name: The name of node. Corresponding to node type. Ref to `python/paddle_serving_server/dag.py` +- node.type: The bound operator. Ref to OPS in `serving/op`. +- node.dependencies: The list of upstream dependent operators. +- node.dependencies.name: The name of dependent operators. - node.dependencies.mode: RO-Read Only, RW-Read Write #### 2.4 resource.prototxt -resource.prototxt,用于指定模型配置文件。通过`--resource_path`和`--resource_file`指定加载路径。它的protobuf格式参考`core/configure/proto/server_configure.proto`的`ResourceConf`。示例如下: +You may modify resource.prototxt to set the path of model files. You can set the `--resource_path` and `--resource_file` to instruct the server to check for resource.prototxt. The `resource.prototxt` provided must be a `core/configure/server_configure.proto:Workflow`. + ``` model_toolkit_path: "conf" @@ -199,16 +226,16 @@ general_model_path: "conf" general_model_file: "general_infer_0/general_model.prototxt" ``` -其中: - -- model_toolkit_path:用来指定model_toolkit.prototxt所在的目录 -- model_toolkit_file: 用来指定model_toolkit.prototxt所在的文件名 -- general_model_path: 用来指定general_model.prototxt所在的目录 -- general_model_file: 用来指定general_model.prototxt所在的文件名 +- model_toolkit_path: The diectory path of model_toolkil.prototxt. +- model_toolkit_file: The file name of model_toolkil.prototxt. +- general_model_path: The diectory path of general_model.prototxt. +- general_model_file: The file name of general_model.prototxt. #### 2.5 model_toolkit.prototxt -用来配置模型信息和预测引擎。它的protobuf格式参考`core/configure/proto/server_configure.proto`的ModelToolkitConf。model_toolkit.protobuf的磁盘路径不能通过命令行参数覆盖。示例如下: +The model_toolkit.prototxt specifies the parameters of predictor engines. The `model_toolkit.prototxt` provided must be a `core/configure/server_configure.proto:ModelToolkitConf`. + +Example using cpu engine: ``` engines { @@ -233,38 +260,43 @@ engines { } ``` -其中 - -- name: 引擎名称,与workflow.prototxt中的node.name以及所在目录名称对应 -- type: 预测引擎的类型。当前只支持”PADDLE_INFER“ -- reloadable_meta: 目前实际内容无意义,用来通过对该文件的mtime判断是否超过reload时间阈值 -- reloadable_type: 检查reload条件:timestamp_ne/timestamp_gt/md5sum/revision/none +- name: The name of engine corresponding to the node name in workflow.prototxt. +- type: Only support ”PADDLE_INFER“ +- reloadable_meta: Specify the mark file of reload. +- reloadable_type: Support timestamp_ne/timestamp_gt/md5sum/revision/none -|reloadable_type|含义| +|reloadable_type|Description| |---------------|----| -|timestamp_ne|reloadable_meta所指定文件的mtime时间戳发生变化| -|timestamp_gt|reloadable_meta所指定文件的mtime时间戳大于等于上次检查时记录的mtime时间戳| -|md5sum|目前无用,配置后永远不reload| -|revision|目前无用,配置后用于不reload| - -- model_dir: 模型文件路径 -- gpu_ids: 引擎运行时使用的GPU device id,支持指定多个 -- enable_memory_optimization: 是否开启memory优化 -- enable_ir_optimization: 是否开启ir优化 -- use_trt: 是否开启TensorRT,需同时开启use_gpu -- use_lite: 是否开启PaddleLite -- use_xpu: 是否使用昆仑XPU -- use_gpu:是否使用GPU -- combined_model: 是否使用组合模型文件 -- gpu_multi_stream: 是否开启gpu多流模式 -- runtime_thread_num: 若大于0, 则启用Async异步模式,并创建对应数量的predictor实例。 -- batch_infer_size: Async异步模式下的最大batch数 -- enable_overrun: Async异步模式下总是将整个任务放入任务队列 -- allow_split_request: Async异步模式下允许拆分任务 +|timestamp_ne|when the mtime of reloadable_meta file changed| +|timestamp_gt|When the mtime of reloadable_meta file greater than last record| +|md5sum|No use| +|revision|No use| + +- model_dir: The path of model files. +- gpu_ids: Specify the gpu ids. Support multiple device ids: +``` +# GPU0,1,2 +gpu_ids: 0 +gpu_ids: 1 +gpu_ids: 2 +``` +- enable_memory_optimization: Enable memory optimization. +- enable_ir_optimization: Enable ir optimization. +- use_trt: Enable Tensor RT. Need use_gpu on. +- use_lite: Enable PaddleLite. +- use_xpu: Enable KUNLUN XPU. +- use_gpu: Enbale GPU. +- combined_model: Enable combined model. +- gpu_multi_stream: Enable gpu multiple stream mode. +- runtime_thread_num: Enable Async mode when num greater than 0 and creating predictors. +- batch_infer_size: The max batch size of Async mode. +- enable_overrun: Enable over running of Async mode which means putting the whole task into the task queue. +- allow_split_request: Allow to split request task in Async mode. #### 2.6 general_model.prototxt -general_model.prototxt内容与模型配置serving_server_conf.prototxt相同,用了描述模型输入输出参数信息。示例如下: +The content of general_model.prototxt is same as serving_server_conf.prototxt. + ``` feed_var { name: "x" @@ -284,51 +316,53 @@ fetch_var { ## Python Server -Python Server提供了用户友好的多模型组合服务编程框架,适用于多模型组合应用的场景。 -其配置文件为YAML格式,一般默认为config.yaml。示例如下: +Python Server provides a user-friendly programming framework for multi-model composite services. + +Example of config.yaml: ```YAML -#rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +#RPC port. The RPC port and HTTP port cannot be empyt at the same time. If the RPC port is empty and the HTTP port is not empty, the RPC port is automatically set to HTTP port+1. rpc_port: 18090 -#http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +#HTTP port. The RPC port and the HTTP port cannot be empty at the same time. If the RPC port is available and the HTTP port is empty, the HTTP port is not automatically generated http_port: 9999 -#worker_num, 最大并发数。当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG -##当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +#worker_num, the maximum concurrency. +#When build_dag_each_worker=True, server will create processes within GRPC Server ans DAG. +#When build_dag_each_worker=False, server will set the threadpool of GRPC. worker_num: 20 -#build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +#build_dag_each_worker, False,create process with DAG;True,create process with multiple independent DAG build_dag_each_worker: false dag: - #op资源类型, True, 为线程模型;False,为进程模型 + #True, thread model;False,process model is_thread_op: False - #重试次数 + #retry times retry: 1 - #使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + # True,generate the TimeLine data;False use_profile: false tracer: interval_s: 10 op: det: - #并发数,is_thread_op=True时,为线程并发;否则为进程并发 + #concurrency,is_thread_op=True,thread otherwise process concurrency: 6 - #当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + #Loading local server configuration without server_endpoints. local_service_conf: - #client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + #client type,include brpc, grpc and local_predictor. client_type: local_predictor - #det模型路径 + #det model path model_config: ocr_det_model - #Fetch结果列表,以client_config中fetch_var的alias_name为准 + #Fetch data list fetch_list: ["concat_1.tmp_0"] - #计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + #Device ID devices: "" # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu @@ -337,34 +371,31 @@ op: #use_mkldnn #use_mkldnn: True - #thread_num - thread_num: 2 - #ir_optim ir_optim: True rec: - #并发数,is_thread_op=True时,为线程并发;否则为进程并发 + #concurrency,is_thread_op=True,thread otherwise process concurrency: 3 - #超时时间, 单位ms + #time out, ms timeout: -1 - #Serving交互重试次数,默认不重试 + #retry times retry: 1 - #当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + #Loading local server configuration without server_endpoints. local_service_conf: - #client类型,包括brpc, grpc和local_predictor。local_predictor不启动Serving服务,进程内预测 + #client type,include brpc, grpc and local_predictor. client_type: local_predictor - #rec模型路径 + #rec model path model_config: ocr_rec_model - #Fetch结果列表,以client_config中fetch_var的alias_name为准 + #Fetch data list fetch_list: ["ctc_greedy_decoder_0.tmp_0", "softmax_0.tmp_0"] - #计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + #Device ID devices: "" # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu @@ -373,53 +404,51 @@ op: #use_mkldnn #use_mkldnn: True - #thread_num - thread_num: 2 - #ir_optim ir_optim: True ``` -### 单机多卡 +### Single-machine and multi-card inference + +Single-machine multi-card inference can be abstracted into M OP processes bound to N GPU cards. It is related to the configuration of three parameters in config.yml. First, select the process mode, the number of concurrent processes is the number of processes, and devices is the GPU card ID.The binding method is to traverse the GPU card ID when the process starts, for example, start 7 OP processes, set devices:0,1,2 in config.yml, then the first, fourth, and seventh started processes are bound to the 0 card, and the second , 4 started processes are bound to 1 card, 3 and 6 processes are bound to card 2. -单机多卡推理,M个OP进程与N个GPU卡绑定,需要在config.ymal中配置3个参数。首先选择进程模式,这样并发数即进程数,然后配置devices。绑定方法是进程启动时遍历GPU卡ID,例如启动7个OP进程,设置了0,1,2三个device id,那么第1、4、7个启动的进程与0卡绑定,第2、5进程与1卡绑定,3、6进程与卡2绑定。 +Reference config.yaml: ```YAML -#op资源类型, True, 为线程模型;False,为进程模型 +#True, thread model;False,process model is_thread_op: False -#并发数,is_thread_op=True时,为线程并发;否则为进程并发 +#concurrency,is_thread_op=True,thread otherwise process concurrency: 7 devices: "0,1,2" ``` -### 异构硬件 +### Heterogeneous Devices -Python Server除了支持CPU、GPU之外,还支持多种异构硬件部署。在config.yaml中由device_type和devices控制。优先使用device_type指定,当其空缺时根据devices自动判断类型。device_type描述如下: +In addition to supporting CPU and GPU, Pipeline also supports the deployment of a variety of heterogeneous hardware. It consists of device_type and devices in config.yml. Use device_type to specify the type first, and judge according to devices when it is vacant. The device_type is described as follows: - CPU(Intel) : 0 - GPU : 1 - TensorRT : 2 - CPU(Arm) : 3 - XPU : 4 -config.yml中硬件配置: +Reference config.yaml: ```YAML -#计算硬件类型: 空缺时由devices决定(CPU/GPU),0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu +# device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu device_type: 0 -#计算硬件ID,优先由device_type决定硬件类型。devices为""或空缺时为CPU预测;当为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 devices: "" # "0,1" ``` -### 低精度推理 +### Low precision inference -Python Serving支持低精度推理,CPU、GPU和TensoRT支持的精度类型如下所示: +Python Server supports low-precision inference. The precision types supported by CPU, GPU and TensoRT are shown in the figure below: - CPU - fp32(default) - fp16 - bf16(mkldnn) - GPU - fp32(default) - - fp16 + - fp16(TRT effects) - int8 - Tensor RT - fp32(default) @@ -427,8 +456,8 @@ Python Serving支持低精度推理,CPU、GPU和TensoRT支持的精度类型 - int8 ```YAML -#precsion, 预测精度,降低预测精度可提升预测速度 -#GPU 支持: "fp32"(default), "fp16", "int8"; -#CPU 支持: "fp32"(default), "fp16", "bf16"(mkldnn); 不支持: "int8" +#precsion +#GPU support: "fp32"(default), "fp16(TensorRT)", "int8"; +#CPU support: "fp32"(default), "fp16", "bf16"(mkldnn); not support: "int8" precision: "fp32" ``` \ No newline at end of file diff --git a/doc/SERVING_CONFIGURE_CN.md b/doc/SERVING_CONFIGURE_CN.md new file mode 100644 index 0000000000000000000000000000000000000000..03264b039761b68542d8ae0173f17b95ad67cc59 --- /dev/null +++ b/doc/SERVING_CONFIGURE_CN.md @@ -0,0 +1,456 @@ +# Serving Configuration + +(简体中文|[English](SERVING_CONFIGURATION.md)) + +## 简介 + +本文主要介绍C++ Server以及Python Server的各项配置: + +- [模型配置文件](#模型配置文件): 转换模型时自动生成,描述模型输入输出信息 +- [C++ Server](#c-server): 用于高性能场景,介绍了快速启动以及自定义配置方法 +- [Python Server](#python-server): 用于单算子多模型组合场景 + +## 模型配置文件 + +在开始介绍Server配置之前,先来介绍一下模型配置文件。我们在将模型转换为PaddleServing模型时,会生成对应的serving_client_conf.prototxt以及serving_server_conf.prototxt,两者内容一致,为模型输入输出的参数信息,方便用户拼装参数。该配置文件用于Server以及Client,并不需要用户自行修改。转换方法参考文档《[怎样保存用于Paddle Serving的模型](SAVE_CN.md)》。protobuf格式可参考`core/configure/proto/general_model_config.proto`。 +样例如下: + +``` +feed_var { + name: "x" + alias_name: "x" + is_lod_tensor: false + feed_type: 1 + shape: 13 +} +fetch_var { + name: "concat_1.tmp_0" + alias_name: "concat_1.tmp_0" + is_lod_tensor: false + fetch_type: 1 + shape: 3 + shape: 640 + shape: 640 +} +``` + +其中 +- feed_var:模型输入 +- fetch_var:模型输出 +- name:名称 +- alias_name:别名,与名称对应 +- is_lod_tensor:是否为lod,具体可参考《[Lod字段说明](LOD_CN.md)》 +- feed_type:数据类型 + +|feed_type|类型| +|---------|----| +|0|INT64| +|1|FLOAT32| +|2|INT32| +|3|FP64| +|4|INT16| +|5|FP16| +|6|BF16| +|7|UINT8| +|8|INT8| + +- shape:数据维度 + +## C++ Server + +### 1.快速启动 + +可以通过配置模型及端口号快速启动服务,启动命令如下: + +```BASH +python3 -m paddle_serving_server.serve --model serving_model --port 9393 +``` + +该命令会自动生成配置文件,并使用生成的配置文件启动C++ Server。例如上述启动命令会自动生成workdir_9393目录,其结构如下 + +``` +workdir_9393 +├── general_infer_0 +│   ├── fluid_time_file +│   ├── general_model.prototxt +│   └── model_toolkit.prototxt +├── infer_service.prototxt +├── resource.prototxt +└── workflow.prototxt +``` + +更多启动参数详见下表: +| Argument | Type | Default | Description | +| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- | +| `thread` | int | `2` | Number of brpc service thread | +| `op_num` | int[]| `0` | Thread Number for each model in asynchronous mode | +| `op_max_batch` | int[]| `32` | Batch Number for each model in asynchronous mode | +| `gpu_ids` | str[]| `"-1"` | Gpu card id for each model | +| `port` | int | `9292` | Exposed port of current service to users | +| `model` | str[]| `""` | Path of paddle model directory to be served | +| `mem_optim_off` | - | - | Disable memory / graphic memory optimization | +| `ir_optim` | bool | False | Enable analysis and optimization of calculation graph | +| `use_mkl` (Only for cpu version) | - | - | Run inference with MKL | +| `use_trt` (Only for trt version) | - | - | Run inference with TensorRT | +| `use_lite` (Only for Intel x86 CPU or ARM CPU) | - | - | Run PaddleLite inference | +| `use_xpu` | - | - | Run PaddleLite inference with Baidu Kunlun XPU | +| `precision` | str | FP32 | Precision Mode, support FP32, FP16, INT8 | +| `use_calib` | bool | False | Use TRT int8 calibration | +| `gpu_multi_stream` | bool | False | EnableGpuMultiStream to get larger QPS | + +#### 当您的某个模型想使用多张GPU卡部署时. +```BASH +python3 -m paddle_serving_server.serve --model serving_model --thread 10 --port 9292 --gpu_ids 0,1,2 +``` +#### 当您的一个服务包含两个模型部署时. +```BASH +python3 -m paddle_serving_server.serve --model serving_model_1 serving_model_2 --thread 10 --port 9292 +``` + +### 2.自定义配置启动 + +一般情况下,自动生成的配置可以应对大部分场景。对于特殊场景,用户也可自行定义配置文件。这些配置文件包括service.prototxt、workflow.prototxt、resource.prototxt、model_toolkit.prototxt、proj.conf。启动命令如下: +```BASH +/bin/serving --flagfile=proj.conf +``` + +#### 2.1 proj.conf + +proj.conf用于传入服务参数,并指定了其他相关配置文件的路径。如果重复传入参数,则以最后序参数值为准。 +``` +# for paddle inference +--precision=fp32 +--use_calib=False +--reload_interval_s=10 +# for brpc +--max_concurrency=0 +--num_threads=10 +--bthread_concurrency=10 +--max_body_size=536870912 +# default path +--inferservice_path=conf +--inferservice_file=infer_service.prototxt +--resource_path=conf +--resource_file=resource.prototxt +--workflow_path=conf +--workflow_file=workflow.prototxt +``` +各项参数的描述及默认值详见下表: +| name | Default | Description | +|------|--------|------| +|precision|"fp32"|Precision Mode, support FP32, FP16, INT8| +|use_calib|False|Only for deployment with TensorRT| +|reload_interval_s|10|Reload interval| +|max_concurrency|0|Limit of request processing in parallel, 0: unlimited| +|num_threads|10|Number of brpc service thread| +|bthread_concurrency|10|Number of bthread| +|max_body_size|536870912|Max size of brpc message| +|inferservice_path|"conf"|Path of inferservice conf| +|inferservice_file|"infer_service.prototxt"|Filename of inferservice conf| +|resource_path|"conf"|Path of resource conf| +|resource_file|"resource.prototxt"|Filename of resource conf| +|workflow_path|"conf"|Path of workflow conf| +|workflow_file|"workflow.prototxt"|Filename of workflow conf| + +#### 2.2 service.prototxt + +service.prototxt用于配置Paddle Serving实例挂载的service列表。通过`--inferservice_path`和`--inferservice_file`指定加载路径。protobuf格式可参考`core/configure/server_configure.protobuf`的`InferServiceConf`。示例如下: + +``` +port: 8010 +services { + name: "GeneralModelService" + workflows: "workflow1" +} +``` + +其中: +- port: 用于配置Serving实例监听的端口号。 +- services: 使用默认配置即可,不可修改。name指定service名称,workflow1的具体定义在workflow.prototxt + +#### 2.3 workflow.prototxt + +workflow.prototxt用来描述具体的workflow。通过`--workflow_path`和`--workflow_file`指定加载路径。protobuf格式可参考`configure/server_configure.protobuf`的`Workflow`类型。 +如下示例,workflow由3个OP构成,GeneralReaderOp用于读取数据,GeneralInferOp依赖于GeneralReaderOp并进行预测,GeneralResponseOp将预测结果返回: + +``` +workflows { + name: "workflow1" + workflow_type: "Sequence" + nodes { + name: "general_reader_0" + type: "GeneralReaderOp" + } + nodes { + name: "general_infer_0" + type: "GeneralInferOp" + dependencies { + name: "general_reader_0" + mode: "RO" + } + } + nodes { + name: "general_response_0" + type: "GeneralResponseOp" + dependencies { + name: "general_infer_0" + mode: "RO" + } + } +} +``` +其中: + +- name: workflow名称,用于从service.prototxt索引到具体的workflow +- workflow_type: 只支持"Sequence" +- nodes: 用于串联成workflow的所有节点,可配置多个nodes。nodes间通过配置dependencies串联起来 +- node.name: 与node.type一一对应,具体可参考`python/paddle_serving_server/dag.py` +- node.type: 当前node所执行OP的类名称,与serving/op/下每个具体的OP类的名称对应 +- node.dependencies: 依赖的上游node列表 +- node.dependencies.name: 与workflow内节点的name保持一致 +- node.dependencies.mode: RO-Read Only, RW-Read Write + +#### 2.4 resource.prototxt + +resource.prototxt,用于指定模型配置文件。通过`--resource_path`和`--resource_file`指定加载路径。它的protobuf格式参考`core/configure/proto/server_configure.proto`的`ResourceConf`。示例如下: + +``` +model_toolkit_path: "conf" +model_toolkit_file: "general_infer_0/model_toolkit.prototxt" +general_model_path: "conf" +general_model_file: "general_infer_0/general_model.prototxt" +``` + +其中: + +- model_toolkit_path:用来指定model_toolkit.prototxt所在的目录 +- model_toolkit_file: 用来指定model_toolkit.prototxt所在的文件名 +- general_model_path: 用来指定general_model.prototxt所在的目录 +- general_model_file: 用来指定general_model.prototxt所在的文件名 + +#### 2.5 model_toolkit.prototxt + +用来配置模型信息和预测引擎。它的protobuf格式参考`core/configure/proto/server_configure.proto`的ModelToolkitConf。model_toolkit.protobuf的磁盘路径不能通过命令行参数覆盖。示例如下: + +``` +engines { + name: "general_infer_0" + type: "PADDLE_INFER" + reloadable_meta: "uci_housing_model/fluid_time_file" + reloadable_type: "timestamp_ne" + model_dir: "uci_housing_model" + gpu_ids: -1 + enable_memory_optimization: true + enable_ir_optimization: false + use_trt: false + use_lite: false + use_xpu: false + use_gpu: false + combined_model: false + gpu_multi_stream: false + runtime_thread_num: 0 + batch_infer_size: 32 + enable_overrun: false + allow_split_request: true +} +``` + +其中 + +- name: 引擎名称,与workflow.prototxt中的node.name以及所在目录名称对应 +- type: 预测引擎的类型。当前只支持”PADDLE_INFER“ +- reloadable_meta: 目前实际内容无意义,用来通过对该文件的mtime判断是否超过reload时间阈值 +- reloadable_type: 检查reload条件:timestamp_ne/timestamp_gt/md5sum/revision/none + +|reloadable_type|含义| +|---------------|----| +|timestamp_ne|reloadable_meta所指定文件的mtime时间戳发生变化| +|timestamp_gt|reloadable_meta所指定文件的mtime时间戳大于等于上次检查时记录的mtime时间戳| +|md5sum|目前无用,配置后永远不reload| +|revision|目前无用,配置后用于不reload| + +- model_dir: 模型文件路径 +- gpu_ids: 引擎运行时使用的GPU device id,支持指定多个,如: +``` +# 指定GPU0,1,2 +gpu_ids: 0 +gpu_ids: 1 +gpu_ids: 2 +``` +- enable_memory_optimization: 是否开启memory优化 +- enable_ir_optimization: 是否开启ir优化 +- use_trt: 是否开启TensorRT,需同时开启use_gpu +- use_lite: 是否开启PaddleLite +- use_xpu: 是否使用昆仑XPU +- use_gpu:是否使用GPU +- combined_model: 是否使用组合模型文件 +- gpu_multi_stream: 是否开启gpu多流模式 +- runtime_thread_num: 若大于0, 则启用Async异步模式,并创建对应数量的predictor实例。 +- batch_infer_size: Async异步模式下的最大batch数 +- enable_overrun: Async异步模式下总是将整个任务放入任务队列 +- allow_split_request: Async异步模式下允许拆分任务 + +#### 2.6 general_model.prototxt + +general_model.prototxt内容与模型配置serving_server_conf.prototxt相同,用了描述模型输入输出参数信息。示例如下: +``` +feed_var { + name: "x" + alias_name: "x" + is_lod_tensor: false + feed_type: 1 + shape: 13 +} +fetch_var { + name: "fc_0.tmp_1" + alias_name: "price" + is_lod_tensor: false + fetch_type: 1 + shape: 1 +} +``` + +## Python Server + +Python Server提供了用户友好的多模型组合服务编程框架,适用于多模型组合应用的场景。 +其配置文件为YAML格式,一般默认为config.yaml。示例如下: +```YAML +#rpc端口, rpc_port和http_port不允许同时为空。当rpc_port为空且http_port不为空时,会自动将rpc_port设置为http_port+1 +rpc_port: 18090 + +#http端口, rpc_port和http_port不允许同时为空。当rpc_port可用且http_port为空时,不自动生成http_port +http_port: 9999 + +#worker_num, 最大并发数。当build_dag_each_worker=True时, 框架会创建worker_num个进程,每个进程内构建grpcSever和DAG +##当build_dag_each_worker=False时,框架会设置主线程grpc线程池的max_workers=worker_num +worker_num: 20 + +#build_dag_each_worker, False,框架在进程内创建一条DAG;True,框架会每个进程内创建多个独立的DAG +build_dag_each_worker: false + +dag: + #op资源类型, True, 为线程模型;False,为进程模型 + is_thread_op: False + + #重试次数 + retry: 1 + + #使用性能分析, True,生成Timeline性能数据,对性能有一定影响;False为不使用 + use_profile: false + tracer: + interval_s: 10 + +op: + det: + #并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 6 + + #当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + #client类型,包括brpc, grpc和local_predictor.local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + + #det模型路径 + model_config: ocr_det_model + + #Fetch结果列表,以client_config中fetch_var的alias_name为准 + fetch_list: ["concat_1.tmp_0"] + + #计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: "" + + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 0 + + #use_mkldnn + #use_mkldnn: True + + #ir_optim + ir_optim: True + rec: + #并发数,is_thread_op=True时,为线程并发;否则为进程并发 + concurrency: 3 + + #超时时间, 单位ms + timeout: -1 + + #Serving交互重试次数,默认不重试 + retry: 1 + + #当op配置没有server_endpoints时,从local_service_conf读取本地服务配置 + local_service_conf: + + #client类型,包括brpc, grpc和local_predictor。local_predictor不启动Serving服务,进程内预测 + client_type: local_predictor + + #rec模型路径 + model_config: ocr_rec_model + + #Fetch结果列表,以client_config中fetch_var的alias_name为准 + fetch_list: ["ctc_greedy_decoder_0.tmp_0", "softmax_0.tmp_0"] + + #计算硬件ID,当devices为""或不写时为CPU预测;当devices为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 + devices: "" + + # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu + device_type: 0 + + #use_mkldnn + #use_mkldnn: True + + #ir_optim + ir_optim: True +``` + +### 单机多卡 + +单机多卡推理,M个OP进程与N个GPU卡绑定,需要在config.ymal中配置3个参数。首先选择进程模式,这样并发数即进程数,然后配置devices。绑定方法是进程启动时遍历GPU卡ID,例如启动7个OP进程,设置了0,1,2三个device id,那么第1、4、7个启动的进程与0卡绑定,第2、5进程与1卡绑定,3、6进程与卡2绑定。 +```YAML +#op资源类型, True, 为线程模型;False,为进程模型 +is_thread_op: False + +#并发数,is_thread_op=True时,为线程并发;否则为进程并发 +concurrency: 7 + +devices: "0,1,2" +``` + +### 异构硬件 + +Python Server除了支持CPU、GPU之外,还支持多种异构硬件部署。在config.yaml中由device_type和devices控制。优先使用device_type指定,当其空缺时根据devices自动判断类型。device_type描述如下: +- CPU(Intel) : 0 +- GPU : 1 +- TensorRT : 2 +- CPU(Arm) : 3 +- XPU : 4 + +config.yml中硬件配置: +```YAML +#计算硬件类型: 空缺时由devices决定(CPU/GPU),0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu +device_type: 0 +#计算硬件ID,优先由device_type决定硬件类型。devices为""或空缺时为CPU预测;当为"0", "0,1,2"时为GPU预测,表示使用的GPU卡 +devices: "" # "0,1" +``` + +### 低精度推理 + +Python Serving支持低精度推理,CPU、GPU和TensoRT支持的精度类型如下所示: +- CPU + - fp32(default) + - fp16 + - bf16(mkldnn) +- GPU + - fp32(default) + - fp16(TRT下有效) + - int8 +- Tensor RT + - fp32(default) + - fp16 + - int8 + +```YAML +#precsion, 预测精度,降低预测精度可提升预测速度 +#GPU 支持: "fp32"(default), "fp16(TensorRT)", "int8"; +#CPU 支持: "fp32"(default), "fp16", "bf16"(mkldnn); 不支持: "int8" +precision: "fp32" +``` \ No newline at end of file