# Serving Configuration

([简体中文](./Serving_Configure_CN.md)|English)

## Overview

This guide focuses on Paddle C++ Serving and Python Pipeline configuration:

- [Model Configuration](#model-configuration): Auto-generated when converting a model. Specifies the model input/output.
- [C++ Serving](#c-serving): High-performance scenarios. Shows how to start quickly and how to start with a user-defined configuration.
- [Python Pipeline](#python-pipeline): Scenarios that combine multiple models.

## Model Configuration

The model configuration is generated when converting a PaddleServing model and is named serving_client_conf.prototxt/serving_server_conf.prototxt. It specifies the input/output info so that users can fill in parameters easily. The model configuration file should not be modified. See the [Saving guide](./Save_EN.md) for model conversion.

The model configuration file provided must be a `core/configure/proto/general_model_config.proto`.

Example:

```
feed_var {
  name: "x"
  alias_name: "x"
  is_lod_tensor: false
  feed_type: 1
  shape: 13
}
fetch_var {
  name: "concat_1.tmp_0"
  alias_name: "concat_1.tmp_0"
  is_lod_tensor: false
  fetch_type: 1
  shape: 3
  shape: 640
  shape: 640
}
```

- feed_var: model input
- fetch_var: model output
- name: node name
- alias_name: alias name
- is_lod_tensor: LoD tensor, refer to [LoD Introduction](./LOD_EN.md)
- feed_type/fetch_type: data type

|feed_type|Type|
|---------|----|
|0|INT64|
|1|FLOAT32|
|2|INT32|
|3|FP64|
|4|INT16|
|5|FP16|
|6|BF16|
|7|UINT8|
|8|INT8|

- shape: tensor shape
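The alias names above are exactly what a client fills in and fetches. As a quick, hedged illustration (the file path, endpoint and tensor values are assumptions, and the exact `predict` signature may differ slightly between paddle_serving_client versions), a client could use this configuration as follows:

```python
# A minimal sketch of using the generated client configuration; assumes a
# C++ Serving instance (see the quick start below) is listening on port 9393.
import numpy as np
from paddle_serving_client import Client

client = Client()
# The client config carries the names, types and shapes from feed_var/fetch_var.
client.load_client_config("serving_client_conf.prototxt")
client.connect(["127.0.0.1:9393"])

# Keys of `feed` are feed_var alias names; `fetch` lists fetch_var alias names.
x = np.random.rand(13).astype("float32")  # feed_type 1 = FLOAT32, shape: 13
fetch_map = client.predict(feed={"x": x}, fetch=["concat_1.tmp_0"])
print(fetch_map)
```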
## C++ Serving

### 1. Quick start and stop

The easiest way to start C++ Serving is to provide the `--model` and `--port` flags.

Example starting C++ Serving:
```BASH
python3 -m paddle_serving_server.serve --model serving_model --port 9393
```

This command will generate the server configuration files under `workdir_9393`:

```
workdir_9393
├── general_infer_0
│   ├── fluid_time_file
│   ├── general_model.prototxt
│   └── model_toolkit.prototxt
├── infer_service.prototxt
├── resource.prototxt
└── workflow.prototxt
```

More flags:

| Argument | Type | Default | Description |
| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
| `thread` | int | `2` | Number of brpc service threads |
| `op_num` | int[]| `0` | Thread number for each model in asynchronous mode |
| `op_max_batch` | int[]| `32` | Batch number for each model in asynchronous mode |
| `gpu_ids` | str[]| `"-1"` | GPU card id for each model |
| `port` | int | `9292` | Exposed port of current service to users |
| `model` | str[]| `""` | Path of paddle model directory to be served |
| `mem_optim_off` | - | - | Disable memory / graphic memory optimization |
| `ir_optim` | bool | False | Enable analysis and optimization of calculation graph |
| `use_mkl` (Only for cpu version) | - | - | Run inference with MKL. Requires ir_optim to be enabled. |
| `use_trt` (Only for trt version) | - | - | Run inference with TensorRT. Requires ir_optim to be enabled. |
| `use_lite` (Only for Intel x86 CPU or ARM CPU) | - | - | Run PaddleLite inference. Requires ir_optim to be enabled. |
| `use_xpu` | - | - | Run PaddleLite inference with Baidu Kunlun XPU. Requires ir_optim to be enabled. |
| `precision` | str | FP32 | Precision mode, supports FP32, FP16, INT8 |
| `use_calib` | bool | False | Use TRT int8 calibration |
| `gpu_multi_stream` | bool | False | Enable GPU multi-stream mode to get higher QPS |

#### Serving a model with multiple GPUs
```BASH
python3 -m paddle_serving_server.serve --model serving_model --thread 10 --port 9292 --gpu_ids 0,1,2
```

#### Serving two models
```BASH
python3 -m paddle_serving_server.serve --model serving_model_1 serving_model_2 --thread 10 --port 9292
```

#### Stop Serving (execute the following command in the directory where serving was started, or in the path set by the environment variable SERVING_HOME)
```BASH
python3 -m paddle_serving_server.serve stop
```
`stop` sends SIGINT to C++ Serving; passing `kill` instead sends SIGKILL.

### 2. Starting with user-defined configuration

In most cases the flags above meet the demand. However, users can also modify the configuration files, which include service.prototxt, workflow.prototxt, resource.prototxt, model_toolkit.prototxt and proj.conf.

Example starting with a user-defined config:

```BASH
/bin/serving --flagfile=proj.conf
```

#### 2.1 proj.conf

You can provide proj.conf with lots of flags:
```
# for paddle inference
--precision=fp32
--use_calib=False
--reload_interval_s=10
# for brpc
--max_concurrency=0
--num_threads=10
--bthread_concurrency=10
--max_body_size=536870912
# default path
--inferservice_path=conf
--inferservice_file=infer_service.prototxt
--resource_path=conf
--resource_file=resource.prototxt
--workflow_path=conf
--workflow_file=workflow.prototxt
```

The table below sets out the detailed description:

| name | Default | Description |
|------|---------|-------------|
|precision|"fp32"|Precision mode, supports FP32, FP16, INT8|
|use_calib|False|Only for deployment with TensorRT|
|reload_interval_s|10|Reload interval|
|max_concurrency|0|Limit of requests processed in parallel, 0: unlimited|
|num_threads|10|Number of brpc service threads|
|bthread_concurrency|10|Number of bthreads|
|max_body_size|536870912|Max size of brpc message|
|inferservice_path|"conf"|Path of inferservice conf|
|inferservice_file|"infer_service.prototxt"|Filename of inferservice conf|
|resource_path|"conf"|Path of resource conf|
|resource_file|"resource.prototxt"|Filename of resource conf|
|workflow_path|"conf"|Path of workflow conf|
|workflow_file|"workflow.prototxt"|Filename of workflow conf|

#### 2.2 service.prototxt

To set the listening port, modify service.prototxt. You can set `--inferservice_path` and `--inferservice_file` to instruct the server where to look for service.prototxt.

The `service.prototxt` file provided must be a `core/configure/server_configure.proto:InferServiceConf`.

```
port: 8010
services {
  name: "GeneralModelService"
  workflows: "workflow1"
}
```

- port: Listening port.
- services: No need to modify. The workflow1 is defined in workflow.prototxt.

#### 2.3 workflow.prototxt

To serve user-defined OPs, modify workflow.prototxt. You can set `--workflow_path` and `--workflow_file` to instruct the server where to look for workflow.prototxt.

The `workflow.prototxt` provided must be a `core/configure/server_configure.proto:Workflow`.

In the example below, the model is served with 3 OPs: GeneralReaderOp converts the input data to tensors, GeneralInferOp runs prediction on the output of GeneralReaderOp, and GeneralResponseOp returns the output data.

```
workflows {
  name: "workflow1"
  workflow_type: "Sequence"
  nodes {
    name: "general_reader_0"
    type: "GeneralReaderOp"
  }
  nodes {
    name: "general_infer_0"
    type: "GeneralInferOp"
    dependencies {
      name: "general_reader_0"
      mode: "RO"
    }
  }
  nodes {
    name: "general_response_0"
    type: "GeneralResponseOp"
    dependencies {
      name: "general_infer_0"
      mode: "RO"
    }
  }
}
```
- name: The name of the workflow.
- workflow_type: "Sequence"
- nodes: A workflow consists of nodes.
- node.name: The name of the node, corresponding to the node type. Refer to `python/paddle_serving_server/dag.py`.
- node.type: The bound operator. Refer to the OPs in `serving/op`.
- node.dependencies: The list of upstream dependent operators.
- node.dependencies.name: The names of the dependent operators.
- node.dependencies.mode: RO-Read Only, RW-Read Write
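For reference, the same reader, infer and response sequence can also be assembled from Python instead of editing workflow.prototxt by hand. The sketch below is hedged: it assumes the `OpMaker`/`OpSeqMaker`/`Server` helpers exported by `paddle_serving_server` in your installed version, and the model path and flag values are illustrative.

```python
# A hedged sketch of building the three-OP sequence programmatically;
# assumes paddle_serving_server exposes OpMaker/OpSeqMaker/Server as in
# its bundled examples.
from paddle_serving_server import OpMaker, OpSeqMaker, Server

op_maker = OpMaker()
read_op = op_maker.create("general_reader")        # GeneralReaderOp
infer_op = op_maker.create("general_infer")        # GeneralInferOp
response_op = op_maker.create("general_response")  # GeneralResponseOp

# Chain the OPs in the same order as workflow1 above.
op_seq_maker = OpSeqMaker()
op_seq_maker.add_op(read_op)
op_seq_maker.add_op(infer_op)
op_seq_maker.add_op(response_op)

server = Server()
server.set_op_sequence(op_seq_maker.get_op_sequence())
server.set_num_threads(10)
server.load_model_config("serving_model")          # illustrative model path
server.prepare_server(workdir="workdir_9393", port=9393, device="cpu")
server.run_server()
```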
#### 2.4 resource.prototxt

You may modify resource.prototxt to set the path of model files. You can set `--resource_path` and `--resource_file` to instruct the server where to look for resource.prototxt.

The `resource.prototxt` provided must be a `core/configure/server_configure.proto:ResourceConf`.

```
model_toolkit_path: "conf"
model_toolkit_file: "general_infer_0/model_toolkit.prototxt"
general_model_path: "conf"
general_model_file: "general_infer_0/general_model.prototxt"
```

- model_toolkit_path: The directory path of model_toolkit.prototxt.
- model_toolkit_file: The file name of model_toolkit.prototxt.
- general_model_path: The directory path of general_model.prototxt.
- general_model_file: The file name of general_model.prototxt.

#### 2.5 model_toolkit.prototxt

The model_toolkit.prototxt specifies the parameters of the predictor engines.

The `model_toolkit.prototxt` provided must be a `core/configure/server_configure.proto:ModelToolkitConf`.

Example using the cpu engine:

```
engines {
  name: "general_infer_0"
  type: "PADDLE_INFER"
  reloadable_meta: "uci_housing_model/fluid_time_file"
  reloadable_type: "timestamp_ne"
  model_dir: "uci_housing_model"
  gpu_ids: -1
  enable_memory_optimization: true
  enable_ir_optimization: false
  use_trt: false
  use_lite: false
  use_xpu: false
  use_gpu: false
  combined_model: false
  gpu_multi_stream: false
  runtime_thread_num: 0
  batch_infer_size: 32
  enable_overrun: false
  allow_split_request: true
}
```

- name: The name of the engine, corresponding to the node name in workflow.prototxt.
- type: Only "PADDLE_INFER" is supported.
- reloadable_meta: Specify the mark file used to trigger reloading.
- reloadable_type: Supports timestamp_ne/timestamp_gt/md5sum/revision/none

|reloadable_type|Description|
|---------------|-----------|
|timestamp_ne|Reload when the mtime of the reloadable_meta file changes|
|timestamp_gt|Reload when the mtime of the reloadable_meta file is greater than the last record|
|md5sum|Not used|
|revision|Not used|

- model_dir: The path of model files.
- gpu_ids: Specify the gpu ids. Multiple device ids are supported:
```
# GPU0,1,2
gpu_ids: 0
gpu_ids: 1
gpu_ids: 2
```
- enable_memory_optimization: Enable memory optimization.
- enable_ir_optimization: Enable IR optimization.
- use_trt: Enable TensorRT. Requires use_gpu to be on.
- use_lite: Enable PaddleLite.
- use_xpu: Enable KUNLUN XPU.
- use_gpu: Enable GPU.
- combined_model: Enable combined model.
- gpu_multi_stream: Enable GPU multi-stream mode.
- runtime_thread_num: Async mode is enabled when the number is greater than 0, and that many predictors are created.
- batch_infer_size: The max batch size of async mode.
- enable_overrun: Enable overrun in async mode, which means putting the whole task into the task queue.
- allow_split_request: Allow splitting request tasks in async mode.

#### 2.6 general_model.prototxt

The content of general_model.prototxt is the same as serving_server_conf.prototxt.
```
feed_var {
  name: "x"
  alias_name: "x"
  is_lod_tensor: false
  feed_type: 1
  shape: 13
}
fetch_var {
  name: "fc_0.tmp_1"
  alias_name: "price"
  is_lod_tensor: false
  fetch_type: 1
  shape: 1
}
```

## Python Pipeline

### Quick start and stop

Example starting Pipeline Serving:

```BASH
python3 web_service.py
```

### Stop Serving (execute the following command in the directory where Pipeline Serving was started, or in the path set by the environment variable SERVING_HOME)

```BASH
python3 -m paddle_serving_server.serve stop
```
`stop` sends SIGINT to Pipeline Serving; passing `kill` instead sends SIGKILL.

### yml Configuration

Python Pipeline provides a user-friendly programming framework for multi-model composite services.

Example of config.yaml:
```YAML
#RPC port. The RPC port and HTTP port cannot be empty at the same time. If the RPC port is empty and the HTTP port is not empty, the RPC port is automatically set to HTTP port+1.
rpc_port: 18090

#HTTP port. The RPC port and the HTTP port cannot be empty at the same time. If the RPC port is available and the HTTP port is empty, the HTTP port is not automatically generated.
http_port: 9999

#worker_num, the maximum concurrency.
#When build_dag_each_worker=True, the server will create processes within the GRPC server and DAG.
#When build_dag_each_worker=False, the server will set the threadpool of the GRPC server.
worker_num: 20

#build_dag_each_worker: False, create processes with a single DAG; True, create processes each with an independent DAG.
build_dag_each_worker: false

dag:
    #True, thread model; False, process model
    is_thread_op: False

    #retry times
    retry: 1

    #True, generate TimeLine data; False, do not
    use_profile: false
    tracer:
        interval_s: 10

op:
    det:
        #concurrency: thread number when is_thread_op=True, otherwise process number
        concurrency: 6

        #Load the local server configuration when server_endpoints is not set.
        local_service_conf:
            #client type, one of brpc, grpc and local_predictor.
            client_type: local_predictor

            #det model path
            model_config: ocr_det_model

            #Fetch data list
            fetch_list: ["concat_1.tmp_0"]

            #device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
            device_type: 0

            #Device ID
            devices: ""

            #use_mkldnn, when running with mkldnn, ir_optim must be True
            #use_mkldnn: True

            #ir_optim, when running with TensorRT, ir_optim must be True
            ir_optim: True

            #precision, lowering the precision can increase speed
            #GPU supports: "fp32"(default), "fp16", "int8";
            #CPU supports: "fp32"(default), "fp16", "bf16"(mkldnn); not supported: "int8"
            precision: "fp32"
    rec:
        #concurrency: thread number when is_thread_op=True, otherwise process number
        concurrency: 3

        #timeout, in ms
        timeout: -1

        #retry times
        retry: 1

        #Load the local server configuration when server_endpoints is not set.
        local_service_conf:
            #client type, one of brpc, grpc and local_predictor.
            client_type: local_predictor

            #rec model path
            model_config: ocr_rec_model

            #Fetch data list
            fetch_list: ["ctc_greedy_decoder_0.tmp_0", "softmax_0.tmp_0"]

            #device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
            device_type: 0

            #Device ID
            devices: ""

            #use_mkldnn, when running with mkldnn, ir_optim must be True
            #use_mkldnn: True

            #ir_optim, when running with TensorRT, ir_optim must be True
            ir_optim: True

            #precision, lowering the precision can increase speed
            #GPU supports: "fp32"(default), "fp16", "int8";
            #CPU supports: "fp32"(default), "fp16", "bf16"(mkldnn); not supported: "int8"
            precision: "fp32"
```
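The `det` and `rec` entries above correspond to the OPs created in the service script started with `python3 web_service.py`. The sketch below is a rough outline only: it assumes the `WebService`/`Op` classes from `paddle_serving_server.web_service` behave as in the bundled pipeline examples, and it relies on their default preprocess/postprocess behaviour.

```python
# web_service.py -- a minimal sketch of a two-OP pipeline wired to the
# config.yml above; preprocess/postprocess hooks are omitted and the
# defaults are assumed to pass request fields and fetch_list through.
from paddle_serving_server.web_service import WebService, Op

class DetOp(Op):
    pass  # override preprocess/postprocess here for real det pre/post handling

class RecOp(Op):
    pass  # override preprocess/postprocess here for real rec pre/post handling

class OcrService(WebService):
    def get_pipeline_response(self, read_op):
        det_op = DetOp(name="det", input_ops=[read_op])  # matches op.det in config.yml
        rec_op = RecOp(name="rec", input_ops=[det_op])   # matches op.rec in config.yml
        return rec_op

ocr_service = OcrService(name="ocr")
ocr_service.prepare_pipeline_config("config.yml")
ocr_service.run_service()
```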
### Single-machine and multi-card inference

Single-machine multi-card inference can be abstracted into M OP processes bound to N GPU cards. It is controlled by three parameters in config.yml.

First, select the process mode (`is_thread_op: False`); `concurrency` is then the number of OP processes and `devices` lists the GPU card IDs. Processes are bound to cards in round-robin order as they start. For example, when 7 OP processes are started with `devices: "0,1,2"` in config.yml, the 1st, 4th and 7th processes are bound to card 0, the 2nd and 5th to card 1, and the 3rd and 6th to card 2.

Reference config.yaml:
```YAML
#True, thread model; False, process model
is_thread_op: False

#concurrency: thread number when is_thread_op=True, otherwise process number
concurrency: 7

devices: "0,1,2"
```

### Heterogeneous Devices

In addition to supporting CPU and GPU, Pipeline also supports deployment on a variety of heterogeneous hardware. This is controlled by device_type and devices in config.yml. device_type specifies the device type; when it is left empty, the type is inferred from devices. The device_type values are:

- CPU(Intel) : 0
- GPU : 1
- TensorRT : 2
- CPU(Arm) : 3
- XPU : 4

Reference config.yaml:
```YAML
#device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
device_type: 0
devices: "" # "0,1"
```

### Low precision inference

Python Pipeline supports low-precision inference. The precision types supported by CPU, GPU and TensorRT are listed below:

- CPU
  - fp32 (default)
  - fp16
  - bf16 (mkldnn)
- GPU
  - fp32 (default)
  - fp16 (takes effect with TensorRT)
  - int8
- TensorRT
  - fp32 (default)
  - fp16
  - int8

```YAML
#precision
#GPU supports: "fp32"(default), "fp16(TensorRT)", "int8";
#CPU supports: "fp32"(default), "fp16", "bf16"(mkldnn); not supported: "int8"
precision: "fp32"
```