# Serving Configuration

([简体中文](Serving_Configure_CN.md)|English)

## Overview

This guide focuses on Paddle C++ Serving and Python Pipeline configuration:

- [Model Configuration](#model-configuration): Auto-generated when converting a model. Specifies model input/output.
- [C++ Serving](#c-serving): High-performance scenarios. Covers the quick start and starting with a user-defined configuration.
- [Python Pipeline](#python-pipeline): Multi-model combined scenarios.

## Model Configuration

The model configuration files, serving_client_conf.prototxt and serving_server_conf.prototxt, are generated automatically when a model is converted to the PaddleServing format. They describe the model's inputs and outputs so that users can fill in request parameters easily, and they should not be modified by hand. See the [Saving guide](Save_EN.md) for model conversion. The model configuration file must conform to `core/configure/proto/general_model_config.proto`.
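
For reference, generating these two files can also be driven from Python. The snippet below is only a hedged sketch following the [Saving guide](Save_EN.md): the directory names are placeholders and the exact API may differ between Serving versions.

```python
# Hedged sketch: produce serving_server_conf.prototxt / serving_client_conf.prototxt
# from an exported Paddle inference model. All directory names are placeholders.
import paddle_serving_client.io as serving_io

serving_io.inference_model_to_serving(
    dirname="inference_model",        # exported Paddle inference model directory
    serving_server="serving_server",  # output: server-side model + serving_server_conf.prototxt
    serving_client="serving_client",  # output: serving_client_conf.prototxt
    model_filename=None,
    params_filename=None)
```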

Example:

```
feed_var {
  name: "x"
  alias_name: "x"
  is_lod_tensor: false
  feed_type: 1
  shape: 13
}
fetch_var {
  name: "concat_1.tmp_0"
  alias_name: "concat_1.tmp_0"
  is_lod_tensor: false
  fetch_type: 1
  shape: 3
  shape: 640
  shape: 640
}
```

- feed_var: model input
- fetch_var: model output
- name: node name
- alias_name: alias name
- is_lod_tensor: lod tensor, refer to [Lod Introduction](LOD_EN.md)
- feed_type/fetch_type: data type

|feed_type|Type|
|---------|----|
|0|INT64|
|1|FLOAT32|
|2|INT32|
|3|FP64|
|4|INT16|
|5|FP16|
|6|BF16|
|7|UINT8|
|8|INT8|

- shape: tensor shape
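
The alias names, types and shapes above are what a client uses to build a request. The snippet below is a hedged, illustrative client sketch (not part of the original guide): it assumes a C++ Serving instance is already listening on 127.0.0.1:9393 and simply reuses the field names from the example configuration; the input values are placeholders.

```python
# Hedged client sketch based on the example model configuration above.
import numpy as np
from paddle_serving_client import Client

client = Client()
client.load_client_config("serving_client/serving_client_conf.prototxt")
client.connect(["127.0.0.1:9393"])

# feed_type 1 means FLOAT32 and "shape: 13" means 13 features per sample;
# the dict key is the alias_name of the feed_var.
x = np.random.rand(1, 13).astype("float32")
fetch_map = client.predict(feed={"x": x}, fetch=["concat_1.tmp_0"], batch=True)
print(fetch_map)
```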

## C++ Serving

### 1. Quick start

The easiest way to start C++ Serving is to provide only the `--model` and `--port` flags.

Example of starting C++ Serving:
```BASH
python3 -m paddle_serving_server.serve --model serving_model --port 9393
```

This command generates the server configuration files under `workdir_9393`:

```
workdir_9393
├── general_infer_0
│   ├── fluid_time_file
│   ├── general_model.prototxt
│   └── model_toolkit.prototxt
├── infer_service.prototxt
├── resource.prototxt
└── workflow.prototxt
```

More flags:
| Argument                                       | Type | Default | Description                                           |
| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
| `thread`                                       | int  | `2`     | Number of brpc service threads                        |
| `op_num`                                       | int[]| `0`     | Thread Number for each model in asynchronous mode     |
| `op_max_batch`                                 | int[]| `32`    | Batch Number for each model in asynchronous mode      |
| `gpu_ids`                                      | str[]| `"-1"`  | Gpu card id for each model                            |
| `port`                                         | int  | `9292`  | Exposed port of current service to users              |
| `model`                                        | str[]| `""`    | Path of paddle model directory to be served           |
| `mem_optim_off`                                | -    | -       | Disable memory / graphic memory optimization          |
| `ir_optim`                                     | bool | False   | Enable analysis and optimization of calculation graph |
| `use_mkl` (Only for cpu version)               | -    | -       | Run inference with MKL                                |
| `use_trt` (Only for trt version)               | -    | -       | Run inference with TensorRT. Requires ir_optim to be enabled.                    |
| `use_lite` (Only for Intel x86 CPU or ARM CPU) | -    | -       | Run PaddleLite inference. Requires ir_optim to be enabled.                       |
| `use_xpu`                                      | -    | -       | Run PaddleLite inference with Baidu Kunlun XPU. Requires ir_optim to be enabled. |
| `precision`                                    | str  | FP32    | Precision Mode, support FP32, FP16, INT8              |
| `use_calib`                                    | bool | False   | Use TRT int8 calibration                              |
| `gpu_multi_stream`                             | bool | False   | Enable GPU multi-stream mode to get higher QPS        |

#### Serving a model with multiple GPUs.
```BASH
python3 -m paddle_serving_server.serve --model serving_model --thread 10 --port 9292 --gpu_ids 0,1,2
```
#### Serving two models.
```BASH
python3 -m paddle_serving_server.serve --model serving_model_1 serving_model_2 --thread 10 --port 9292
```

### 2. Starting with user-defined Configuration

In most cases, the command-line flags are sufficient. However, the server configuration files can also be edited directly by the user, including service.prototxt, workflow.prototxt, resource.prototxt, model_toolkit.prototxt and proj.conf.

Example of starting with a user-defined configuration:

```BASH
/bin/serving --flagfile=proj.conf
```

#### 2.1 proj.conf

You can provide proj.conf with lots of flags:
```
# for paddle inference
--precision=fp32
--use_calib=False
--reload_interval_s=10
# for brpc
--max_concurrency=0
--num_threads=10
--bthread_concurrency=10
--max_body_size=536870912
# default path
--inferservice_path=conf
--inferservice_file=infer_service.prototxt
--resource_path=conf
--resource_file=resource.prototxt
--workflow_path=conf
--workflow_file=workflow.prototxt
```

The table below sets out the detailed description:
| name | Default | Description |
|------|--------|------|
|precision|"fp32"|Precision Mode, support FP32, FP16, INT8|
|use_calib|False|Only for deployment with TensorRT|
|reload_interval_s|10|Reload interval|
|max_concurrency|0|Limit of request processing in parallel, 0: unlimited|
|num_threads|10|Number of brpc service threads|
|bthread_concurrency|10|Number of bthread|
|max_body_size|536870912|Max size of brpc message|
|inferservice_path|"conf"|Path of inferservice conf|
|inferservice_file|"infer_service.prototxt"|Filename of inferservice conf|
|resource_path|"conf"|Path of resource conf|
|resource_file|"resource.prototxt"|Filename of resource conf|
|workflow_path|"conf"|Path of workflow conf|
|workflow_file|"workflow.prototxt"|Filename of workflow conf|

#### 2.2 service.prototxt

To set the listening port, modify service.prototxt. You can set the `--inferservice_path` and `--inferservice_file` flags to instruct the server where to find service.prototxt. The `service.prototxt` provided must conform to `core/configure/server_configure.protobuf:InferServiceConf`.

```
port: 8010
services {
  name: "GeneralModelService"
  workflows: "workflow1"
}
```

- port: Listening port.
- services: No need to modify. The workflow1 is defined in workflow.prototxt.

#### 2.3 workflow.prototxt

To serve a user-defined OP, modify workflow.prototxt. You can set the `--workflow_path` and `--workflow_file` flags to instruct the server where to find workflow.prototxt. The `workflow.prototxt` provided must conform to `core/configure/server_configure.protobuf:Workflow`.

In the example below, the model is served with 3 OPs. The GeneralReaderOp converts the input data to tensors. The GeneralInferOp, which depends on the output of GeneralReaderOp, runs prediction on the tensors. The GeneralResponseOp returns the output data.

```
workflows {
  name: "workflow1"
  workflow_type: "Sequence"
  nodes {
    name: "general_reader_0"
    type: "GeneralReaderOp"
  }
  nodes {
    name: "general_infer_0"
    type: "GeneralInferOp"
    dependencies {
      name: "general_reader_0"
      mode: "RO"
    }
  }
  nodes {
    name: "general_response_0"
    type: "GeneralResponseOp"
    dependencies {
      name: "general_infer_0"
      mode: "RO"
    }
  }
}
```

- name: The name of the workflow.
- workflow_type: "Sequence"
- nodes: A workflow consists of nodes.
- node.name: The name of the node, corresponding to the node type. Refer to `python/paddle_serving_server/dag.py`.
- node.type: The bound operator. Refer to the OPs in `serving/op`.
- node.dependencies: The list of upstream dependent operators.
- node.dependencies.name: The name of the dependent operator.
- node.dependencies.mode: RO-Read Only, RW-Read Write
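
The same three-node workflow can also be assembled with the low-level Python API instead of editing workflow.prototxt by hand. The snippet below is a hedged sketch: the model directory, workdir and port are placeholders, and the API surface may differ slightly between Serving versions.

```python
# Hedged sketch: build the reader -> infer -> response sequence in Python.
from paddle_serving_server import OpMaker, OpSeqMaker, Server

op_maker = OpMaker()
read_op = op_maker.create('general_reader')        # GeneralReaderOp
infer_op = op_maker.create('general_infer')        # GeneralInferOp
response_op = op_maker.create('general_response')  # GeneralResponseOp

op_seq_maker = OpSeqMaker()
op_seq_maker.add_op(read_op)   # nodes are chained in the order they are added
op_seq_maker.add_op(infer_op)
op_seq_maker.add_op(response_op)

server = Server()
server.set_op_sequence(op_seq_maker.get_op_sequence())
server.load_model_config("serving_model")  # placeholder model directory
server.prepare_server(workdir="workdir_9393", port=9393, device="cpu")
server.run_server()
```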

#### 2.4 resource.prototxt

You may modify resource.prototxt to set the path of the model files. You can set the `--resource_path` and `--resource_file` flags to instruct the server where to find resource.prototxt. The `resource.prototxt` provided must conform to `core/configure/server_configure.proto:ResourceConf`.

```
model_toolkit_path: "conf"
model_toolkit_file: "general_infer_0/model_toolkit.prototxt"
general_model_path: "conf"
general_model_file: "general_infer_0/general_model.prototxt"
```

- model_toolkit_path: The directory path of model_toolkit.prototxt.
- model_toolkit_file: The file name of model_toolkit.prototxt.
- general_model_path: The directory path of general_model.prototxt.
- general_model_file: The file name of general_model.prototxt.

#### 2.5 model_toolkit.prototxt

The model_toolkit.prototxt specifies the parameters of the predictor engines. The `model_toolkit.prototxt` provided must conform to `core/configure/server_configure.proto:ModelToolkitConf`.

Example using the CPU engine:

```
engines {
  name: "general_infer_0"
  type: "PADDLE_INFER"
  reloadable_meta: "uci_housing_model/fluid_time_file"
  reloadable_type: "timestamp_ne"
  model_dir: "uci_housing_model"
  gpu_ids: -1
  enable_memory_optimization: true
  enable_ir_optimization: false
  use_trt: false
  use_lite: false
  use_xpu: false
  use_gpu: false
  combined_model: false
  gpu_multi_stream: false
  runtime_thread_num: 0
  batch_infer_size: 32
  enable_overrun: false
  allow_split_request: true
}
```

- name: The name of the engine, corresponding to the node name in workflow.prototxt.
- type: Only "PADDLE_INFER" is supported.
- reloadable_meta: The marker file used to detect model reloads.
- reloadable_type: Supports timestamp_ne/timestamp_gt/md5sum/revision/none

|reloadable_type|Description|
|---------------|----|
|timestamp_ne|Reload when the mtime of the reloadable_meta file changes|
|timestamp_gt|Reload when the mtime of the reloadable_meta file is greater than the last record|
|md5sum|Not used|
|revision|Not used|

- model_dir: The path of model files.
- gpu_ids: Specify the GPU ids. Multiple device ids are supported:
```
# GPU0,1,2
gpu_ids: 0
gpu_ids: 1
gpu_ids: 2
```
- enable_memory_optimization: Enable memory optimization.
- enable_ir_optimization: Enable IR optimization.
- use_trt: Enable TensorRT. Requires use_gpu to be enabled.
- use_lite: Enable PaddleLite.
- use_xpu: Enable KUNLUN XPU.
- use_gpu: Enable GPU.
- combined_model: Enable combined model.
- gpu_multi_stream: Enable GPU multi-stream mode.
- runtime_thread_num: A value greater than 0 enables asynchronous mode and sets the number of predictors created.
- batch_infer_size: The max batch size in asynchronous mode.
- enable_overrun: Enable overrun in asynchronous mode, which means putting the whole task into the task queue.
- allow_split_request: Allow splitting request tasks in asynchronous mode.

#### 2.6 general_model.prototxt

The content of general_model.prototxt is the same as serving_server_conf.prototxt.

```
feed_var {
  name: "x"
  alias_name: "x"
  is_lod_tensor: false
  feed_type: 1
  shape: 13
}
fetch_var {
  name: "fc_0.tmp_1"
  alias_name: "price"
  is_lod_tensor: false
  fetch_type: 1
  shape: 1
}
```

## Python Pipeline

Python Pipeline provides a user-friendly programming framework for multi-model composite services.

Example of config.yaml:
```YAML
#RPC port. The RPC port and HTTP port cannot be empty at the same time. If the RPC port is empty and the HTTP port is not empty, the RPC port is automatically set to HTTP port + 1.
rpc_port: 18090

#HTTP port. The RPC port and the HTTP port cannot be empty at the same time. If the RPC port is available and the HTTP port is empty, the HTTP port is not automatically generated
http_port: 9999

#worker_num, the maximum concurrency.
#When build_dag_each_worker=True, the server creates processes within the gRPC server and DAG.
#When build_dag_each_worker=False, the server sets the thread pool of the gRPC server.
worker_num: 20

#build_dag_each_worker, False: create process with DAG; True: create processes with multiple independent DAGs
build_dag_each_worker: false

dag:
    #True, thread model; False, process model
    is_thread_op: False

    #retry times
    retry: 1

    # True, generate the TimeLine data; False, do not
    use_profile: false
    tracer:
        interval_s: 10

op:
    det:
        #concurrency; if is_thread_op=True these are threads, otherwise processes
        concurrency: 6

        #Local service configuration, used when server_endpoints is not set.
        local_service_conf:
            #client type, including brpc, grpc and local_predictor.
            client_type: local_predictor

            #det model path
            model_config: ocr_det_model

            #Fetch data list
            fetch_list: ["concat_1.tmp_0"]

            #Device ID
            devices: ""

            # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
            device_type: 0

            #use_mkldnn
            #use_mkldnn: True

            #ir_optim
            ir_optim: True
    rec:
        #concurrency; if is_thread_op=True these are threads, otherwise processes
        concurrency: 3

        #timeout, in ms
        timeout: -1

        #retry times
        retry: 1

        #Local service configuration, used when server_endpoints is not set.
        local_service_conf:

            #client type, including brpc, grpc and local_predictor.
            client_type: local_predictor

            #rec model path
            model_config: ocr_rec_model

            #Fetch data list
            fetch_list: ["ctc_greedy_decoder_0.tmp_0", "softmax_0.tmp_0"]

            #Device ID
            devices: ""

            # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
            device_type: 0

            #use_mkldnn
            #use_mkldnn: True

            #ir_optim
            ir_optim: True
```
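
A pipeline configured as above can be called through its RPC port. The snippet below is a hedged client sketch: it assumes the service runs locally with rpc_port 18090 and that the DAG accepts an input key named "image" and returns a key named "res"; adjust the keys and file path to your own pipeline.

```python
# Hedged pipeline client sketch; the address, keys and image path are placeholders.
import base64

from paddle_serving_server.pipeline import PipelineClient

client = PipelineClient()
client.connect(["127.0.0.1:18090"])

with open("test.jpg", "rb") as f:
    image = base64.b64encode(f.read()).decode("utf8")

ret = client.predict(feed_dict={"image": image}, fetch=["res"])
print(ret)
```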

### Single-machine and multi-card inference

Single-machine multi-card inference can be abstracted as M OP processes bound to N GPU cards. It is controlled by three parameters in config.yml: first select the process mode, then concurrency is the number of processes and devices lists the GPU card ids. Processes are bound to the cards in round-robin order as they start. For example, if 7 OP processes are started with devices: "0,1,2" in config.yml, the first, fourth and seventh processes are bound to card 0, the second and fifth to card 1, and the third and sixth to card 2.

Reference config.yaml:
```YAML
#True, thread model; False, process model
is_thread_op: False

#concurrency; if is_thread_op=True these are threads, otherwise processes
concurrency: 7

devices: "0,1,2"
```
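
The binding rule described above is a plain round-robin over the configured cards. The short illustration below (ordinary Python, not Serving code) reproduces the mapping for 7 processes and devices "0,1,2".

```python
# Illustration of the round-robin binding: 7 OP processes over devices "0,1,2".
devices = [int(d) for d in "0,1,2".split(",")]
concurrency = 7
for idx in range(concurrency):
    print(f"process {idx + 1} -> GPU card {devices[idx % len(devices)]}")
# processes 1, 4, 7 -> card 0; processes 2, 5 -> card 1; processes 3, 6 -> card 2
```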

### Heterogeneous Devices

In addition to supporting CPU and GPU, Pipeline also supports deployment on a variety of heterogeneous hardware. This is controlled by device_type and devices in config.yml: device_type specifies the device type, and when it is empty the type is determined from devices. The device_type values are as follows:
- CPU(Intel) : 0
- GPU : 1
- TensorRT : 2
- CPU(Arm) : 3
- XPU : 4

Reference config.yaml:
```YAML
# device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
device_type: 0
devices: "" # "0,1"
```

### Low precision inference

Python Pipeline supports low-precision inference. The precision types supported by CPU, GPU and TensorRT are listed below:
- CPU
  - fp32(default)
  - fp16
  - bf16(mkldnn)
- GPU
  - fp32(default)
  - fp16 (takes effect with TensorRT)
  - int8
- TensorRT
  - fp32(default)
  - fp16
  - int8 

```YAML
#precision
#GPU support: "fp32"(default), "fp16(TensorRT)", "int8";
#CPU support: "fp32"(default), "fp16", "bf16"(mkldnn); "int8" is not supported
precision: "fp32"
```