# Serving Configuration

([简体中文](./Serving_Configure_CN.md)|English)

## Overview

This guide focuses on Paddle C++ Serving and Python Pipeline configuration:

- [Model Configuration](#model-configuration): Auto-generated when converting a model; specifies the model input/output.
- [C++ Serving](#c-serving): High-performance scenarios. Covers the quick start and starting with user-defined configuration.
- [Python Pipeline](#python-pipeline): Multi-model combined scenarios.

## Model Configuration

The model configuration is generated when converting a PaddleServing model and is named serving_client_conf.prototxt/serving_server_conf.prototxt. It specifies the input/output info so that users can fill in parameters easily. The model configuration file should not be modified. See the [Saving guide](./Save_EN.md) for model conversion. The model configuration file provided must be a `core/configure/proto/general_model_config.proto`.
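
For reference, these configuration files are typically produced by the conversion tool. Below is a minimal sketch in which the directory names are placeholders; see the [Saving guide](./Save_EN.md) for the authoritative usage.
```BASH
# Sketch: convert a Paddle inference model into Serving format (directory names are placeholders).
# serving_server/ and serving_client/ will contain the generated *_conf.prototxt files.
python3 -m paddle_serving_client.convert \
    --dirname ./inference_model \
    --serving_server ./serving_server \
    --serving_client ./serving_client
```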

Example of the generated model configuration:

```
feed_var {
  name: "x"
  alias_name: "x"
  is_lod_tensor: false
  feed_type: 1
  shape: 13
}
fetch_var {
  name: "concat_1.tmp_0"
  alias_name: "concat_1.tmp_0"
  is_lod_tensor: false
  fetch_type: 1
  shape: 3
  shape: 640
  shape: 640
}
```

- feed_var: model input
- fetch_var: model output
- name: node name
- alias_name: alias name
- is_lod_tensor: lod tensor, ref to [Lod Introduction](./LOD_EN.md)
- feed_type/fetch_type: data type

|feed_type|Type|
|---------|----|
|0|INT64|
|1|FLOAT32|
|2|INT32|
|3|FP64|
|4|INT16|
|5|FP16|
|6|BF16|
|7|UINT8|
|8|INT8|
|20|STRING|

- shape: tensor shape (a minimal client usage sketch follows this list)
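
For illustration, a client refers to variables by their `alias_name` when building requests. Below is a minimal sketch assuming a server is already running on `127.0.0.1:9393` with the configuration above; the endpoint and input data are placeholders.
```python
# Minimal client sketch: feed/fetch by alias_name (endpoint and data are placeholders).
import numpy as np
from paddle_serving_client import Client

client = Client()
client.load_client_config("serving_client_conf.prototxt")  # generated model configuration
client.connect(["127.0.0.1:9393"])

# "x" is the feed_var alias_name (FLOAT32, shape 13); fetch by the fetch_var alias_name.
fetch_map = client.predict(feed={"x": np.random.rand(13).astype("float32")},
                           fetch=["concat_1.tmp_0"])
print(fetch_map)
```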

## C++ Serving

### 1. Quick start and stop

The easiest way to start C++ Serving is to provide the `--model` and `--port` flags.

Example starting C++ Serving:
```BASH
python3 -m paddle_serving_server.serve --model serving_model --port 9393
```

This command will generate the server configuration files under `workdir_9393`:

```
workdir_9393
├── general_infer_0
│   ├── fluid_time_file
│   ├── general_model.prototxt
│   └── model_toolkit.prototxt
├── infer_service.prototxt
├── resource.prototxt
└── workflow.prototxt
```

More flags:
| Argument                                       | Type | Default | Description                                           |
| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
| `thread`                                       | int  | `2`     | Number of brpc service threads                        |
| `runtime_thread_num`                           | int[]| `0`     | Number of threads for each model in asynchronous mode |
| `batch_infer_size`                             | int[]| `32`    | Batch size for each model in asynchronous mode        |
| `gpu_ids`                                      | str[]| `"-1"`  | GPU card ids for each model                           |
| `port`                                         | int  | `9292`  | Exposed port of current service to users              |
| `model`                                        | str[]| `""`    | Path of paddle model directory to be served           |
| `mem_optim_off`                                | -    | -       | Disable memory / graphic memory optimization          |
| `ir_optim`                                     | bool | False   | Enable analysis and optimization of calculation graph |
| `use_mkl` (Only for cpu version)               | -    | -       | Run inference with MKL. Requires ir_optim to be enabled.                        |
| `use_trt` (Only for trt version)               | -    | -       | Run inference with TensorRT. Requires ir_optim to be enabled.                    |
| `use_lite` (Only for Intel x86 CPU or ARM CPU) | -    | -       | Run PaddleLite inference. Requires ir_optim to be enabled.                      |
| `use_xpu`                                      | -    | -       | Run PaddleLite inference with Baidu Kunlun XPU. Requires ir_optim to be enabled. |
| `precision`                                    | str  | FP32    | Precision mode, supports FP32, FP16, INT8             |
| `use_calib`                                    | bool | False   | Use TRT int8 calibration                              |
| `gpu_multi_stream`                             | bool | False   | Enable GPU multi-stream mode to get larger QPS        |
| `use_ascend_cl`                                | bool | False   | Enable for ascend910; Use with use_lite for ascend310 |
| `request_cache_size`                           | int  | `0`     | Bytes size of request cache. By default, the cache is disabled |

#### Serving a model with multiple GPUs.
```BASH
python3 -m paddle_serving_server.serve --model serving_model --thread 10 --port 9292 --gpu_ids 0,1,2
```
#### Serving two models.
```BASH
python3 -m paddle_serving_server.serve --model serving_model_1 serving_model_2 --thread 10 --port 9292
```
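#### Serving models in asynchronous mode.
The `runtime_thread_num` and `batch_infer_size` flags from the table above take one value per model. The command below is a sketch only; the per-model flag syntax is assumed to follow the same space-separated convention as `--model`, so adjust it to your version if needed.
```BASH
# Sketch: two models in asynchronous mode, each with its own predictor count and batch size.
python3 -m paddle_serving_server.serve \
    --model serving_model_1 serving_model_2 \
    --port 9292 \
    --runtime_thread_num 4 4 \
    --batch_infer_size 32 32
```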
#### Stop Serving (execute the following command in the directory where serving was started, or in the path set by the environment variable SERVING_HOME).
```BASH
python3 -m paddle_serving_server.serve stop
```
`stop` sends SIGINT to C++ Serving. With `kill`, SIGKILL will be sent to C++ Serving instead.
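
For a forced shutdown, the same entry point can be invoked with `kill`; a minimal sketch of the command form described above:
```BASH
# Force-stop C++ Serving with SIGKILL instead of SIGINT.
python3 -m paddle_serving_server.serve kill
```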

### 2. Starting with user-defined Configuration

In most cases, the flags are sufficient. However, users can also modify the configuration files, which include service.prototxt, workflow.prototxt, resource.prototxt, model_toolkit.prototxt and proj.conf.

Example starting with user-defined config:

```BASH
/bin/serving --flagfile=proj.conf
```

#### 2.1 proj.conf

You can provide proj.conf with lots of flags:
```
# for paddle inference
--precision=fp32
--use_calib=False
--reload_interval_s=10
# for brpc
--max_concurrency=0
--num_threads=10
--bthread_concurrency=10
--max_body_size=536870912
# default path
--inferservice_path=conf
--inferservice_file=infer_service.prototxt
--resource_path=conf
--resource_file=resource.prototxt
--workflow_path=conf
--workflow_file=workflow.prototxt
```

The table below sets out the detailed description:
| name | Default | Description |
|------|--------|------|
|precision|"fp32"|Precision Mode, support FP32, FP16, INT8|
|use_calib|False|Only for deployment with TensorRT|
|reload_interval_s|10|Reload interval|
|max_concurrency|0|Limit of request processing in parallel, 0: unlimited|
|num_threads|10|Number of brpc service thread|
|bthread_concurrency|10|Number of bthread|
|max_body_size|536870912|Max size of brpc message|
|inferservice_path|"conf"|Path of inferservice conf|
|inferservice_file|"infer_service.prototxt"|Filename of inferservice conf|
|resource_path|"conf"|Path of resource conf|
|resource_file|"resource.prototxt"|Filename of resource conf|
|workflow_path|"conf"|Path of workflow conf|
|workflow_file|"workflow.prototxt"|Filename of workflow conf|

#### 2.2 service.prototxt

To set the listening port, modify service.prototxt. You can set the `--inferservice_path` and `--inferservice_file` flags to instruct the server to check for service.prototxt. The `service.prototxt` file provided must be a `core/configure/server_configure.protobuf:InferServiceConf`.

```
port: 8010
services {
  name: "GeneralModelService"
  workflows: "workflow1"
}
```

- port: Listening port.
- services: No need to modify. The workflow1 is defined in workflow.prototxt.

#### 2.3 workflow.prototxt

To serve user-defined OPs, modify workflow.prototxt. You can set the `--workflow_path` and `--workflow_file` flags to instruct the server to check for workflow.prototxt. The `workflow.prototxt` provided must be a `core/configure/server_configure.protobuf:Workflow`.

In the example below, the model is served with 3 OPs. The GeneralReaderOp converts the input data into tensors. The GeneralInferOp, which depends on the output of GeneralReaderOp, runs prediction on the tensors. The GeneralResponseOp returns the output data.

```
workflows {
  name: "workflow1"
  workflow_type: "Sequence"
  nodes {
    name: "general_reader_0"
    type: "GeneralReaderOp"
  }
  nodes {
    name: "general_infer_0"
    type: "GeneralInferOp"
    dependencies {
      name: "general_reader_0"
      mode: "RO"
    }
  }
  nodes {
    name: "general_response_0"
    type: "GeneralResponseOp"
    dependencies {
      name: "general_infer_0"
      mode: "RO"
    }
  }
}
```

- name: The name of workflow.
- workflow_type: "Sequence"
- nodes: A workflow consists of nodes.
- node.name: The name of the node, corresponding to the node type. Ref to `python/paddle_serving_server/dag.py`
- node.type: The bound operator. Ref to OPS in `serving/op`.
- node.dependencies: The list of upstream dependent operators.
- node.dependencies.name: The name of dependent operators.
- node.dependencies.mode: RO-Read Only, RW-Read Write

#### 2.4 resource.prototxt

You may modify resource.prototxt to set the path of model files. You can set the `--resource_path` and `--resource_file` flags to instruct the server to check for resource.prototxt. The `resource.prototxt` provided must be a `core/configure/server_configure.proto:ResourceConf`.


```
model_toolkit_path: "conf"
model_toolkit_file: "general_infer_0/model_toolkit.prototxt"
general_model_path: "conf"
general_model_file: "general_infer_0/general_model.prototxt"
```

- model_toolkit_path: The directory path of model_toolkit.prototxt.
- model_toolkit_file: The file name of model_toolkit.prototxt.
- general_model_path: The directory path of general_model.prototxt.
- general_model_file: The file name of general_model.prototxt.

#### 2.5 model_toolkit.prototxt

The model_toolkit.prototxt specifies the parameters of predictor engines. The `model_toolkit.prototxt` provided must be a `core/configure/server_configure.proto:ModelToolkitConf`.

Example using the CPU engine:

```
engines {
  name: "general_infer_0"
  type: "PADDLE_INFER"
  reloadable_meta: "uci_housing_model/fluid_time_file"
  reloadable_type: "timestamp_ne"
  model_dir: "uci_housing_model"
  gpu_ids: -1
  enable_memory_optimization: true
  enable_ir_optimization: false
  use_trt: false
  use_lite: false
  use_xpu: false
  use_gpu: false
  combined_model: false
  gpu_multi_stream: false
  use_ascend_cl: false
  runtime_thread_num: 0
  batch_infer_size: 32
  enable_overrun: false
  allow_split_request: true
}
```

- name: The name of engine corresponding to the node name in workflow.prototxt.
- type: Only "PADDLE_INFER" is supported.
- reloadable_meta: Specify the marker file used for reloading.
- reloadable_type: Supports timestamp_ne/timestamp_gt/md5sum/revision/none

|reloadable_type|Description|
|---------------|----|
|timestamp_ne|Reload when the mtime of the reloadable_meta file changes|
|timestamp_gt|Reload when the mtime of the reloadable_meta file is greater than the last recorded one|
|md5sum|Not used|
|revision|Not used|

- model_dir: The path of model files.
- gpu_ids: Specify the GPU ids. Multiple device ids are supported:
```
# GPU0,1,2
gpu_ids: 0
gpu_ids: 1
gpu_ids: 2
```
- enable_memory_optimization: Enable memory optimization.
- enable_ir_optimization: Enable IR optimization.
- use_trt: Enable TensorRT. Requires use_gpu to be enabled.
- use_lite: Enable PaddleLite.
- use_xpu: Enable KUNLUN XPU.
- use_gpu: Enable GPU.
- combined_model: Enable combined model.
- gpu_multi_stream: Enable GPU multi-stream mode.
- use_ascend_cl: Enable Ascend CL; use on its own for ascend910, and together with use_lite for ascend310
- runtime_thread_num: Async mode is enabled when the value is greater than 0, and that many predictors are created (see the sketch after this list).
- batch_infer_size: The max batch size of Async mode.
- enable_overrun: In Async mode, allow overrun, i.e. put the whole task into the task queue.
- allow_split_request: Allow splitting the request task in Async mode.
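
For comparison, below is a sketch of a GPU engine with asynchronous mode enabled; the field values are illustrative only and should be adapted to your model.
```
engines {
  name: "general_infer_0"
  type: "PADDLE_INFER"
  reloadable_meta: "serving_model/fluid_time_file"
  reloadable_type: "timestamp_ne"
  model_dir: "serving_model"
  # bind GPU cards 0 and 1
  gpu_ids: 0
  gpu_ids: 1
  use_gpu: true
  enable_memory_optimization: true
  # asynchronous mode: 2 predictors, batch up to 32 requests
  runtime_thread_num: 2
  batch_infer_size: 32
  enable_overrun: false
  allow_split_request: true
}
```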

#### 2.6 general_model.prototxt

The content of general_model.prototxt is the same as serving_server_conf.prototxt.

```
feed_var {
  name: "x"
  alias_name: "x"
  is_lod_tensor: false
  feed_type: 1
  shape: 13
}
fetch_var {
  name: "fc_0.tmp_1"
  alias_name: "price"
  is_lod_tensor: false
  fetch_type: 1
  shape: 1
}
```

## Python Pipeline

### Quick start and stop

Example starting Pipeline Serving:
```BASH
python3 web_service.py
```
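
Here `web_service.py` is the Python program that builds the Pipeline DAG and loads the yml configuration described below. The following is a minimal sketch; the op, class and file names are illustrative (they mirror the det/rec ops used in the example config.yaml), and a real service overrides preprocess/postprocess with its own logic.
```python
# Minimal Pipeline service sketch (op and file names are illustrative).
from paddle_serving_server.web_service import WebService, Op

class DetOp(Op):
    pass  # a real service overrides preprocess()/postprocess() here

class RecOp(Op):
    pass  # a real service overrides preprocess()/postprocess() here

class OcrService(WebService):
    def get_pipeline_response(self, read_op):
        # Wire the DAG: request -> det -> rec -> response
        det_op = DetOp(name="det", input_ops=[read_op])
        rec_op = RecOp(name="rec", input_ops=[det_op])
        return rec_op

ocr_service = OcrService(name="ocr")
ocr_service.prepare_pipeline_config("config.yml")  # the yml configuration described below
ocr_service.run_service()
```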
### Stop Serving (execute the following command in the directory where Pipeline serving was started, or in the path set by the environment variable SERVING_HOME).
```BASH
python3 -m paddle_serving_server.serve stop
```
`stop` sends SIGINT to Pipeline Serving. With `kill`, SIGKILL will be sent to Pipeline Serving instead.

### yml Configuration
Python Pipeline provides a user-friendly programming framework for multi-model composite services.

Example of config.yaml:
```YAML
#RPC port. The RPC port and HTTP port cannot be empty at the same time. If the RPC port is empty and the HTTP port is not empty, the RPC port is automatically set to HTTP port+1.
rpc_port: 18090

#HTTP port. The RPC port and the HTTP port cannot be empty at the same time. If the RPC port is available and the HTTP port is empty, the HTTP port is not automatically generated
http_port: 9999

#worker_num, the maximum concurrency.
#When build_dag_each_worker=True, server will create processes within GRPC Server and DAG.
#When build_dag_each_worker=False, server will set the threadpool of GRPC.
worker_num: 20

#build_dag_each_worker: False, create processes with a single DAG; True, create multiple processes, each with an independent DAG
build_dag_each_worker: false

dag:
    #True: thread model; False: process model
    is_thread_op: False

    #retry times
    retry: 1

    # True: generate TimeLine profiling data; False: do not
    use_profile: false
    tracer:
        interval_s: 10

op:
    det:
        #concurrency; threads when is_thread_op=True, otherwise processes
        concurrency: 6

        #Loading local server configuration without server_endpoints.
        local_service_conf:
            #client type, one of brpc, grpc and local_predictor.
            client_type: local_predictor

            #det model path
            model_config: ocr_det_model

            #Fetch data list
            fetch_list: ["concat_1.tmp_0"]

            # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu, 5=arm ascend310, 6=arm ascend910
            device_type: 0

            #Device ID
            devices: ""

            #use_mkldnn, when running with mkldnn, ir_optim must be set to True
            #use_mkldnn: True

            #ir_optim, when running with TensorRT, ir_optim must be set to True
            ir_optim: True
            
            #precision, decreasing accuracy can increase speed
            #GPU supports: "fp32"(default), "fp16", "int8";
            #CPU supports: "fp32"(default), "fp16", "bf16"(mkldnn); not supported: "int8"
            precision: "fp32"
    rec:
        #concurrency; threads when is_thread_op=True, otherwise processes
        concurrency: 3

        #timeout in ms
        timeout: -1

        #retry times
        retry: 1

        #Loading local server configuration without server_endpoints.
        local_service_conf:

            #client type, one of brpc, grpc and local_predictor.
            client_type: local_predictor

            #rec model path
            model_config: ocr_rec_model

            #Fetch data list
            fetch_list: ["ctc_greedy_decoder_0.tmp_0", "softmax_0.tmp_0"]

            # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu, 5=arm ascend310, 6=arm ascend910
            device_type: 0
            
            #Device ID
            devices: ""

            #use_mkldnn, when running with mkldnn, ir_optim must be set to True
            #use_mkldnn: True

            #ir_optim, when running with TensorRT, ir_optim must be set to True
            ir_optim: True
            
            #precision, decreasing accuracy can increase speed
            #GPU supports: "fp32"(default), "fp16", "int8";
            #CPU supports: "fp32"(default), "fp16", "bf16"(mkldnn); not supported: "int8"
            precision: "fp32"
```

### Single-machine and multi-card inference

Single-machine multi-card inference can be abstracted as M OP processes bound to N GPU cards. It is controlled by three parameters in config.yml: first select the process mode (is_thread_op: False), then concurrency is the number of processes, and devices lists the GPU card IDs. Processes are bound to cards in round-robin order as they start. For example, with 7 OP processes and devices: "0,1,2" in config.yml, the 1st, 4th and 7th started processes are bound to card 0, the 2nd and 5th to card 1, and the 3rd and 6th to card 2.

Reference config.yaml:
```YAML
#True: thread model; False: process model
is_thread_op: False

#concurrency; threads when is_thread_op=True, otherwise processes
concurrency: 7

devices: "0,1,2"
```

### Heterogeneous Devices

In addition to CPU and GPU, Pipeline supports deployment on a variety of heterogeneous hardware, configured through device_type and devices in config.yml. device_type specifies the hardware type explicitly; when it is left empty, the type is inferred from devices. The device_type values are as follows:
- CPU(Intel) : 0
- GPU : 1
- TensorRT : 2
- CPU(Arm) : 3
- XPU : 4
- Ascend310(Arm) : 5
- Ascend910(Arm) : 6

Reference config.yaml:
```YAML
# device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu, 5=arm ascend310, 6=arm ascend910
device_type: 0
devices: "" # "0,1"
```

### Low precision inference

Python Pipeline supports low-precision inference. The precision types supported by CPU, GPU and TensorRT are listed below:
- CPU
  - fp32(default)
  - fp16
  - bf16(mkldnn)
- GPU
  - fp32(default)
  - fp16 (takes effect with TensorRT)
  - int8
- Tensor RT
  - fp32(default)
  - fp16
  - int8 

```YAML
#precision
#GPU support: "fp32"(default), "fp16(TensorRT)", "int8";
#CPU support: "fp32"(default), "fp16", "bf16"(mkldnn); not support: "int8"
precision: "fp32"
```