# Serving Configuration

([简体中文](./Serving_Configure_CN.md)|English)

## Overview

This guide focuses on Paddle C++ Serving and Python Pipeline configuration:

- [Model Configuration](#model-configuration): Generated automatically when a model is converted. It specifies the model's inputs and outputs.
- [C++ Serving](#c-serving): For high-performance scenarios. Covers quick start and starting with a user-defined configuration.
- [Python Pipeline](#python-pipeline): For scenarios that combine multiple models.

## Model Configuration

The model configuration is generated when a model is converted for Paddle Serving and is named `serving_client_conf.prototxt` / `serving_server_conf.prototxt`. It describes the model's inputs and outputs so that users can fill in request parameters easily. The model configuration file should not be modified. See the [Saving guide](./Save_EN.md) for model conversion. The file must conform to the schema defined in `core/configure/proto/general_model_config.proto`.

Example:

```
feed_var {
  name: "x"
  alias_name: "x"
  is_lod_tensor: false
  feed_type: 1
  shape: 13
}
fetch_var {
  name: "concat_1.tmp_0"
  alias_name: "concat_1.tmp_0"
  is_lod_tensor: false
  fetch_type: 1
  shape: 3
  shape: 640
  shape: 640
}
```

- feed_var: model input
- fetch_var: model output
- name: node name
- alias_name: alias name
- is_lod_tensor: whether the variable is a LoD tensor; see [Lod Introduction](./LOD_EN.md)
- feed_type/fetch_type: data type

|feed_type|Type|
|---------|----|
|0|INT64|
|1|FLOAT32|
|2|INT32|
|3|FP64|
|4|INT16|
|5|FP16|
|6|BF16|
|7|UINT8|
|8|INT8|

- shape: tensor shape
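
As a quick reference, the feed_type/fetch_type codes in the table above can be kept in a lookup table on the client side. The sketch below is illustrative only: `FEED_TYPE_NAMES` and `describe_var` are hypothetical helpers, not part of the Paddle Serving API.

```python
# Illustrative only: these names are hypothetical helpers, not Paddle Serving APIs.
# The codes mirror the feed_type/fetch_type table above.
FEED_TYPE_NAMES = {
    0: "INT64", 1: "FLOAT32", 2: "INT32", 3: "FP64", 4: "INT16",
    5: "FP16", 6: "BF16", 7: "UINT8", 8: "INT8",
}


def describe_var(name, type_code, shape):
    """Return a readable summary of one feed_var/fetch_var entry."""
    if type_code not in FEED_TYPE_NAMES:
        raise ValueError(f"unknown feed_type/fetch_type code: {type_code}")
    dims = "x".join(str(d) for d in shape)
    return f"{name}: {FEED_TYPE_NAMES[type_code]}[{dims}]"


# The two variables from the example configuration above.
print(describe_var("x", 1, [13]))
print(describe_var("concat_1.tmp_0", 1, [3, 640, 640]))
```

Rejecting unknown codes early makes a mismatch between client code and the generated prototxt easier to spot.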

## C++ Serving

### 1. Quick start and stop

The easiest way to start C++ Serving is to provide the `--model` and `--port` flags.

Example of starting C++ Serving:
```BASH
python3 -m paddle_serving_server.serve --model serving_model --port 9393
```

This command generates the server configuration files in the `workdir_9393` directory:

```
workdir_9393
├── general_infer_0
│   ├── fluid_time_file
│   ├── general_model.prototxt
│   └── model_toolkit.prototxt
├── infer_service.prototxt
├── resource.prototxt
└── workflow.prototxt
```

More flags:

| Argument                                       | Type  | Default | Description                                                                             |
| ---------------------------------------------- | ----- | ------- | --------------------------------------------------------------------------------------- |
| `thread`                                       | int   | `2`     | Number of brpc service threads                                                          |
| `op_num`                                       | int[] | `0`     | Number of threads for each model in asynchronous mode                                   |
| `op_max_batch`                                 | int[] | `32`    | Maximum batch size for each model in asynchronous mode                                  |
| `gpu_ids`                                      | str[] | `"-1"`  | GPU card ids for each model                                                             |
| `port`                                         | int   | `9292`  | Port exposed by the service to users                                                    |
| `model`                                        | str[] | `""`    | Path of the Paddle model directory to be served                                         |
| `mem_optim_off`                                | -     | -       | Disable memory / GPU memory optimization                                                |
| `ir_optim`                                     | bool  | False   | Enable analysis and optimization of the computation graph                               |
| `use_mkl` (Only for cpu version)               | -     | -       | Run inference with MKL. Requires `ir_optim` to be enabled.                              |
| `use_trt` (Only for trt version)               | -     | -       | Run inference with TensorRT. Requires `ir_optim` to be enabled.                         |
| `use_lite` (Only for Intel x86 CPU or ARM CPU) | -     | -       | Run PaddleLite inference. Requires `ir_optim` to be enabled.                            |
| `use_xpu`                                      | -     | -       | Run PaddleLite inference with Baidu Kunlun XPU. Requires `ir_optim` to be enabled.      |
| `precision`                                    | str   | FP32    | Precision mode; supports FP32, FP16, INT8                                               |
| `use_calib`                                    | bool  | False   | Use TensorRT int8 calibration                                                           |
| `gpu_multi_stream`                             | bool  | False   | Enable GPU multi-stream mode to get higher QPS                                          |

#### Serving a model with multiple GPUs.
```BASH
python3 -m paddle_serving_server.serve --model serving_model --thread 10 --port 9292 --gpu_ids 0,1,2
```
#### Serving two models.
```BASH
python3 -m paddle_serving_server.serve --model serving_model_1 serving_model_2 --thread 10 --port 9292
```
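
The launch commands above can also be assembled programmatically, for example from a deployment script. The sketch below uses only flags that appear in the flag table (`--model`, `--thread`, `--port`, `--gpu_ids`); the helper `build_serve_command` is hypothetical and is not shipped with Paddle Serving.

```python
import shlex


def build_serve_command(models, port, thread=None, gpu_ids=None):
    """Assemble the argument list for paddle_serving_server.serve.

    Hypothetical helper: it only strings together flags from the table above.
    """
    cmd = ["python3", "-m", "paddle_serving_server.serve", "--model", *models]
    if thread is not None:
        cmd += ["--thread", str(thread)]
    cmd += ["--port", str(port)]
    if gpu_ids:
        # The multi-GPU example passes the ids as one comma-separated value.
        cmd += ["--gpu_ids", ",".join(str(g) for g in gpu_ids)]
    return cmd


cmd = build_serve_command(["serving_model"], 9292, thread=10, gpu_ids=[0, 1, 2])
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.Popen` as is; `shlex.join` is used here only to display it as a shell command.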
#### Stop Serving.
```BASH
python3 -m paddle_serving_server.serve stop
```
`stop` sends SIGINT to C++ Serving. If `kill` is used instead of `stop`, SIGKILL is sent to C++ Serving.

### 2. Starting with user-defined Configuration

In most cases the flags above are sufficient. For finer control, you can edit the configuration files directly: service.prototxt, workflow.prototxt, resource.prototxt, model_toolkit.prototxt, and proj.conf.

Example of starting with a user-defined configuration:

```BASH
/bin/serving --flagfile=proj.conf
```

#### 2.1 proj.conf

proj.conf is a flag file that can hold many flags, one per line:
```
# for paddle inference
--precision=fp32
--use_calib=False
--reload_interval_s=10
# for brpc
--max_concurrency=0
--num_threads=10
--bthread_concurrency=10
--max_body_size=536870912
# default path
--inferservice_path=conf
--inferservice_file=infer_service.prototxt
--resource_path=conf
--resource_file=resource.prototxt
--workflow_path=conf
--workflow_file=workflow.prototxt
```

The table below describes each flag:

| name | Default | Description |
|------|--------|------|
|precision|"fp32"|Precision mode; supports FP32, FP16, INT8|
|use_calib|False|Only for deployment with TensorRT|
|reload_interval_s|10|Reload interval, in seconds|
|max_concurrency|0|Limit on requests processed in parallel; 0 means unlimited|
|num_threads|10|Number of brpc service threads|
|bthread_concurrency|10|Number of bthreads|
|max_body_size|536870912|Maximum size of a brpc message|
|inferservice_path|"conf"|Path of the inferservice conf|
|inferservice_file|"infer_service.prototxt"|Filename of the inferservice conf|
|resource_path|"conf"|Path of the resource conf|
|resource_file|"resource.prototxt"|Filename of the resource conf|
|workflow_path|"conf"|Path of the workflow conf|
|workflow_file|"workflow.prototxt"|Filename of the workflow conf|
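
To inspect a proj.conf before starting the server, the `--name=value` lines shown above can be read into a dictionary. This is a minimal reading aid written for this guide, not the parser the server itself uses; `parse_flagfile` is a hypothetical name and it handles only the line forms that appear in the example.

```python
def parse_flagfile(text):
    """Parse `--name=value` lines into a dict of strings.

    Blank lines and lines starting with `#` are skipped.
    """
    flags = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if not line.startswith("--") or "=" not in line:
            raise ValueError(f"unsupported flag line: {raw!r}")
        name, value = line[2:].split("=", 1)
        flags[name] = value
    return flags


sample = """
# for brpc
--max_concurrency=0
--num_threads=10
# default path
--workflow_path=conf
--workflow_file=workflow.prototxt
"""
flags = parse_flagfile(sample)
print(flags["workflow_path"] + "/" + flags["workflow_file"])
```

All values are kept as strings so that nothing is guessed about a flag's type.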

#### 2.2 service.prototxt

To set the listening port, modify service.prototxt. Use `--inferservice_path` and `--inferservice_file` to tell the server where to find service.prototxt. The `service.prototxt` file must be a `core/configure/server_configure.proto:InferServiceConf`.

```
port: 8010
services {
  name: "GeneralModelService"
  workflows: "workflow1"
}
```

- port: Listening port.
- services: No need to modify. The workflow1 is defined in workflow.prototxt.

#### 2.3 workflow.prototxt

To serve user-defined OPs, modify workflow.prototxt. Use `--workflow_path` and `--workflow_file` to tell the server where to find workflow.prototxt. The `workflow.prototxt` file must be a `core/configure/server_configure.proto:Workflow`.

In the example below, the model is served with three OPs. GeneralReaderOp converts the input data to tensors. GeneralInferOp, which depends on the output of GeneralReaderOp, runs inference on those tensors. GeneralResponseOp returns the output data.

```
workflows {
  name: "workflow1"
  workflow_type: "Sequence"
  nodes {
    name: "general_reader_0"
    type: "GeneralReaderOp"
  }
  nodes {
    name: "general_infer_0"
    type: "GeneralInferOp"
    dependencies {
      name: "general_reader_0"
      mode: "RO"
    }
  }
  nodes {
    name: "general_response_0"
    type: "GeneralResponseOp"
    dependencies {
      name: "general_infer_0"
      mode: "RO"
    }
  }
}
```

- name: The name of the workflow.
- workflow_type: "Sequence"
- nodes: A workflow consists of nodes.
- node.name: The name of the node, which corresponds to the node type. See `python/paddle_serving_server/dag.py`.
- node.type: The operator bound to the node. See the OPs in `serving/op`.
- node.dependencies: The list of upstream operators the node depends on.
- node.dependencies.name: The name of the upstream operator.
- node.dependencies.mode: RO (read only) or RW (read write).
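
The dependency structure described above can be sanity-checked before the server is started. The sketch below copies the three nodes of the example workflow into a plain dictionary and derives an order in which every node follows its dependencies. It is an illustration of the dependency idea only: `execution_order` is a hypothetical helper, and nothing here reflects how the server schedules OPs internally.

```python
from graphlib import TopologicalSorter  # available since Python 3.9

# Node name -> names of the upstream nodes it depends on (from the example).
workflow1 = {
    "general_reader_0": [],
    "general_infer_0": ["general_reader_0"],
    "general_response_0": ["general_infer_0"],
}


def execution_order(nodes):
    """Return node names so that every node follows its dependencies."""
    for name, deps in nodes.items():
        for dep in deps:
            if dep not in nodes:
                raise ValueError(f"{name} depends on undefined node {dep}")
    # TopologicalSorter raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(nodes).static_order())


print(execution_order(workflow1))
```

A misspelled `dependencies.name` shows up here as a `ValueError` instead of a failure at server start-up.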

#### 2.4 resource.prototxt

Modify resource.prototxt to set the path of the model files. Use `--resource_path` and `--resource_file` to tell the server where to find resource.prototxt. The `resource.prototxt` file must be a `core/configure/server_configure.proto:ResourceConf`.


```
model_toolkit_path: "conf"
model_toolkit_file: "general_infer_0/model_toolkit.prototxt"
general_model_path: "conf"
general_model_file: "general_infer_0/general_model.prototxt"
```

- model_toolkit_path: The directory path of model_toolkit.prototxt.
- model_toolkit_file: The file name of model_toolkit.prototxt.
- general_model_path: The directory path of general_model.prototxt.
- general_model_file: The file name of general_model.prototxt.
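
Each `*_path` / `*_file` pair above names a directory and a file inside it. Assuming the two are simply joined (an assumption made for illustration, not confirmed by this guide), the locations used in the example resolve as follows. `resolve_resource` is a hypothetical helper.

```python
import posixpath

# Values copied from the resource.prototxt example above.
resource = {
    "model_toolkit_path": "conf",
    "model_toolkit_file": "general_infer_0/model_toolkit.prototxt",
    "general_model_path": "conf",
    "general_model_file": "general_infer_0/general_model.prototxt",
}


def resolve_resource(conf, prefix):
    """Join `<prefix>_path` and `<prefix>_file` into one relative path."""
    return posixpath.join(conf[prefix + "_path"], conf[prefix + "_file"])


print(resolve_resource(resource, "model_toolkit"))
print(resolve_resource(resource, "general_model"))
```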

#### 2.5 model_toolkit.prototxt

model_toolkit.prototxt specifies the parameters of the predictor engines. The `model_toolkit.prototxt` file must be a `core/configure/server_configure.proto:ModelToolkitConf`.

Example using the CPU engine:

```
engines {
  name: "general_infer_0"
  type: "PADDLE_INFER"
  reloadable_meta: "uci_housing_model/fluid_time_file"
  reloadable_type: "timestamp_ne"
  model_dir: "uci_housing_model"
  gpu_ids: -1
  enable_memory_optimization: true
  enable_ir_optimization: false
  use_trt: false
  use_lite: false
  use_xpu: false
  use_gpu: false
  combined_model: false
  gpu_multi_stream: false
  runtime_thread_num: 0
  batch_infer_size: 32
  enable_overrun: false
  allow_split_request: true
}
```

- name: The name of the engine, corresponding to the node name in workflow.prototxt.
- type: Only "PADDLE_INFER" is supported.
- reloadable_meta: The marker file that is checked to decide whether to reload the model.
- reloadable_type: Supports timestamp_ne/timestamp_gt/md5sum/revision/none.

|reloadable_type|Description|
|---------------|----|
|timestamp_ne|Reload when the mtime of the reloadable_meta file has changed|
|timestamp_gt|Reload when the mtime of the reloadable_meta file is greater than the last recorded value|
|md5sum|Not used|
|revision|Not used|

- model_dir: The path of the model files.
- gpu_ids: The GPU device ids. Multiple device ids are supported:
```
# GPU0,1,2
gpu_ids: 0
gpu_ids: 1
gpu_ids: 2
```
- enable_memory_optimization: Enable memory optimization.
- enable_ir_optimization: Enable IR optimization.
- use_trt: Enable TensorRT. Requires use_gpu to be on.
- use_lite: Enable PaddleLite.
- use_xpu: Enable Kunlun XPU.
- use_gpu: Enable GPU.
- combined_model: Enable combined model.
- gpu_multi_stream: Enable GPU multi-stream mode.
- runtime_thread_num: Enables Async mode when greater than 0 and creates the corresponding predictors.
- batch_infer_size: The maximum batch size in Async mode.
- enable_overrun: Enable overrun in Async mode, which means putting the whole task into the task queue.
- allow_split_request: Allow a request task to be split in Async mode.
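
The difference between the two active `reloadable_type` values in the table above comes down to how the marker file's modification time is compared with the last recorded one. The sketch below is one reading of that table, not the server's actual implementation; `should_reload` is a hypothetical function, and treating the remaining types as "never reload" is an assumption.

```python
def should_reload(reloadable_type, current_mtime, last_mtime):
    """Decide whether to reload, given the marker file's mtimes."""
    if reloadable_type == "timestamp_ne":
        # Any change counts, including a timestamp that moved backwards.
        return current_mtime != last_mtime
    if reloadable_type == "timestamp_gt":
        # Only a strictly newer timestamp counts.
        return current_mtime > last_mtime
    # md5sum, revision, none: assumed to never trigger a reload in this sketch.
    return False


# A marker file restored from an older backup (mtime went from 10 to 5):
print(should_reload("timestamp_ne", 5, 10))
print(should_reload("timestamp_gt", 5, 10))
```

In practice `current_mtime` would come from something like `os.path.getmtime` on the `reloadable_meta` file, polled at the reload interval.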

#### 2.6 general_model.prototxt

The content of general_model.prototxt is the same as that of serving_server_conf.prototxt.

```
feed_var {
  name: "x"
  alias_name: "x"
  is_lod_tensor: false
  feed_type: 1
  shape: 13
}
fetch_var {
  name: "fc_0.tmp_1"
  alias_name: "price"
  is_lod_tensor: false
  fetch_type: 1
  shape: 1
}
```

## Python Pipeline

### Quick start and stop

Example of starting Pipeline Serving:
```BASH
python3 -m paddle_serving_server.serve --model serving_model --port 9393
```
### Stop Serving.
```BASH
python3 -m paddle_serving_server.serve stop
```
`stop` sends SIGINT to Pipeline Serving. If `kill` is used instead of `stop`, SIGKILL is sent to Pipeline Serving.

### yml Configuration
Python Pipeline provides a user-friendly programming framework for multi-model composite services.

Example of config.yaml:
```YAML
# RPC port. The RPC port and HTTP port cannot both be empty. If the RPC port is empty and the HTTP port is not, the RPC port is automatically set to HTTP port + 1.
rpc_port: 18090

# HTTP port. The RPC port and HTTP port cannot both be empty. If the RPC port is set and the HTTP port is empty, no HTTP port is generated automatically.
http_port: 9999

# worker_num: the maximum concurrency.
# When build_dag_each_worker=True, the server creates processes, each containing a gRPC server and a DAG.
# When build_dag_each_worker=False, worker_num sets the size of the gRPC thread pool.
worker_num: 20

# build_dag_each_worker: False, one DAG is created inside the process; True, multiple independent DAGs are created, one per process.
build_dag_each_worker: false

dag:
    # True: thread model; False: process model
    is_thread_op: False

    # Number of retries
    retry: 1

    # True: generate Timeline data; False: do not generate it
    use_profile: false
    tracer:
        interval_s: 10

op:
    det:
        # Concurrency: number of threads when is_thread_op=True, otherwise number of processes
        concurrency: 6

        # Load the local service configuration when server_endpoints is not set.
        local_service_conf:
            # Client type: brpc, grpc, or local_predictor.
            client_type: local_predictor

            # Path of the det model
            model_config: ocr_det_model

            # Fetch data list
            fetch_list: ["concat_1.tmp_0"]

            # device_type: 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
            device_type: 0

            # Device ID
            devices: ""

            # use_mkldnn: when running on mkldnn, ir_optim must be True
            #use_mkldnn: True

            # ir_optim: when running on TensorRT, ir_optim must be True
            ir_optim: True

            # precision: lowering the precision can increase speed
            # GPU supports: "fp32"(default), "fp16", "int8";
            # CPU supports: "fp32"(default), "fp16", "bf16"(mkldnn); not supported: "int8"
            precision: "fp32"
    rec:
        # Concurrency: number of threads when is_thread_op=True, otherwise number of processes
        concurrency: 3

        # Timeout, in ms
        timeout: -1

        # Number of retries
        retry: 1

        # Load the local service configuration when server_endpoints is not set.
        local_service_conf:

            # Client type: brpc, grpc, or local_predictor.
            client_type: local_predictor

            # Path of the rec model
            model_config: ocr_rec_model

            # Fetch data list
            fetch_list: ["ctc_greedy_decoder_0.tmp_0", "softmax_0.tmp_0"]

            # device_type: 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
            device_type: 0

            # Device ID
            devices: ""

            # use_mkldnn: when running on mkldnn, ir_optim must be True
            #use_mkldnn: True

            # ir_optim: when running on TensorRT, ir_optim must be True
            ir_optim: True

            # precision: lowering the precision can increase speed
            # GPU supports: "fp32"(default), "fp16", "int8";
            # CPU supports: "fp32"(default), "fp16", "bf16"(mkldnn); not supported: "int8"
            precision: "fp32"
```
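
The port rules stated in the comments of the configuration above can be written as a small function. This sketch restates the documented rule only; `resolve_ports` is a hypothetical name and does not correspond to a function in the Pipeline code base.

```python
def resolve_ports(rpc_port=None, http_port=None):
    """Apply the documented rpc_port/http_port defaulting rule.

    Returns (rpc_port, http_port); http_port stays None when it was not set.
    """
    if rpc_port is None and http_port is None:
        raise ValueError("rpc_port and http_port cannot both be empty")
    if rpc_port is None:
        # Only the RPC port is derived automatically: HTTP port + 1.
        rpc_port = http_port + 1
    return rpc_port, http_port


print(resolve_ports(rpc_port=18090, http_port=9999))
print(resolve_ports(http_port=9999))
print(resolve_ports(rpc_port=18090))
```

The asymmetry is deliberate: a missing RPC port is filled in, while a missing HTTP port simply means no HTTP port is generated.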

439 440 441
### Single-machine and multi-card inference

Single-machine multi-card inference can be abstracted as M OP processes bound to N GPU cards. It is controlled by three parameters in config.yml: `is_thread_op` selects process mode, `concurrency` sets the number of processes, and `devices` lists the GPU card IDs. Binding is done by cycling through the GPU card IDs as the processes start. For example, if 7 OP processes are started with `devices: "0,1,2"` in config.yml, the 1st, 4th, and 7th processes are bound to card 0, the 2nd and 5th processes to card 1, and the 3rd and 6th processes to card 2.

Reference config.yaml:
```YAML
# True: thread model; False: process model
is_thread_op: False

# Concurrency: number of threads when is_thread_op=True, otherwise number of processes
concurrency: 7

devices: "0,1,2"
```
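
The binding rule described in this section is a round-robin over the card IDs. A minimal sketch, using the values from the reference configuration (the function name `bind_devices` is hypothetical):

```python
def bind_devices(concurrency, devices):
    """Return the card ID assigned to each OP process, in start order."""
    ids = [d.strip() for d in devices.split(",") if d.strip()]
    if not ids:
        raise ValueError("devices must list at least one card ID")
    # Process i (0-based) takes the card at position i modulo the card count.
    return [ids[i % len(ids)] for i in range(concurrency)]


assignment = bind_devices(7, "0,1,2")
for process_no, card in enumerate(assignment, start=1):
    print(f"process {process_no} -> card {card}")
```

With 7 processes and 3 cards, card 0 ends up with three processes while cards 1 and 2 get two each, so choosing a `concurrency` that is a multiple of the card count spreads the load evenly.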

### Heterogeneous Devices

In addition to CPU and GPU, Pipeline also supports deployment on a variety of heterogeneous hardware. This is controlled by `device_type` and `devices` in config.yml. `device_type` takes priority; when it is not set, the device is inferred from `devices`. The `device_type` values are:
- CPU(Intel) : 0
- GPU : 1
- TensorRT : 2
- CPU(Arm) : 3
- XPU : 4

Reference config.yaml:
```YAML
# device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu
device_type: 0
devices: "" # "0,1"
```
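
The precedence between `device_type` and `devices` can be sketched as below. The fallback used here, an empty `devices` string meaning CPU and a non-empty one meaning GPU, is an assumption made for illustration; this guide only says that `devices` is consulted when `device_type` is not set. `resolve_device_type` is a hypothetical name.

```python
# Codes from the device_type list above.
DEVICE_TYPES = {0: "cpu", 1: "gpu", 2: "tensorRT", 3: "arm cpu", 4: "kunlun xpu"}


def resolve_device_type(device_type=None, devices=""):
    """Return the effective device_type code."""
    if device_type is not None:
        if device_type not in DEVICE_TYPES:
            raise ValueError(f"unknown device_type: {device_type}")
        return device_type
    # Assumed fallback: card IDs present -> gpu (1), otherwise cpu (0).
    return 1 if devices.strip() else 0


print(DEVICE_TYPES[resolve_device_type(device_type=0, devices="")])
print(DEVICE_TYPES[resolve_device_type(devices="0,1")])
```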

### Low precision inference

Python Pipeline supports low-precision inference. The precision types supported by CPU, GPU, and TensorRT are listed below:
- CPU
  - fp32(default)
  - fp16
  - bf16(mkldnn)
- GPU
  - fp32(default)
  - fp16(takes effect only with TensorRT)
  - int8
- TensorRT
  - fp32(default)
  - fp16
  - int8

```YAML
# precision
# GPU supports: "fp32"(default), "fp16"(TensorRT), "int8";
# CPU supports: "fp32"(default), "fp16", "bf16"(mkldnn); not supported: "int8"
precision: "fp32"
```
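
A configuration can be checked against the support list above before it is deployed. The table below is copied from this section; `check_precision` is a hypothetical helper, not a Paddle Serving API.

```python
# Supported precision values per device, as listed in this section.
SUPPORTED_PRECISION = {
    "cpu": {"fp32", "fp16", "bf16"},
    "gpu": {"fp32", "fp16", "int8"},
    "tensorrt": {"fp32", "fp16", "int8"},
}


def check_precision(device, precision="fp32"):
    """Raise ValueError if the device does not list this precision."""
    supported = SUPPORTED_PRECISION.get(device)
    if supported is None:
        raise ValueError(f"unknown device: {device}")
    if precision not in supported:
        raise ValueError(f"{device} does not support precision {precision!r}")
    return precision


print(check_precision("gpu", "int8"))
try:
    check_precision("cpu", "int8")
except ValueError as err:
    print(err)
```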