Unverified commit fd1edf55, authored by Jiawei Wang, committed by GitHub

Merge pull request #1484 from HexToString/v0.7.0

cherry-pick #1483
...@@ -176,8 +176,8 @@ python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --p ...@@ -176,8 +176,8 @@ python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --p
| Argument | Type | Default | Description | | Argument | Type | Default | Description |
| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- | | ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
| `thread` | int | `2` | Number of brpc service thread | | `thread` | int | `2` | Number of brpc service thread |
| `op_num` | int[]| `0` | Thread Number for each model in asynchronous mode | | `runtime_thread_num` | int[]| `0` | Thread Number for each model in asynchronous mode |
| `op_max_batch` | int[]| `32` | Batch Number for each model in asynchronous mode | | `batch_infer_size` | int[]| `32` | Batch Number for each model in asynchronous mode |
| `gpu_ids` | str[]| `"-1"` | Gpu card id for each model | | `gpu_ids` | str[]| `"-1"` | Gpu card id for each model |
| `port` | int | `9292` | Exposed port of current service to users | | `port` | int | `9292` | Exposed port of current service to users |
| `model` | str[]| `""` | Path of paddle model directory to be served | | `model` | str[]| `""` | Path of paddle model directory to be served |
...@@ -197,8 +197,8 @@ python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --p ...@@ -197,8 +197,8 @@ python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --p
In asynchronous mode, each model starts the N threads you specify, and each thread holds one model instance. In other words, each model is equivalent to a thread pool of N threads, which take tasks from the pool's task queue and execute them. In asynchronous mode, each model starts the N threads you specify, and each thread holds one model instance. In other words, each model is equivalent to a thread pool of N threads, which take tasks from the pool's task queue and execute them.
In asynchronous mode, each RPC server thread is only responsible for putting the request into the task queue of the model's thread pool. After a task has been executed, the completed task is removed from the task queue. In asynchronous mode, each RPC server thread is only responsible for putting the request into the task queue of the model's thread pool. After a task has been executed, the completed task is removed from the task queue.
In the table above, the number of RPC server threads is specified by --thread, and the default value is 2. In the table above, the number of RPC server threads is specified by --thread, and the default value is 2.
--op_num specifies the number of threads in the thread pool of each model. The default value is 0, indicating that asynchronous mode is not used. --runtime_thread_num specifies the number of threads in the thread pool of each model. The default value is 0, indicating that asynchronous mode is not used.
--op_max_batch specifies the number of batches for each model. The default value is 32. It takes effect when --op_num is not 0. --batch_infer_size specifies the number of batches for each model. The default value is 32. It takes effect when --runtime_thread_num is not 0.
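The asynchronous mode described above can be pictured with a small Python sketch. This is not Paddle Serving's implementation: `ModelPool`, `submit`, `close`, and `fake_infer` are hypothetical names, and a plain function stands in for a model instance. It only illustrates the shape of the design, in which RPC threads enqueue requests and N worker threads per model drain a shared task queue.

```python
import queue
import threading
from concurrent.futures import Future


def fake_infer(x):
    # Stand-in for a real model instance (assumption for illustration).
    return x * 2


class ModelPool:
    """Hypothetical per-model pool: N worker threads sharing one task queue."""

    def __init__(self, runtime_thread_num):
        self.tasks = queue.Queue()
        self.workers = [
            threading.Thread(target=self._work, daemon=True)
            for _ in range(runtime_thread_num)
        ]
        for w in self.workers:
            w.start()

    def _work(self):
        # In the real server each worker would hold its own model instance.
        while True:
            item = self.tasks.get()
            if item is None:  # shutdown signal
                break
            payload, fut = item
            fut.set_result(fake_infer(payload))

    def submit(self, payload):
        # What an RPC server thread does: enqueue the request and return.
        fut = Future()
        self.tasks.put((payload, fut))
        return fut

    def close(self):
        # One shutdown signal per worker, then wait for them to exit.
        for _ in self.workers:
            self.tasks.put(None)
        for w in self.workers:
            w.join()


pool = ModelPool(runtime_thread_num=4)
futures = [pool.submit(i) for i in range(8)]
results = [f.result(timeout=5) for f in futures]
pool.close()
```

Because the submitting thread only enqueues work, throughput can grow with the number of workers, while each individual request pays a small queuing cost. That matches the trade-off the README describes for asynchronous mode.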
#### When you want a model to use multiple GPU cards. #### When you want a model to use multiple GPU cards.
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --gpu_ids 0,1,2 python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --gpu_ids 0,1,2
#### When you want 2 models. #### When you want 2 models.
...@@ -206,7 +206,7 @@ python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_m ...@@ -206,7 +206,7 @@ python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_m
#### When you want 2 models, and want each of them to use multiple GPU cards. #### When you want 2 models, and want each of them to use multiple GPU cards.
python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 --gpu_ids 0,1 1,2 python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 --gpu_ids 0,1 1,2
#### When a service contains two models, each model needs multiple GPU cards, and asynchronous mode is needed with a different concurrency number for each model. #### When a service contains two models, each model needs multiple GPU cards, and asynchronous mode is needed with a different concurrency number for each model.
python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 --gpu_ids 0,1 1,2 --op_num 4 8 python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 --gpu_ids 0,1 1,2 --runtime_thread_num 4 8
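As a rough illustration of how the per-model values in the command above are grouped, the sketch below rebuilds only a few of the arguments with `argparse`. The `--gpu_ids`, `--runtime_thread_num`, and `--batch_infer_size` definitions follow the `serve_args()` hunk in this diff. The `--model` definition is an assumption based on the README table, and `per_model` is a hypothetical name.

```python
import argparse

# Mirrors only a few arguments; this is not the full serve_args().
parser = argparse.ArgumentParser("serve")
# Assumed definition: the README table lists `model` as str[] with default "".
parser.add_argument("--model", type=str, default="", nargs="+")
parser.add_argument("--gpu_ids", type=str, default="", nargs="+")
parser.add_argument("--runtime_thread_num", type=int, default=0, nargs="+")
parser.add_argument("--batch_infer_size", type=int, default=32, nargs="+")

args = parser.parse_args(
    "--model uci_housing_model_1 uci_housing_model_2 "
    "--gpu_ids 0,1 1,2 --runtime_thread_num 4 8".split())

# nargs="+" collects the values of each flag into a list, one entry per model.
per_model = list(zip(args.model, args.gpu_ids, args.runtime_thread_num))
```

Note that a flag left off the command line keeps its scalar default (here `args.batch_infer_size` is the int `32`, not a list), which is presumably why the `Server` class in this diff wraps bare ints in a list before indexing.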
</center> </center>
```python ```python
......
...@@ -175,8 +175,8 @@ python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --p ...@@ -175,8 +175,8 @@ python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --p
| Argument | Type | Default | Description | | Argument | Type | Default | Description |
| ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- | | ---------------------------------------------- | ---- | ------- | ----------------------------------------------------- |
| `thread` | int | `2` | Number of brpc service thread | | `thread` | int | `2` | Number of brpc service thread |
| `op_num` | int[]| `0` | Thread Number for each model in asynchronous mode | | `runtime_thread_num` | int[]| `0` | Thread Number for each model in asynchronous mode |
| `op_max_batch` | int[]| `32` | Batch Number for each model in asynchronous mode | | `batch_infer_size` | int[]| `32` | Batch Number for each model in asynchronous mode |
| `gpu_ids` | str[]| `"-1"` | Gpu card id for each model | | `gpu_ids` | str[]| `"-1"` | Gpu card id for each model |
| `port` | int | `9292` | Exposed port of current service to users | | `port` | int | `9292` | Exposed port of current service to users |
| `model` | str[]| `""` | Path of paddle model directory to be served | | `model` | str[]| `""` | Path of paddle model directory to be served |
...@@ -195,8 +195,8 @@ python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --p ...@@ -195,8 +195,8 @@ python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --p
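Asynchronous mode helps improve the throughput (QPS) of the service, but the latency of a single request increases slightly. Asynchronous mode helps improve the throughput (QPS) of the service, but the latency of a single request increases slightly.
In asynchronous mode, each model starts the N threads you specify, and each thread contains one model instance. In other words, each model is equivalent to a thread pool of N threads, which take tasks from the thread pool's task queue and execute them. In asynchronous mode, each model starts the N threads you specify, and each thread contains one model instance. In other words, each model is equivalent to a thread pool of N threads, which take tasks from the thread pool's task queue and execute them.
In asynchronous mode, each RPC Server thread is only responsible for putting the Request into the task queue of the model's thread pool. Once the task has been executed, the completed task is taken out of the task queue. In asynchronous mode, each RPC Server thread is only responsible for putting the Request into the task queue of the model's thread pool. Once the task has been executed, the completed task is taken out of the task queue.
In the table above, --thread 10 specifies the number of RPC Server threads (default 2), and --op_num specifies the number of threads N in each model's thread pool (default 0, meaning asynchronous mode is not used). In the table above, --thread 10 specifies the number of RPC Server threads (default 2), and --runtime_thread_num specifies the number of threads N in each model's thread pool (default 0, meaning asynchronous mode is not used).
--op_max_batch specifies the number of batches for each model. The default value is 32. This parameter takes effect only when --op_num is not 0. --batch_infer_size specifies the number of batches for each model. The default value is 32. This parameter takes effect only when --runtime_thread_num is not 0.
#### When one of your models needs to be deployed on multiple GPU cards. #### When one of your models needs to be deployed on multiple GPU cards.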
python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --gpu_ids 0,1,2 python3 -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292 --gpu_ids 0,1,2
...@@ -205,7 +205,7 @@ python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_m ...@@ -205,7 +205,7 @@ python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_m
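#### When your service contains two models, and each model needs to be deployed on multiple GPU cards. #### When your service contains two models, and each model needs to be deployed on multiple GPU cards.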
python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 --gpu_ids 0,1 1,2 python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 --gpu_ids 0,1 1,2
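#### When your service contains two models, each model needs multiple GPU cards, and asynchronous mode is needed with a different concurrency number for each model. #### When your service contains two models, each model needs multiple GPU cards, and asynchronous mode is needed with a different concurrency number for each model.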
python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 --gpu_ids 0,1 1,2 --op_num 4 8 python3 -m paddle_serving_server.serve --model uci_housing_model_1 uci_housing_model_2 --thread 10 --port 9292 --gpu_ids 0,1 1,2 --runtime_thread_num 4 8
</center> </center>
......
...@@ -109,7 +109,12 @@ def is_gpu_mode(unformatted_gpus): ...@@ -109,7 +109,12 @@ def is_gpu_mode(unformatted_gpus):
def serve_args(): def serve_args():
parser = argparse.ArgumentParser("serve") parser = argparse.ArgumentParser("serve")
parser.add_argument("server", type=str, default="start",nargs="?", help="stop or start PaddleServing") parser.add_argument(
"server",
type=str,
default="start",
nargs="?",
help="stop or start PaddleServing")
parser.add_argument( parser.add_argument(
"--thread", "--thread",
type=int, type=int,
...@@ -123,9 +128,13 @@ def serve_args(): ...@@ -123,9 +128,13 @@ def serve_args():
parser.add_argument( parser.add_argument(
"--gpu_ids", type=str, default="", nargs="+", help="gpu ids") "--gpu_ids", type=str, default="", nargs="+", help="gpu ids")
parser.add_argument( parser.add_argument(
"--op_num", type=int, default=0, nargs="+", help="Number of each op") "--runtime_thread_num",
type=int,
default=0,
nargs="+",
help="Number of each op")
parser.add_argument( parser.add_argument(
"--op_max_batch", "--batch_infer_size",
type=int, type=int,
default=32, default=32,
nargs="+", nargs="+",
...@@ -251,11 +260,11 @@ def start_gpu_card_model(gpu_mode, port, args): # pylint: disable=doc-string-mi ...@@ -251,11 +260,11 @@ def start_gpu_card_model(gpu_mode, port, args): # pylint: disable=doc-string-mi
if args.gpu_multi_stream and device == "gpu": if args.gpu_multi_stream and device == "gpu":
server.set_gpu_multi_stream() server.set_gpu_multi_stream()
if args.op_num: if args.runtime_thread_num:
server.set_op_num(args.op_num) server.set_runtime_thread_num(args.runtime_thread_num)
if args.op_max_batch: if args.batch_infer_size:
server.set_op_max_batch(args.op_max_batch) server.set_batch_infer_size(args.batch_infer_size)
if args.use_lite: if args.use_lite:
server.set_lite() server.set_lite()
...@@ -370,7 +379,7 @@ class MainService(BaseHTTPRequestHandler): ...@@ -370,7 +379,7 @@ class MainService(BaseHTTPRequestHandler):
self.wfile.write(json.dumps(response).encode()) self.wfile.write(json.dumps(response).encode())
def stop_serving(command : str, port : int = None): def stop_serving(command: str, port: int=None):
''' '''
Stop PaddleServing by port. Stop PaddleServing by port.
...@@ -400,7 +409,7 @@ def stop_serving(command : str, port : int = None): ...@@ -400,7 +409,7 @@ def stop_serving(command : str, port : int = None):
start_time = info["start_time"] start_time = info["start_time"]
if port is not None: if port is not None:
if port in storedPort: if port in storedPort:
kill_stop_process_by_pid(command ,pid) kill_stop_process_by_pid(command, pid)
infoList.remove(info) infoList.remove(info)
if len(infoList): if len(infoList):
with open(filepath, "w") as fp: with open(filepath, "w") as fp:
...@@ -411,16 +420,17 @@ def stop_serving(command : str, port : int = None): ...@@ -411,16 +420,17 @@ def stop_serving(command : str, port : int = None):
else: else:
if lastInfo == info: if lastInfo == info:
raise ValueError( raise ValueError(
"Please confirm the port [%s] you specified is correct." % "Please confirm the port [%s] you specified is correct."
port) % port)
else: else:
pass pass
else: else:
kill_stop_process_by_pid(command ,pid) kill_stop_process_by_pid(command, pid)
if lastInfo == info: if lastInfo == info:
os.remove(filepath) os.remove(filepath)
return True return True
if __name__ == "__main__": if __name__ == "__main__":
# args.device is not used at all. # args.device is not used at all.
# just keep the interface. # just keep the interface.
......
...@@ -82,8 +82,8 @@ class Server(object): ...@@ -82,8 +82,8 @@ class Server(object):
self.mkl_flag = False self.mkl_flag = False
self.device = "cpu" self.device = "cpu"
self.gpuid = [] self.gpuid = []
self.op_num = [0] self.runtime_thread_num = [0]
self.op_max_batch = [32] self.batch_infer_size = [32]
self.use_trt = False self.use_trt = False
self.gpu_multi_stream = False self.gpu_multi_stream = False
self.use_lite = False self.use_lite = False
...@@ -171,11 +171,11 @@ class Server(object): ...@@ -171,11 +171,11 @@ class Server(object):
def set_gpuid(self, gpuid): def set_gpuid(self, gpuid):
self.gpuid = format_gpu_to_strlist(gpuid) self.gpuid = format_gpu_to_strlist(gpuid)
def set_op_num(self, op_num): def set_runtime_thread_num(self, runtime_thread_num):
self.op_num = op_num self.runtime_thread_num = runtime_thread_num
def set_op_max_batch(self, op_max_batch): def set_batch_infer_size(self, batch_infer_size):
self.op_max_batch = op_max_batch self.batch_infer_size = batch_infer_size
def set_trt(self): def set_trt(self):
self.use_trt = True self.use_trt = True
...@@ -205,15 +205,15 @@ class Server(object): ...@@ -205,15 +205,15 @@ class Server(object):
else: else:
self.gpuid = ["-1"] self.gpuid = ["-1"]
if isinstance(self.op_num, int): if isinstance(self.runtime_thread_num, int):
self.op_num = [self.op_num] self.runtime_thread_num = [self.runtime_thread_num]
if len(self.op_num) == 0: if len(self.runtime_thread_num) == 0:
self.op_num.append(0) self.runtime_thread_num.append(0)
if isinstance(self.op_max_batch, int): if isinstance(self.batch_infer_size, int):
self.op_max_batch = [self.op_max_batch] self.batch_infer_size = [self.batch_infer_size]
if len(self.op_max_batch) == 0: if len(self.batch_infer_size) == 0:
self.op_max_batch.append(32) self.batch_infer_size.append(32)
index = 0 index = 0
...@@ -224,9 +224,10 @@ class Server(object): ...@@ -224,9 +224,10 @@ class Server(object):
engine.reloadable_meta = model_config_path + "/fluid_time_file" engine.reloadable_meta = model_config_path + "/fluid_time_file"
os.system("touch {}".format(engine.reloadable_meta)) os.system("touch {}".format(engine.reloadable_meta))
engine.reloadable_type = "timestamp_ne" engine.reloadable_type = "timestamp_ne"
engine.runtime_thread_num = self.op_num[index % len(self.op_num)] engine.runtime_thread_num = self.runtime_thread_num[index % len(
engine.batch_infer_size = self.op_max_batch[index % self.runtime_thread_num)]
len(self.op_max_batch)] engine.batch_infer_size = self.batch_infer_size[index % len(
self.batch_infer_size)]
engine.enable_overrun = False engine.enable_overrun = False
engine.allow_split_request = True engine.allow_split_request = True
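The normalization and `index % len(...)` lookup in the hunk above can be sketched in isolation. The helper names `normalize` and `per_engine` and the `engine_count` parameter are hypothetical. The logic follows the diff: a bare int is wrapped in a list, an empty list falls back to the default, and the modulo wraps around when fewer values than engines are supplied.

```python
def normalize(value, default):
    # Accept a bare int or a list, as the Server code in this diff does.
    if isinstance(value, int):
        value = [value]
    if len(value) == 0:
        value = [default]
    return value


def per_engine(values, engine_count):
    # index % len(values) reuses values cyclically across engines.
    return [values[i % len(values)] for i in range(engine_count)]


threads = per_engine(normalize([4, 8], 0), engine_count=3)
batches = per_engine(normalize(32, 32), engine_count=3)
```

With this scheme a single value applies to every model, while a shorter list is reused cyclically instead of raising an `IndexError`.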
......
...@@ -133,8 +133,8 @@ class WebService(object): ...@@ -133,8 +133,8 @@ class WebService(object):
use_calib=False, use_calib=False,
use_trt=False, use_trt=False,
gpu_multi_stream=False, gpu_multi_stream=False,
op_num=None, runtime_thread_num=None,
op_max_batch=None): batch_infer_size=None):
device = "cpu" device = "cpu"
server = Server() server = Server()
...@@ -187,11 +187,11 @@ class WebService(object): ...@@ -187,11 +187,11 @@ class WebService(object):
if gpu_multi_stream and device == "gpu": if gpu_multi_stream and device == "gpu":
server.set_gpu_multi_stream() server.set_gpu_multi_stream()
if op_num: if runtime_thread_num:
server.set_op_num(op_num) server.set_runtime_thread_num(runtime_thread_num)
if op_max_batch: if batch_infer_size:
server.set_op_max_batch(op_max_batch) server.set_batch_infer_size(batch_infer_size)
if use_lite: if use_lite:
server.set_lite() server.set_lite()
...@@ -225,8 +225,8 @@ class WebService(object): ...@@ -225,8 +225,8 @@ class WebService(object):
use_calib=self.use_calib, use_calib=self.use_calib,
use_trt=self.use_trt, use_trt=self.use_trt,
gpu_multi_stream=self.gpu_multi_stream, gpu_multi_stream=self.gpu_multi_stream,
op_num=self.op_num, runtime_thread_num=self.runtime_thread_num,
op_max_batch=self.op_max_batch)) batch_infer_size=self.batch_infer_size))
def prepare_server(self, def prepare_server(self,
workdir, workdir,
...@@ -241,8 +241,8 @@ class WebService(object): ...@@ -241,8 +241,8 @@ class WebService(object):
mem_optim=True, mem_optim=True,
use_trt=False, use_trt=False,
gpu_multi_stream=False, gpu_multi_stream=False,
op_num=None, runtime_thread_num=None,
op_max_batch=None, batch_infer_size=None,
gpuid=None): gpuid=None):
print("This API will be deprecated later. Please do not use it") print("This API will be deprecated later. Please do not use it")
self.workdir = workdir self.workdir = workdir
...@@ -259,9 +259,8 @@ class WebService(object): ...@@ -259,9 +259,8 @@ class WebService(object):
self.port_list = [] self.port_list = []
self.use_trt = use_trt self.use_trt = use_trt
self.gpu_multi_stream = gpu_multi_stream self.gpu_multi_stream = gpu_multi_stream
self.op_num = op_num self.runtime_thread_num = runtime_thread_num
self.op_max_batch = op_max_batch self.batch_infer_size = batch_infer_size
# record port and pid info for stopping process # record port and pid info for stopping process
dump_pid_file([self.port], "web_service") dump_pid_file([self.port], "web_service")
# if gpuid != None, we will use gpuid first. # if gpuid != None, we will use gpuid first.
......