add grpc impl (like web service) (!654) · Merge request · PaddlePaddle / Serving


Created by: barrierye

Compatibility testing

To support macOS and Windows, the location of the from .serving_client import PredictorRes import has to be updated.

CentOS
macOS
Windows

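Performance testing (ERNIE service on GPU)

There are two protobuf data layouts, selected via is_python in Request:

The regular version uses basic data units so that clients in other languages can be supported, but it is slow (147% more time than the original brpc version).
The ndarray version relies on numpy and stores ndarray.tobytes(). Its performance is close to the original brpc version (only 6% more time), but the ndarray bytes use a special encoding, so this version can currently be used only from Python.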
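The difference between the two layouts can be sketched as follows. This is a minimal illustration, not the PR's code: the dictionaries stand in for the protobuf messages, and the field names (float_data, data, shape, dtype) are hypothetical. It only shows why the second layout is cheaper to pack: one bytes blob replaces a per-element conversion, at the cost that the receiver has to know the dtype and shape to decode it.

```python
import numpy as np

# Hypothetical stand-ins for the two request layouts; the real protobuf
# message and field names used by the PR are not shown in this description.

def pack_plain(arr):
    """Regular layout: every element becomes a Python scalar (like a repeated float field)."""
    return {"float_data": arr.ravel().tolist(), "shape": list(arr.shape)}

def unpack_plain(msg):
    """Rebuild the array element by element from the scalar list."""
    return np.array(msg["float_data"], dtype=np.float32).reshape(msg["shape"])

def pack_ndarray(arr):
    """ndarray layout: one raw bytes blob plus the metadata needed to decode it."""
    return {"data": arr.tobytes(), "shape": list(arr.shape), "dtype": str(arr.dtype)}

def unpack_ndarray(msg):
    """Reinterpret the raw bytes using the recorded dtype and shape."""
    return np.frombuffer(msg["data"], dtype=msg["dtype"]).reshape(msg["shape"])

sample = np.arange(12, dtype=np.float32).reshape(3, 4)
plain_roundtrip = unpack_plain(pack_plain(sample))
ndarray_roundtrip = unpack_ndarray(pack_ndarray(sample))
```

Both round trips return an array equal to sample; the ndarray layout carries 48 bytes for the twelve float32 values.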

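Regular version

Total time for 1k predictions increased by ~407% (3.09s -> 15.67s)~ 147% (30.73s -> 75.84s). Per prediction:

Multi-language client preprocessing increased by ~230.1 us~ 84.2 us
Multi-language client inference increased by ~111.777 ms~ 31.335 ms
Multi-language client postprocessing increased by 14.124 ms

gRPC client interface: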

Timer unit: 1e-06 s
Total time: 75.8425 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   494                                               def predict(self, feed, fetch, need_variant_tag=False, asyn=False):
   495                                              #lp = LineProfiler()
   496                                              #lp_wrapper = lp(self._unpack_resp)
   497      1000      84178.0     84.2      0.1          req = self._pack_feed_data(feed, fetch)
   498      1000        729.0      0.7      0.0          if not asyn:
   499      1000   61633144.0  61633.1     81.3              resp = self.stub_.inference(req)
   500      1000   14124419.0  14124.4     18.6              return self._unpack_resp(resp, fetch, need_variant_tag)
   501                                                       #resp = lp_wrapper(resp, fetch, need_variant_tag)
   502                                                  #lp.print_stats()
   503                                                  return resp
   504                                                   else:
   505                                                       call_future = self.stub_.inference.future(req)
   506                                                       return MultiLangPredictFuture(
   507                                                           call_future, self._done_callback_func(fetch, need_variant_tag))

Original client interface:

Timer unit: 1e-06 s
Total time: 30.7256 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   314      1000       2118.0      2.1      0.0              res = self.client_handle_.numpy_predict(
   315      1000       1659.0      1.7      0.0                  float_slot_batch, float_feed_names, float_shape, int_slot_batch,
   316      1000       1394.0      1.4      0.0                  int_feed_names, int_shape, fetch_names, result_batch_handle,
   317      1000   30298326.0  30298.3     98.6                  self.pid)
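
The percentages and per-call deltas quoted for the regular version follow directly from the two profiler listings above. A short sketch of the arithmetic, with the numbers copied from those listings (the helper name relative_increase is only illustrative):

```python
def relative_increase(baseline_s, new_s):
    """Return the increase of new_s over baseline_s as a percentage."""
    return (new_s - baseline_s) / baseline_s * 100.0

# Total time for 1000 predictions, taken from the profiler output.
brpc_total_s = 30.7256
grpc_plain_total_s = 75.8425

# Per-hit time (microseconds) of the inference call in each client.
grpc_inference_us = 61633.1
brpc_inference_us = 30298.3

overall_pct = relative_increase(brpc_total_s, grpc_plain_total_s)
inference_delta_ms = (grpc_inference_us - brpc_inference_us) / 1000.0

print(f"total time increase: {overall_pct:.0f}%")
print(f"extra inference time per call: {inference_delta_ms:.3f} ms")
```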

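ndarray version

Total time for 1k predictions increased by ~3% (35.48s -> 36.60s)~ 6% (30.54s -> 32.37s). Per prediction:

Multi-language client preprocessing increased by ~61.0 us~ 73.0 us
Multi-language client inference increased by ~1.323 ms~ 1.986 ms
Multi-language client postprocessing increased by ~158.1 us~ 193.8 us

gRPC client interface: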

Timer unit: 1e-06 s
Total time: 32.3739 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   514                                               def predict(self,
   515                                                           feed,
   516                                                           fetch,
   517                                                           need_variant_tag=False,
   518                                                           asyn=False,
   519                                                           is_python=True):
   520      1000      72999.0     73.0      0.2          req = self._pack_feed_data(feed, fetch, is_python=is_python)
   521      1000        743.0      0.7      0.0          if not asyn:
   522      1000   32102981.0  32103.0     99.2              resp = self.stub_.inference(req)
   523      1000       1805.0      1.8      0.0              return self._unpack_resp(
   524      1000        487.0      0.5      0.0                  resp,
   525      1000        552.0      0.6      0.0                  fetch,
   526      1000        470.0      0.5      0.0                  is_python=is_python,
   527      1000     193834.0    193.8      0.6                  need_variant_tag=need_variant_tag)
   528                                                   else:
   529                                                       call_future = self.stub_.inference.future(req)
   530                                                       return MultiLangPredictFuture(
   531                                                           call_future,
   532                                                           self._done_callback_func(
   533                                                               fetch,
   534                                                               is_python=is_python,
   535                                                               need_variant_tag=need_variant_tag))
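
The asyn branch in the listing above hands the pending call to MultiLangPredictFuture together with a callback built from the fetch names. The class itself is not shown here, so the following is only a sketch of how such a wrapper might behave, assuming it applies the callback to the raw response when result() is called. PredictFutureSketch, fake_inference and make_unpack are hypothetical names, and a thread pool stands in for the gRPC stub.

```python
from concurrent.futures import ThreadPoolExecutor

class PredictFutureSketch:
    """Hypothetical stand-in for MultiLangPredictFuture: wraps a future and
    post-processes its result with a callback when result() is called."""

    def __init__(self, call_future, callback):
        self._call_future = call_future
        self._callback = callback

    def result(self, timeout=None):
        # Block until the underlying call finishes, then unpack the response.
        raw = self._call_future.result(timeout=timeout)
        return self._callback(raw)

def fake_inference(req):
    # Stands in for the remote call; returns doubled inputs.
    return {"out": [2 * x for x in req["x"]]}

def make_unpack(fetch):
    # Plays the role of _done_callback_func: bind the fetch names now, unpack later.
    def unpack(resp):
        return {name: resp[name] for name in fetch}
    return unpack

with ThreadPoolExecutor(max_workers=1) as pool:
    call_future = pool.submit(fake_inference, {"x": [1, 2, 3]})
    wrapped = PredictFutureSketch(call_future, make_unpack(["out"]))
    prediction = wrapped.result()
```

With this shape the caller can issue several requests before blocking on any of them, and the unpacking cost is paid only when a result is actually read.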