diff --git a/doc/C++_Serving/Request_Cache_CN.md b/doc/C++_Serving/Request_Cache_CN.md
new file mode 100644
index 0000000000000000000000000000000000000000..12322cfa87a41a4c251471ecb1a5d21ce37edd3e
--- /dev/null
+++ b/doc/C++_Serving/Request_Cache_CN.md
@@ -0,0 +1,15 @@
+# Request Cache
+
+This document describes the request cache feature and how it is implemented.
+
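+A request to the service consists of the input tensors (tensor), the names of the results to fetch (fetch_var_names), a debug switch (profile_server), and an identifier (log_id); the prediction result contains the output tensors and related data. The cache stores request/result pairs as key-value entries. When a request hits the cache, the service skips model inference and returns the result directly from the cache. In certain scenarios this can significantly reduce request latency.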
+
+The cache is enabled with the `--request_cache_size` flag. The flag defaults to 0, which leaves the cache disabled. Setting it to a non-zero value enables the cache with that value, in bytes, as its storage limit.
+
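+The cache key is a 64-bit integer hash generated from the tensor and fetch_var_names data in the request. On a hit, the cached result is retrieved and used to build the response. On a miss, the service runs model inference and stores the result in the cache as it returns the response. Because the cache has a storage limit, an eviction policy is needed to bound its size; the service currently evicts entries using a least recently used (LRU) policy.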
+
+## Notes
+
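+ - Only requests whose prediction succeeds are cached. If a request fails or an error is returned during prediction, its result is not cached.
+ - The cache is keyed on a hash of the request data. Two different requests can therefore produce the same hash (a hash collision), in which case the service may return an incorrect response. Because the hash is 64 bits wide, collisions are unlikely.
+ - The cache works in both synchronous and asynchronous mode.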
diff --git a/doc/Serving_Configure_CN.md b/doc/Serving_Configure_CN.md
index 9564dcbd51f7e280cd19c13f71885c5b9fcc2064..4288970afdc8df87558ad6b8a01f630b94df63c8 100644
--- a/doc/Serving_Configure_CN.md
+++ b/doc/Serving_Configure_CN.md
@@ -100,6 +100,7 @@ workdir_9393
 | `use_calib`                                    | bool | False   | Use TRT int8 calibration                              |
 | `gpu_multi_stream`                             | bool | False   | EnableGpuMultiStream to get larger QPS                |
 | `use_ascend_cl`                                | bool | False   | Enable for ascend910; Use with use_lite for ascend310 |
+| `request_cache_size`                           | int  | `0`     | Size of the request cache in bytes. The default of 0 disables the cache |
 
 #### 当您的某个模型想使用多张GPU卡部署时.
 ```BASH
diff --git a/doc/Serving_Configure_EN.md b/doc/Serving_Configure_EN.md
index 04c4ad18fb54192bad587feff04635f4e7a1e6d7..68c52cffe690d8a97512adef0a4c073ffa23824b 100644
--- a/doc/Serving_Configure_EN.md
+++ b/doc/Serving_Configure_EN.md
@@ -100,6 +100,7 @@ More flags:
 | `use_calib`                                    | bool | False   | Use TRT int8 calibration                              |
 | `gpu_multi_stream`                             | bool | False   | EnableGpuMultiStream to get larger QPS                |
 | `use_ascend_cl`                                | bool | False   | Enable for ascend910; Use with use_lite for ascend310 |
+| `request_cache_size`                           | int  | `0`     | Size of the request cache in bytes. The default of 0 disables the cache |
 
 #### Serving model with multiple gpus.
 ```BASH
diff --git a/python/paddle_serving_server/server.py b/python/paddle_serving_server/server.py
index e369c57d4d350207d65d048a96eb052db279bd30..45f826470fff16cfab577caa4937dc81de61a2e9 100755
--- a/python/paddle_serving_server/server.py
+++ b/python/paddle_serving_server/server.py
@@ -599,9 +599,7 @@ class Server(object):
                     "-workflow_path {} " \
                     "-workflow_file {} " \
                     "-bthread_concurrency {} " \
-                    "-max_body_size {} " \
-                    "-enable_prometheus={} " \
-                    "-prometheus_port {} ".format(
+                    "-max_body_size {} ".format(
                         self.bin_path,
                         self.workdir,
                         self.infer_service_fn,
@@ -616,9 +614,7 @@ class Server(object):
                         self.workdir,
                         self.workflow_fn,
                         self.num_threads,
-                        self.max_body_size,
-                        self.enable_prometheus,
-                        self.prometheus_port)
+                        self.max_body_size)
         if self.enable_prometheus:
             command =   command + \
                         "-enable_prometheus={} " \