diff --git a/doc/ELASTIC_CTR.md b/doc/ELASTIC_CTR.md
index b2fc908b1a9edef75c67ad4241dece15909cd469..5cc3eb597b126890396da807161fba8218e71fd3 100755
--- a/doc/ELASTIC_CTR.md
+++ b/doc/ELASTIC_CTR.md
@@ -15,10 +15,10 @@ ELASTIC CTR
 
 本项目提供了端到端的CTR训练和二次开发的解决方案，主要特点：
 
-- 使用K8S集群解决原来在物理集群上训练时，会出现类似于配置参数冗杂，环境搭建繁复等问题。
-- 使用基于Kube-batch开发的Volcano框架来进行任务提交和弹性调度。
-- 使用Paddle Serving来进行模型的上线和预测。
-- 使用Cube作为稀疏参数的分布式存储，在预测端与Paddle Serving对接。
+- 整体方案在k8s环境一键部署，可快速搭建与验证效果
+- 基于Paddle transpiler模式的大规模分布式高速训练
+- 训练资源弹性伸缩
+- 工业级稀疏参数Serving组件，高并发条件下单位时间吞吐总量是redis的13倍 \[[注1](#annotation_1)\]
 
 本方案整体流程如下图所示：
 
@@ -41,7 +41,7 @@ ELASTIC CTR
 -   指定训练的规模，包括参数服务器的数量和训练节点的数量
 -   指定Cube参数服务器的分片数量和副本数量
 
-在本文第4部分会详细解释以上二次开发的实际操作。
+在本文第5节会详细解释以上二次开发的实际操作。
 
 本文主要内容：
 
@@ -348,3 +348,72 @@ $ docker push  ${DOCKER_IMAGE_NAME}
 
 
 关于Paddle Serving的完整开发模式，可参考[Serving从零开始写一个预测服务](https://github.com/PaddlePaddle/Serving/blob/develop/doc/CREATING.md)，以及[Paddle Serving的其他文档](https://github.com/PaddlePaddle/Serving/tree/develop/doc)
+
+# 注释
+
+## 注1. <span id='annotation_1'>Cube和redis性能对比测试环境</span>
+
+Cube和Redis均在百度云环境上部署，测试时只测试单个cube server和redis server节点的性能。
+
+client端和server端分别位于2台独立的云主机，机器间ping延时为0.3ms-0.5ms。
+
+机器配置：Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz 32核
+
+
+### Cube测试环境
+
+测试key 64bit整数，value为10个float （40字节）
+
+首先用本方案一键部署脚本部署完成。
+
+用Paddle Serving的cube客户端SDK，编写测试代码
+
+基本原理，启动k个线程，每个线程访问M次cube server，每次批量获取N个key，总时间加和求平均。
+
+并发数 （压测线程数） | batch size | 平均响应时间 (us) | total qps
+-------|------------|-------------|---------------------------
+1	| 1000 | 1312 | 762
+4	| 1000 | 1496 | 2674
+8	| 1000 | 1585 | 5047
+16 | 1000 | 1866 | 8574
+24 | 1000 | 2236 | 10733
+32 | 1000 | 2602 | 12298
+
+### Redis测试环境
+
+测试key 1-1000000之间随机整数，value为40字节字符串
+
+server端部署redis-sever (latest stable 5.0.6)
+
+client端为基于[redisplusplus](https://github.com/sewenew/redis-plus-plus)编写的客户端[get_values.cpp](https://github.com/PaddlePaddle/Serving/blob/master/doc/resource/get_value.cpp)
+
+基本原理：启动k个线程，每个线程访问M次redis server，每次用mget批量获取N个key。总时间加和求平均。
+
+调用方法：
+
+```bash
+$ ./get_values -h 192.168.1.1 -t 3 -r 10000 -b 1000
+```
+
+其中
+\-h server所在主机名
+\-t 并发线程数
+\-r 每线程请求次数
+\-b 每个mget请求的key个数
+
+并发数 （压测线程数） | batch size | 平均响应时间 (us) | total qps
+-------|------------|-------------|---------------------------
+1  | 1000 | 1159 | 862
+4  | 1000 | 3537  | 1079
+8  | 1000 | 7726  | 1073
+16 | 1000 | 15440  | 1034
+24 | 1000 | 24279  | 1004 
+32 | 1000 | 32570 | 996
+
+###测试结论
+
+由于Redis高效的时间驱动模型和全内存操作，在单并发时，redis平均响应时间比cube少接近50% (1100us vs. 1680us)
+
+在扩展性方面，redis受制于单线程模型，随并发数增加，响应时间加倍增加，而总吞吐在1000qps左右即不再上涨；而cube则随着压测并发数增加，总的qps一直上涨，说明cube能够较好处理并发请求，具有良好的扩展能力。
+
+
diff --git a/doc/PROFILING_CUBE.md b/doc/PROFILING_CUBE.md
new file mode 100644
index 0000000000000000000000000000000000000000..208ba17bf224f39d57f1168a38c97a033b26526e
--- /dev/null
+++ b/doc/PROFILING_CUBE.md
@@ -0,0 +1,80 @@
+Paddle Serving的CTR预估任务会定期将访问大规模稀疏参数服务的响应时间等统计信息打印出来。用户在k8s集群一键部署好分布式训练+Serving方案后，可以在容器内通过CTR预估任务demo观察serving访问稀疏参数服务的响应时间等信息。具体的观察方法如下：
+
+## 使用CTR预估任务客户端ctr_prediction向Serving发送批量请求
+
+因Serving端每1000个请求打印一次请求，为了观察输出结果，需要客户端向serving端发送较大量请求。具体做法：
+
+```bash
+# 进入pdservingclient pod
+$ kubectl exec -ti pdservingclient /bin/bash
+
+# 以下命令在pdservingclient这个pod内执行
+$ cd client/ctr_prediction/
+$ bin/ctr_prediction --enable_profiling --concurrency=4 --repeat=100
+```
+
+## Serving端日志
+
+```bash
+# 进入Serving端pod
+$ kubectl exec -ti paddleserving /bin/bash
+
+# 以下命令在Serving pod内执行
+$ grep 'Cube request count' log/serving.INFO -A 5 | more
+```
+
+示例输出：
+```
+I1014 12:57:20.699606    38 ctr_prediction_op.cpp:163] Cube request count: 1000
+I1014 12:57:20.699631    38 ctr_prediction_op.cpp:164] Cube request key count: 1300000
+I1014 12:57:20.699645    38 ctr_prediction_op.cpp:165] Cube request total time: 1465704us
+I1014 12:57:20.699666    38 ctr_prediction_op.cpp:166] Average 1465.7us/req
+I1014 12:57:20.699692    38 ctr_prediction_op.cpp:169] Average 1.12746us/key
+```
+
+## 说明
+
+Paddle Serving实例中打印出的访问cube的平响时间，与[cube社区版本性能报告](https://github.com/PaddlePaddle/Serving/blob/develop/cube/doc/performance.md)中报告的性能数值有差别，影响Paddle Serving访问cube的因素：
+
+1) 批量查询每个请求中key的个数
+
+从日志可看到，Paddle Serving接收到CTR预估任务client发送的请求中，平均每个请求批量key的个数为1300，而性能报告中实测的有批量1个key、100个key和1000个key等
+
+2) Serving实例所在机器CPU核数和客户端请求并发数
+
+假设Paddle Serving所在云服务器上CPU核数为4，则Paddle Serving本身默认会启动4个worker线程。在client端发送4个并发情况下，Serving端约为占满4个CPU核。但由于Serving又要启动新的channel/thread来访问cube（采用的是异步模式），这些和Serving本身的server端代码共用bthread资源，就会出现竞争的情况。
+
+以下是在CTR预估任务Client端向Serving发送不同并发请求数时，访问cube的平均响应时间 (1300key/req，分片数=1)
+
+线程数 | 访问cube的平均响应时间 (us)
+-------|-------
+1 | 1465
+2 | 1480
+3 | 1450
+4 | 1905
+
+可以看到，当并发数等于（和大于）CPU核数时，访问cube的响应时间就会变大。
+
+3) 稀疏参数字典分片数
+
+假设分片数为N，每次cube访问，都会生成N个channel，每个来对应一个分片的请求，这些channel和Serving内其他工作线程共用bthread资源。但同时，较多的分片数，每个分片参数服务节点上查询的计算量会变小，使得总体响应时间变小。
+
+以下是同一份词典分成1个分片和2个分片，serving端访问cube的平均响应时间 (1300key/req)
+
+分片数 | 访问cube的平均响应时间 (us)
+-------|--------------------------
+1 | 1680
+2 | 1450
+
+4) 网络环境
+
+百度云平台上机器间ping的时延平均为0.3ms - 0.5ms，在batch为1300个key时，平均响应时间为1450us
+
+Paddle Serving发布的[cube社区版本性能报告](https://github.com/PaddlePaddle/Serving/blob/develop/cube/doc/performance.md)中测试环境为裸机部署，给出的机器间ping时延为0.06ms，在batch为1000个key时，平均响应时间为675us/req
+
+两种环境的主要差别在于：
+
+(1) 机器间固有的通信延迟 (百度云上位0.3ms-0.5ms，性能报告测试环境0.06ms)
+
+(2) 字典分片数 (百度云上2个分片，性能报告测试环境10个分片)
+
diff --git a/doc/resource/get_value.cpp b/doc/resource/get_value.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..f7dd7411fc064baad965d71b1dd4225a8235245c
--- /dev/null
+++ b/doc/resource/get_value.cpp
@@ -0,0 +1,145 @@
+#include <unistd.h>
+#include <iostream>
+#include <string>
+#include <thread>
+
+#include "sw/redis++/redis++.h"
+
+std::string host = "127.0.0.0";
+int port = 6379;
+std::string auth;
+std::string cluster_node;
+int cluster_port = 0;
+bool benchmark = false;
+int key_len = 8;
+int val_len = 40;
+int total_request_num = 1000;
+int thread_num = 1;
+int pool_size = 5;
+int batch_size = 100;
+int key_size = 10000000;        // keys in redis server
+
+std::vector<uint64_t> times_us;
+
+sw::redis::Redis *redis;
+
+int parse_options(int argc, char **argv)
+{
+    int opt = 0;
+    while ((opt = getopt(argc, argv, "h:p:a:n:c:k:v:r:t:b:s:")) != -1) {
+        try {
+            switch (opt) {
+            case 'h':
+                host = optarg;
+                break;
+
+            case 'p':
+                port = std::stoi(optarg);
+                break;
+
+            case 'a':
+                auth = optarg;
+                break;
+
+            case 'n':
+                cluster_node = optarg;
+                break;
+
+            case 'c':
+                cluster_port = std::stoi(optarg);
+                break;
+
+            case 'b':
+                batch_size = std::stoi(optarg);
+                break;
+
+            case 'k':
+                key_len = std::stoi(optarg);
+                break;
+
+            case 'v':
+                val_len = std::stoi(optarg);
+                break;
+
+            case 'r':
+                total_request_num = std::stoi(optarg);
+                break;
+
+            case 't':
+                thread_num = std::stoi(optarg);
+                break;
+
+            case 's':
+                pool_size = std::stoi(optarg);
+                break;
+
+            default:
+                break;
+            }
+        } catch (const sw::redis::Error &e) {
+            std::cerr << "Unknow option" << std::endl;
+        } catch (const std::exception &e) {
+            std::cerr << "Invalid command line option" << std::endl;
+        }
+    }
+
+    return 0;
+}
+
+void thread_worker(int thread_id)
+{
+    // get values
+    for (int i = 0; i < total_request_num; ++i) {
+        std::vector<std::string> get_kvs;
+        std::vector<std::string> get_kvs_res;
+
+         for(int j = i * batch_size; j <  (i + 1) * batch_size; j++) {
+            get_kvs.push_back(std::to_string(i % key_size));
+        }
+        auto start2 = std::chrono::steady_clock::now();
+        redis->mget(get_kvs.begin(), get_kvs.end(), std::back_inserter(get_kvs_res));
+        auto stop2 = std::chrono::steady_clock::now();
+        times_us[thread_id] += std::chrono::duration_cast<std::chrono::microseconds>(stop2 - start2).count();
+    }
+
+    // Per-thread statistics
+    std::cout << total_request_num << " requests, " << batch_size << " keys per req, total time us = " << times_us[thread_id] <<std::endl;
+    std::cout << "Average " << times_us[thread_id] / total_request_num << "us per req" << std::endl;
+    std::cout << "qps: " << (double)total_request_num / times_us[thread_id] * 1000000 << std::endl;
+}
+
+int main(int argc, char **argv)
+{
+    parse_options(argc, argv);
+
+    std::string connstr = std::string("tcp://") + host + std::string(":") + std::to_string(port);
+    redis = new sw::redis::Redis(connstr);
+
+    std::vector<std::thread> workers;
+    times_us.reserve(thread_num);
+
+    for (int i = 0; i < thread_num; ++i) {
+        times_us[i] = 0;
+        workers.push_back(std::thread(thread_worker, i));
+    }
+
+    for (int i = 0; i < thread_num; ++i) {
+        workers[i].join();
+    }
+
+    // times_total_us is average running time of each thread
+    uint64_t times_total_us = 0;
+    for (int i = 0; i < thread_num; ++i) {
+        times_total_us += times_us[i];
+    }
+    times_total_us /= thread_num;
+    
+    // Total requests should be sum of requests sent by each thread
+    total_request_num *= thread_num;
+
+    std::cout << total_request_num << " requests, " << batch_size << " keys per req, total time us = " << times_total_us <<std::endl;
+    std::cout << "Average " << times_total_us / total_request_num << "us per req" << std::endl;
+    std::cout << "qps: " << (double)total_request_num / times_total_us * 1000000 << std::endl;
+
+    return 0;
+}