diff --git a/README.md b/README.md
index f08da24e2f6f0e6d2c7e3632bf27da3e0c20565e..a8ab968086f8986fc792baa134c7b15079615316 100644
--- a/README.md
+++ b/README.md
@@ -8,8 +8,66 @@ PaddlePaddle provides a rich set of computational units to enable users to adopt
- [fluid models](fluid): use PaddlePaddle's Fluid APIs. We especially recommend users to use Fluid models.
-- [legacy models](legacy): use PaddlePaddle's v2 APIs.
+PaddlePaddle provides a rich set of computational units that let users tackle a wide range of learning problems in a modular way. This repo demonstrates how to solve common machine learning tasks with PaddlePaddle and offers a number of neural network models that are easy to learn and use.
+
+- [fluid models](fluid): use the PaddlePaddle Fluid APIs. We especially recommend using these Fluid models.
+
+## PaddleCV
+Model|Description|Strengths|Reference
+--|:--:|:--:|:--:
+[AlexNet](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification/models)|Classic image classification model|First successful use of ReLU, Dropout, and LRN in a CNN, with GPU-accelerated training|[ImageNet Classification with Deep Convolutional Neural Networks](https://www.researchgate.net/publication/267960550_ImageNet_Classification_with_Deep_Convolutional_Neural_Networks)
+[VGG](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification/models)|Classic image classification model|Builds on AlexNet with small 3x3 convolution kernels and a deeper network, achieving strong generalization|[Very Deep Convolutional Networks for Large-Scale Image Recognition](https://arxiv.org/pdf/1409.1556.pdf)
+[GoogleNet](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification/models)|Classic image classification model|Increases network depth and width without increasing the computational load, for superior performance|[Going deeper with convolutions](https://ieeexplore.ieee.org/document/7298594)
+[ResNet](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification/models)|Residual network|Introduces a novel residual structure that solves the problem of accuracy degrading as networks get deeper|[Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385)
+[Inception-v4](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification/models)|Classic image classification model|A deeper and wider Inception architecture|[Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning](http://arxiv.org/abs/1602.07261)
+[MobileNet](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification/models)|Lightweight network|An efficient model designed for mobile and embedded devices|[MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861)
+[DPN](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification/models)|Image classification model|Combines the DenseNet and ResNeXt architectures to improve classification accuracy|[Dual Path Networks](https://arxiv.org/abs/1707.01629)
+[SE-ResNeXt](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification/models)|Image classification model|Adds SE blocks to ResNeXt, improving model accuracy|[Squeeze-and-Excitation Networks](https://arxiv.org/abs/1709.01507)
+[SSD](https://github.com/PaddlePaddle/models/blob/develop/fluid/PaddleCV/object_detection/README_cn.md)|Single-stage object detector|Detects objects of matching scales on feature maps of different resolutions, and can be conveniently plugged into any standard convolutional network|[SSD: Single Shot MultiBox Detector](https://arxiv.org/abs/1512.02325)
+[Face Detector: PyramidBox](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/face_detection/README_cn.md)|SSD-based single-stage face detector|Exploits contextual information to detect hard faces, with strong representational power and robustness|[PyramidBox: A Context-assisted Single Shot Face Detector](https://arxiv.org/pdf/1803.07737.pdf)
+[Faster RCNN](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/faster_rcnn/README_cn.md)|Classic two-stage object detector|Innovatively generates region proposals with a convolutional network that shares features with the detection network, reducing the number of proposals while raising their quality|[Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks](https://arxiv.org/abs/1506.01497)
+[ICNet](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/icnet)|Real-time semantic segmentation model|Accounts for both speed and accuracy, striking a balance between accuracy on high-resolution images and the efficiency of low-complexity networks|[ICNet for Real-Time Semantic Segmentation on High-Resolution Images](https://arxiv.org/abs/1704.08545)
+[DCGAN](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/gan/c_gan)|Image generation model|Deep convolutional GAN that combines GANs with convolutional networks to mitigate unstable GAN training|[Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks](https://arxiv.org/pdf/1511.06434.pdf)
+[ConditionalGAN](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/gan/c_gan)|Image generation model|Conditional GAN that constrains the model with extra information to guide the data generation process|[Conditional Generative Adversarial Nets](https://arxiv.org/abs/1411.1784)
+[CycleGAN](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/gan/cycle_gan)|Image-to-image translation model|Automatically translates images of one class into another class; applicable to style transfer|[Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks](https://arxiv.org/abs/1703.10593)
+[CRNN-CTC model](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/ocr_recognition)|Scene text recognition model|Uses a CTC model to recognize single-line English text in images|[Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks](https://www.researchgate.net/publication/221346365_Connectionist_temporal_classification_Labelling_unsegmented_sequence_data_with_recurrent_neural_networks)
+[Attention model](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/ocr_recognition)|Scene text recognition model|Uses an attention mechanism to recognize single-line English text in images|[Recurrent Models of Visual Attention](https://arxiv.org/abs/1406.6247)
+[Metric Learning](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/metric_learning)|Metric learning model|Analyzes association and comparison relations between objects; useful as an aid to classification and clustering, and widely applied to image retrieval, face recognition, and more|-
+[TSN](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/video_classification)|Video classification model|Models long-range temporal structure, combining a sparse temporal sampling strategy with video-level supervision for effective and efficient learning from whole videos|[Temporal Segment Networks: Towards Good Practices for Deep Action Recognition](https://arxiv.org/abs/1608.00859)
+[caffe2fluid](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/caffe2fluid)|Tool for converting Caffe models into Paddle Fluid configuration and model files|-|-
+
+## PaddleNLP
+Model|Description|Strengths|Reference
+--|:--:|:--:|:--:
+[Transformer](https://github.com/PaddlePaddle/models/blob/develop/fluid/PaddleNLP/neural_machine_translation/transformer/README_cn.md)|Machine translation model|Based on self-attention: low computational complexity, high parallelism, easily learns long-range dependencies, and translates better|[Attention Is All You Need](https://arxiv.org/abs/1706.03762)
+[LAC](https://github.com/baidu/lac/blob/master/README.md)|Joint lexical analysis model|Performs Chinese word segmentation, part-of-speech tagging, and named entity recognition as a single, unified task|[Chinese Lexical Analysis with Deep Bi-GRU-CRF Network](https://arxiv.org/abs/1807.01882)
+[Senta](https://github.com/baidu/Senta/blob/master/README.md)|Sentiment analysis model suite|The sentiment analysis models behind the Baidu AI open platform|-
+[DAM](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleNLP/deep_attention_matching_net)|Semantic matching model|Work from Baidu's NLP department published at ACL 2018, used for response selection in multi-turn dialogues of retrieval-based chatbots|[Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network](http://aclweb.org/anthology/P18-1103)
+[SimNet](https://github.com/baidu/AnyQ/blob/master/tools/simnet/train/paddle/README.md)|Semantic matching framework|Models built with SimNet can be conveniently plugged into the AnyQ system to strengthen its semantic matching capability|-
+[DuReader](https://github.com/PaddlePaddle/models/blob/develop/fluid/PaddleNLP/machine_reading_comprehension/README.md)|Reading comprehension model|Machine reading comprehension model for the Baidu MRC dataset|-
+[Bi-GRU-CRF](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleNLP/sequence_tagging_for_ner/README.md)|Named entity recognition|NER model combining a CRF with a bidirectional GRU|-
+
+## PaddleRec
+Model|Description|Strengths|Reference
+--|:--:|:--:|:--:
+[TagSpace](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleRec/tagspace)|Embedding model for text and tags|Industrial-scale tag recommendation, e.g. recommending tags for feed news|[#TagSpace: Semantic embeddings from hashtags](https://www.bibsonomy.org/bibtex/0ed4314916f8e7c90d066db45c293462)
+[GRU4Rec](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleRec/gru4rec)|Personalized recommendation model|First to apply RNNs (GRU) to session-based recommendation, with clear gains over traditional KNN and matrix factorization|[Session-based Recommendations with Recurrent Neural Networks](https://arxiv.org/abs/1511.06939)
+[SSR](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleRec/ssr)|Sequential semantic retrieval recommendation model|Follows the idea of the referenced paper, predicting user behavior at multiple temporal granularities|[Multi-Rate Deep Learning for Temporal Recommendation](https://dl.acm.org/citation.cfm?id=2914726)
+[DeepCTR](https://github.com/PaddlePaddle/models/blob/develop/fluid/PaddleRec/ctr/README.cn.md)|Click-through rate prediction model|Implements only the DNN part of the model described in the DeepFM paper; the full DeepFM will be provided in another example|[DeepFM: A Factorization-Machine based Neural Network for CTR Prediction](https://arxiv.org/abs/1703.04247)
+[Multiview-Simnet](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleRec/multiview_simnet)|Personalized recommendation model|Based on multiple views, merges multiple feature views of users and items into one unified model|[A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems](http://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/frp1159-songA.pdf)
+
+## Other Models
+Model|Description|Strengths|Reference
+--|:--:|:--:|:--:
+[DeepASR](https://github.com/PaddlePaddle/models/blob/develop/fluid/DeepASR/README_cn.md)|Speech recognition system|Configures and trains the acoustic model of a speech recognition system with the Fluid framework, and integrates Kaldi's decoder|-
+[DQN](https://github.com/PaddlePaddle/models/blob/develop/fluid/DeepQNetwork/README_cn.md)|Deep Q-Network|Value-based reinforcement learning algorithm; the first model to successfully combine deep learning with reinforcement learning|[Human-level control through deep reinforcement learning](https://www.nature.com/articles/nature14236)
+[DoubleDQN](https://github.com/PaddlePaddle/models/blob/develop/fluid/DeepQNetwork/README_cn.md)|DQN variant|Applies the Double Q idea to DQN to address the over-estimation problem|[Deep Reinforcement Learning with Double Q-Learning](https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/viewPaper/12389)
+[DuelingDQN](https://github.com/PaddlePaddle/models/blob/develop/fluid/DeepQNetwork/README_cn.md)|DQN variant|Improves on the DQN model, raising its performance|[Dueling Network Architectures for Deep Reinforcement Learning](http://proceedings.mlr.press/v48/wangf16.html)
## License
This tutorial is contributed by [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) and licensed under the [Apache-2.0 license](LICENSE).
diff --git a/fluid/PaddleCV/HiNAS_models/nn_paddle.py b/fluid/PaddleCV/HiNAS_models/nn_paddle.py
index d56bca5f156f47dccad07d32e7ad9d383d3dd459..d3a3ddd60cf3e5e114de322f3eea763e5a2e6018 100755
--- a/fluid/PaddleCV/HiNAS_models/nn_paddle.py
+++ b/fluid/PaddleCV/HiNAS_models/nn_paddle.py
@@ -21,6 +21,7 @@ import math
import numpy as np
import paddle
import paddle.fluid as fluid
+from paddle.fluid.contrib.trainer import *
from paddle.fluid.layers.learning_rate_scheduler import _decay_step_counter
import reader
@@ -104,7 +105,7 @@ class Model(object):
accs = []
def event_handler(event):
- if isinstance(event, fluid.EndStepEvent):
+ if isinstance(event, EndStepEvent):
costs.append(event.metrics[0])
accs.append(event.metrics[1])
if event.step % 20 == 0:
@@ -113,7 +114,7 @@ class Model(object):
del costs[:]
del accs[:]
- if isinstance(event, fluid.EndEpochEvent):
+ if isinstance(event, EndEpochEvent):
if event.epoch % 3 == 0 or event.epoch == FLAGS.num_epochs - 1:
avg_cost, accuracy = trainer.test(
reader=test_reader, feed_order=['pixel', 'label'])
@@ -126,7 +127,7 @@ class Model(object):
event_handler.best_acc = 0.0
place = fluid.CUDAPlace(0)
- trainer = fluid.Trainer(
+ trainer = Trainer(
train_func=self.train_network,
optimizer_func=self.optimizer_program,
place=place)
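
The change above tracks the relocation of Fluid's high-level training API: `Trainer`, `EndStepEvent`, and `EndEpochEvent` moved from the `fluid` root into `paddle.fluid.contrib.trainer`. For readers unfamiliar with that API, here is a minimal, self-contained sketch of the event-handler pattern this file relies on; the MNIST network, optimizer, and hyperparameters are illustrative stand-ins, not the HiNAS model:

```python
# Minimal sketch of the contrib Trainer event loop (pre-1.0 Fluid API).
import paddle
import paddle.fluid as fluid
from paddle.fluid.contrib.trainer import Trainer, EndStepEvent, EndEpochEvent

def train_network():
    # Toy classifier; a real model builds its own forward pass and loss here.
    img = fluid.layers.data(name='pixel', shape=[1, 28, 28], dtype='float32')
    label = fluid.layers.data(name='label', shape=[1], dtype='int64')
    probs = fluid.layers.fc(input=img, size=10, act='softmax')
    loss = fluid.layers.mean(fluid.layers.cross_entropy(input=probs, label=label))
    acc = fluid.layers.accuracy(input=probs, label=label)
    return [loss, acc]  # the first element must be the loss to optimize

def optimizer_program():
    return fluid.optimizer.Adam(learning_rate=1e-3)

def event_handler(event):
    # Called by the Trainer after every step and every epoch.
    if isinstance(event, EndStepEvent):
        print("step %d, metrics %s" % (event.step, event.metrics))
    elif isinstance(event, EndEpochEvent):
        print("epoch %d finished" % event.epoch)

trainer = Trainer(train_func=train_network,
                  optimizer_func=optimizer_program,
                  place=fluid.CPUPlace())  # the model above uses CUDAPlace(0)
trainer.train(reader=paddle.batch(paddle.dataset.mnist.train(), batch_size=64),
              num_epochs=1,
              event_handler=event_handler,
              feed_order=['pixel', 'label'])
```
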
diff --git a/fluid/PaddleCV/caffe2fluid/kaffe/paddle/network.py b/fluid/PaddleCV/caffe2fluid/kaffe/paddle/network.py
index aa910797af66d79046e753b9039be4bffc6cc1ab..718bd196fa107b7adf20ff09d1ec192b090af8cd 100644
--- a/fluid/PaddleCV/caffe2fluid/kaffe/paddle/network.py
+++ b/fluid/PaddleCV/caffe2fluid/kaffe/paddle/network.py
@@ -440,7 +440,8 @@ class Network(object):
if need_transpose:
order = range(dims)
- order.remove(axis).append(axis)
+ order.remove(axis)
+ order.append(axis)
input = fluid.layers.transpose(
input,
perm=order,
@@ -525,11 +526,21 @@ class Network(object):
scale_shape = input.shape[axis:axis + num_axes]
param_attr = fluid.ParamAttr(name=prefix + 'scale')
scale_param = fluid.layers.create_parameter(
- shape=scale_shape, dtype=input.dtype, name=name, attr=param_attr)
+ shape=scale_shape,
+ dtype=input.dtype,
+ name=name,
+ attr=param_attr,
+ is_bias=True,
+ default_initializer=fluid.initializer.Constant(value=1.0))
offset_attr = fluid.ParamAttr(name=prefix + 'offset')
offset_param = fluid.layers.create_parameter(
- shape=scale_shape, dtype=input.dtype, name=name, attr=offset_attr)
+ shape=scale_shape,
+ dtype=input.dtype,
+ name=name,
+ attr=offset_attr,
+ is_bias=True,
+ default_initializer=fluid.initializer.Constant(value=0.0))
output = fluid.layers.elementwise_mul(
input,
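
For context, the edit above gives the converted Scale layer's parameters sensible defaults: the multiplicative parameter is created with `is_bias=True` and initialized to 1.0, the additive one to 0.0, so a freshly converted Scale layer starts out as an identity. A hedged, stand-alone sketch of the same pattern (the helper name and fixed channel axis are illustrative):

```python
# Sketch of a Caffe-style Scale layer in Fluid: per-channel scale initialized
# to 1.0 and offset initialized to 0.0, applied along the channel axis.
import paddle.fluid as fluid

def scale_layer(input, axis=1):
    c = input.shape[axis]
    scale = fluid.layers.create_parameter(
        shape=[c], dtype=input.dtype, is_bias=True,
        default_initializer=fluid.initializer.Constant(value=1.0))
    offset = fluid.layers.create_parameter(
        shape=[c], dtype=input.dtype, is_bias=True,
        default_initializer=fluid.initializer.Constant(value=0.0))
    out = fluid.layers.elementwise_mul(input, scale, axis=axis)
    return fluid.layers.elementwise_add(out, offset, axis=axis)
```
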
diff --git a/fluid/PaddleCV/deeplabv3+/.run_ce.sh b/fluid/PaddleCV/deeplabv3+/.run_ce.sh
new file mode 100755
index 0000000000000000000000000000000000000000..540fb964ba94fd29dc28bb51342cdba839d433e7
--- /dev/null
+++ b/fluid/PaddleCV/deeplabv3+/.run_ce.sh
@@ -0,0 +1,28 @@
+#!/bin/bash
+
+export MKL_NUM_THREADS=1
+export OMP_NUM_THREADS=1
+
+DATASET_PATH=${HOME}/.cache/paddle/dataset/cityscape/
+
+cudaid=${deeplabv3plus:=0} # use 0-th card as default
+export CUDA_VISIBLE_DEVICES=$cudaid
+
+FLAGS_benchmark=true python train.py \
+--batch_size=2 \
+--train_crop_size=769 \
+--total_step=50 \
+--save_weights_path=output1 \
+--dataset_path=$DATASET_PATH \
+--enable_ce | python _ce.py
+
+cudaid=${deeplabv3plus_m:=0,1,2,3} # use 0,1,2,3 card as default
+export CUDA_VISIBLE_DEVICES=$cudaid
+
+FLAGS_benchmark=true python train.py \
+--batch_size=2 \
+--train_crop_size=769 \
+--total_step=50 \
+--save_weights_path=output4 \
+--dataset_path=$DATASET_PATH \
+--enable_ce | python _ce.py
diff --git a/fluid/PaddleCV/deeplabv3+/README.md b/fluid/PaddleCV/deeplabv3+/README.md
index 9ff68ab8c1ded0eb41078886aac7a1ec49f02355..97e1600db9ff6e2f9de2a254681a2eacb2f9359b 100644
--- a/fluid/PaddleCV/deeplabv3+/README.md
+++ b/fluid/PaddleCV/deeplabv3+/README.md
@@ -76,7 +76,7 @@ python ./train.py \
--train_crop_size=769 \
--total_step=90000 \
--init_weights_path=deeplabv3plus_xception65_initialize.params \
- --save_weights_path=output \
+ --save_weights_path=output/ \
--dataset_path=$DATASET_PATH
```
diff --git a/fluid/PaddleRec/ssr/__init__.py b/fluid/PaddleCV/deeplabv3+/__init__.py
similarity index 100%
rename from fluid/PaddleRec/ssr/__init__.py
rename to fluid/PaddleCV/deeplabv3+/__init__.py
diff --git a/fluid/PaddleCV/deeplabv3+/_ce.py b/fluid/PaddleCV/deeplabv3+/_ce.py
new file mode 100644
index 0000000000000000000000000000000000000000..b0127d6445213b9d3934220fa36e9eb44d3e04b4
--- /dev/null
+++ b/fluid/PaddleCV/deeplabv3+/_ce.py
@@ -0,0 +1,60 @@
+# this file is only used for continuous evaluation test!
+
+import os
+import sys
+sys.path.append(os.environ['ceroot'])
+from kpi import CostKpi
+from kpi import DurationKpi
+
+each_pass_duration_card1_kpi = DurationKpi('each_pass_duration_card1', 0.1, 0, actived=True)
+train_loss_card1_kpi = CostKpi('train_loss_card1', 0.05, 0)
+each_pass_duration_card4_kpi = DurationKpi('each_pass_duration_card4', 0.1, 0, actived=True)
+train_loss_card4_kpi = CostKpi('train_loss_card4', 0.05, 0)
+
+tracking_kpis = [
+ each_pass_duration_card1_kpi,
+ train_loss_card1_kpi,
+ each_pass_duration_card4_kpi,
+ train_loss_card4_kpi,
+ ]
+
+
+def parse_log(log):
+ '''
+ This method should be implemented by model developers.
+
+ The suggestion:
+
+ each line in the log should be key, value, for example:
+
+ "
+ train_cost\t1.0
+ test_cost\t1.0
+ train_cost\t1.0
+ train_cost\t1.0
+ train_acc\t1.2
+ "
+ '''
+ for line in log.split('\n'):
+ fs = line.strip().split('\t')
+ print(fs)
+ if len(fs) == 3 and fs[0] == 'kpis':
+ kpi_name = fs[1]
+ kpi_value = float(fs[2])
+ yield kpi_name, kpi_value
+
+
+def log_to_ce(log):
+ kpi_tracker = {}
+ for kpi in tracking_kpis:
+ kpi_tracker[kpi.name] = kpi
+
+ for (kpi_name, kpi_value) in parse_log(log):
+ print(kpi_name, kpi_value)
+ kpi_tracker[kpi_name].add_record(kpi_value)
+ kpi_tracker[kpi_name].persist()
+
+
+if __name__ == '__main__':
+ log = sys.stdin.read()
+ log_to_ce(log)
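
Note the contract this script assumes: only stdout lines of the form `kpis<TAB><name><TAB><value>` become KPI records; everything else is printed and skipped. A small illustration using the `parse_log` defined above (the values are made up):

```python
# Only tab-separated lines starting with "kpis" yield (name, value) records.
sample_log = "\n".join([
    "step 10, loss: 0.532",                  # ignored: not a kpis line
    "kpis\ttrain_loss_card1\t0.532",         # parsed
    "kpis\teach_pass_duration_card1\t12.7",  # parsed
])
print(list(parse_log(sample_log)))
# returns [('train_loss_card1', 0.532), ('each_pass_duration_card1', 12.7)]
```
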
diff --git a/fluid/PaddleCV/deeplabv3+/eval.py b/fluid/PaddleCV/deeplabv3+/eval.py
index 624159a54d3ff55e29d9f5ac71c673e5e396d9e7..5699f2fac3ff52e39932eba71e8d25a189bf8fc6 100644
--- a/fluid/PaddleCV/deeplabv3+/eval.py
+++ b/fluid/PaddleCV/deeplabv3+/eval.py
@@ -26,6 +26,7 @@ def add_arguments():
add_argument('dataset_path', str, None, "Cityscape dataset path.")
add_argument('verbose', bool, False, "Print mIoU for each step if verbose.")
add_argument('use_gpu', bool, True, "Whether use GPU or CPU.")
+ add_argument('num_classes', int, 19, "Number of classes.")
def mean_iou(pred, label):
@@ -69,7 +70,7 @@ tp = fluid.Program()
batch_size = 1
reader.default_config['crop_size'] = -1
reader.default_config['shuffle'] = False
-num_classes = 19
+num_classes = args.num_classes
with fluid.program_guard(tp, sp):
img = fluid.layers.data(name='img', shape=[3, 0, 0], dtype='float32')
@@ -84,7 +85,7 @@ tp = tp.clone(True)
fluid.memory_optimize(
tp,
print_log=False,
- skip_opt_set=[pred.name, miou, out_wrong, out_correct],
+ skip_opt_set=set([pred.name, miou, out_wrong, out_correct]),
level=1)
place = fluid.CPUPlace()
diff --git a/fluid/PaddleCV/deeplabv3+/models.py b/fluid/PaddleCV/deeplabv3+/models.py
index feca2142293ee2169fbe0d2bdc82f1d950af00de..c1ea12296af3e9b6e0bb783cfa10efe5adfa15aa 100644
--- a/fluid/PaddleCV/deeplabv3+/models.py
+++ b/fluid/PaddleCV/deeplabv3+/models.py
@@ -20,6 +20,11 @@ op_results = {}
default_epsilon = 1e-3
default_norm_type = 'bn'
default_group_number = 32
+depthwise_use_cudnn = False
+
+bn_regularizer = fluid.regularizer.L2DecayRegularizer(regularization_coeff=0.0)
+depthwise_regularizer = fluid.regularizer.L2DecayRegularizer(
+ regularization_coeff=0.0)
@contextlib.contextmanager
@@ -52,20 +57,39 @@ def append_op_result(result, name):
def conv(*args, **kargs):
- kargs['param_attr'] = name_scope + 'weights'
+ if "xception" in name_scope:
+ init_std = 0.09
+ elif "logit" in name_scope:
+ init_std = 0.01
+ elif name_scope.endswith('depthwise/'):
+ init_std = 0.33
+ else:
+ init_std = 0.06
+ if name_scope.endswith('depthwise/'):
+ regularizer = depthwise_regularizer
+ else:
+ regularizer = None
+
+ kargs['param_attr'] = fluid.ParamAttr(
+ name=name_scope + 'weights',
+ regularizer=regularizer,
+ initializer=fluid.initializer.TruncatedNormal(
+ loc=0.0, scale=init_std))
if 'bias_attr' in kargs and kargs['bias_attr']:
- kargs['bias_attr'] = name_scope + 'biases'
+ kargs['bias_attr'] = fluid.ParamAttr(
+ name=name_scope + 'biases',
+ regularizer=regularizer,
+ initializer=fluid.initializer.ConstantInitializer(value=0.0))
else:
kargs['bias_attr'] = False
+ kargs['name'] = name_scope + 'conv'
return append_op_result(fluid.layers.conv2d(*args, **kargs), 'conv')
def group_norm(input, G, eps=1e-5, param_attr=None, bias_attr=None):
- helper = fluid.layer_helper.LayerHelper('group_norm', **locals())
-
N, C, H, W = input.shape
if C % G != 0:
- print("group can not divide channle:", C, G)
+ # print "group can not divide channle:", C, G
for d in range(10):
for t in [d, -d]:
if G + t <= 0: continue
@@ -73,29 +97,16 @@ def group_norm(input, G, eps=1e-5, param_attr=None, bias_attr=None):
G = G + t
break
if C % G == 0:
- print("use group size:", G)
+ # print "use group size:", G
break
assert C % G == 0
- param_shape = (G, )
- x = input
- x = fluid.layers.reshape(x, [N, G, C // G * H * W])
- mean = fluid.layers.reduce_mean(x, dim=2, keep_dim=True)
- x = x - mean
- var = fluid.layers.reduce_mean(fluid.layers.square(x), dim=2, keep_dim=True)
- x = x / fluid.layers.sqrt(var + eps)
-
- scale = helper.create_parameter(
- attr=helper.param_attr,
- shape=param_shape,
- dtype='float32',
- default_initializer=fluid.initializer.Constant(1.0))
-
- bias = helper.create_parameter(
- attr=helper.bias_attr, shape=param_shape, dtype='float32', is_bias=True)
- x = fluid.layers.elementwise_add(
- fluid.layers.elementwise_mul(
- x, scale, axis=1), bias, axis=1)
- return fluid.layers.reshape(x, input.shape)
+ x = fluid.layers.group_norm(
+ input,
+ groups=G,
+ param_attr=param_attr,
+ bias_attr=bias_attr,
+ name=name_scope + 'group_norm')
+ return x
def bn(*args, **kargs):
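
The hand-rolled normalization above is replaced by the built-in `fluid.layers.group_norm`; what remains is the fallback that nudges the group count until it divides the channel count. A self-contained sketch of that logic, assuming the same legacy `fluid.layers` API:

```python
# If the requested group count G does not divide the channel count C, search
# outward from G (G-1, G+1, G-2, G+2, ...) for the nearest count that does,
# then delegate to the built-in group normalization layer.
import paddle.fluid as fluid

def safe_group_norm(input, G, eps=1e-5, param_attr=None, bias_attr=None):
    N, C, H, W = input.shape
    if C % G != 0:
        for d in range(10):
            for t in [d, -d]:
                if G + t <= 0:
                    continue
                if C % (G + t) == 0:
                    G = G + t
                    break
            if C % G == 0:
                break
    assert C % G == 0
    return fluid.layers.group_norm(
        input, groups=G, epsilon=eps,
        param_attr=param_attr, bias_attr=bias_attr)
```
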
@@ -106,8 +117,10 @@ def bn(*args, **kargs):
*args,
epsilon=default_epsilon,
momentum=bn_momentum,
- param_attr=name_scope + 'gamma',
- bias_attr=name_scope + 'beta',
+ param_attr=fluid.ParamAttr(
+ name=name_scope + 'gamma', regularizer=bn_regularizer),
+ bias_attr=fluid.ParamAttr(
+ name=name_scope + 'beta', regularizer=bn_regularizer),
moving_mean_name=name_scope + 'moving_mean',
moving_variance_name=name_scope + 'moving_variance',
**kargs),
@@ -119,8 +132,10 @@ def bn(*args, **kargs):
args[0],
default_group_number,
eps=default_epsilon,
- param_attr=name_scope + 'gamma',
- bias_attr=name_scope + 'beta'),
+ param_attr=fluid.ParamAttr(
+ name=name_scope + 'gamma', regularizer=bn_regularizer),
+ bias_attr=fluid.ParamAttr(
+ name=name_scope + 'beta', regularizer=bn_regularizer)),
'gn')
else:
raise "Unsupport norm type:" + default_norm_type
@@ -143,7 +158,8 @@ def seq_conv(input, channel, stride, filter, dilation=1, act=None):
stride,
groups=input.shape[1],
padding=(filter // 2) * dilation,
- dilation=dilation)
+ dilation=dilation,
+ use_cudnn=depthwise_use_cudnn)
input = bn(input)
if act: input = act(input)
with scope('pointwise'):
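
The new `use_cudnn=depthwise_use_cudnn` flag above disables cuDNN only for the depthwise convolution, where `groups` equals the number of input channels. A hedged, self-contained sketch of the depthwise-separable pattern `seq_conv` implements; the pointwise branch is reconstructed from context, and the project's `conv`/`bn`/`scope` helpers are replaced with raw layers:

```python
# Depthwise-separable convolution: a per-channel (grouped) conv with cuDNN
# disabled, followed by a 1x1 pointwise conv, each followed by batch norm.
import paddle.fluid as fluid

def separable_conv(input, channels, stride, filter_size, dilation=1):
    depthwise = fluid.layers.conv2d(
        input,
        num_filters=input.shape[1],
        filter_size=filter_size,
        stride=stride,
        groups=input.shape[1],                 # one filter per input channel
        padding=(filter_size // 2) * dilation,
        dilation=dilation,
        use_cudnn=False,                       # mirrors depthwise_use_cudnn
        bias_attr=False)
    depthwise = fluid.layers.batch_norm(depthwise)
    pointwise = fluid.layers.conv2d(
        depthwise, num_filters=channels, filter_size=1, bias_attr=False)
    return fluid.layers.batch_norm(pointwise)
```
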
diff --git a/fluid/PaddleCV/deeplabv3+/train.py b/fluid/PaddleCV/deeplabv3+/train.py
old mode 100644
new mode 100755
index fcc038b137349877e06c98a6d533669353bb4b34..e009f76e0e16be9e4a5db532615cefac258fada1
--- a/fluid/PaddleCV/deeplabv3+/train.py
+++ b/fluid/PaddleCV/deeplabv3+/train.py
@@ -13,6 +13,7 @@ import reader
import models
import time
+
def add_argument(name, type, default, help):
parser.add_argument('--' + name, default=default, type=type, help=help)
@@ -32,15 +33,35 @@ def add_arguments():
add_argument('dataset_path', str, None, "Cityscape dataset path.")
add_argument('parallel', bool, False, "using ParallelExecutor.")
add_argument('use_gpu', bool, True, "Whether use GPU or CPU.")
+ add_argument('num_classes', int, 19, "Number of classes.")
+ parser.add_argument(
+ '--enable_ce',
+ action='store_true',
+ help='If set, run the task with continuous evaluation logs.')
def load_model():
+ myvars = [
+ x for x in tp.list_vars()
+ if isinstance(x, fluid.framework.Parameter) and x.name.find('logit') ==
+ -1
+ ]
if args.init_weights_path.endswith('/'):
- fluid.io.load_params(
- exe, dirname=args.init_weights_path, main_program=tp)
+ if args.num_classes == 19:
+ fluid.io.load_params(
+ exe, dirname=args.init_weights_path, main_program=tp)
+ else:
+ fluid.io.load_vars(exe, dirname=args.init_weights_path, vars=myvars)
else:
- fluid.io.load_params(
- exe, dirname="", filename=args.init_weights_path, main_program=tp)
+ if args.num_classes == 19:
+ fluid.io.load_params(
+ exe,
+ dirname="",
+ filename=args.init_weights_path,
+ main_program=tp)
+ else:
+ fluid.io.load_vars(
+ exe, dirname="", filename=args.init_weights_path, vars=myvars)
def save_model():
@@ -70,6 +91,15 @@ def loss(logit, label):
return loss, label_nignore
+def get_cards(args):
+ if args.enable_ce:
+ cards = os.environ.get('CUDA_VISIBLE_DEVICES')
+ num = len(cards.split(","))
+ return num
+ else:
+ return args.num_devices
+
+
CityscapeDataset = reader.CityscapeDataset
parser = argparse.ArgumentParser()
@@ -80,16 +110,24 @@ args = parser.parse_args()
models.clean()
models.bn_momentum = 0.9997
models.dropout_keep_prop = 0.9
+models.label_number = args.num_classes
deeplabv3p = models.deeplabv3p
sp = fluid.Program()
tp = fluid.Program()
+
+# only for ce
+if args.enable_ce:
+ SEED = 102
+ sp.random_seed = SEED
+ tp.random_seed = SEED
+
crop_size = args.train_crop_size
batch_size = args.batch_size
image_shape = [crop_size, crop_size]
reader.default_config['crop_size'] = crop_size
reader.default_config['shuffle'] = True
-num_classes = 19
+num_classes = args.num_classes
weight_decay = 0.00004
base_lr = args.base_lr
@@ -120,7 +158,7 @@ with fluid.program_guard(tp, sp):
retv = opt.minimize(loss_mean, startup_program=sp, no_grad_set=no_grad_set)
fluid.memory_optimize(
- tp, print_log=False, skip_opt_set=[pred.name, loss_mean.name], level=1)
+ tp, print_log=False, skip_opt_set=set([pred.name, loss_mean.name]), level=1)
place = fluid.CPUPlace()
if args.use_gpu:
@@ -140,7 +178,13 @@ if args.parallel:
batches = dataset.get_batch_generator(batch_size, total_step)
+total_time = 0.0
+epoch_idx = 0
+train_loss = 0
+
for i, imgs, labels, names in batches:
+ epoch_idx += 1
+ begin_time = time.time()
prev_start_time = time.time()
if args.parallel:
retv = exe_p.run(fetch_list=[pred.name, loss_mean.name],
@@ -152,11 +196,21 @@ for i, imgs, labels, names in batches:
'label': labels},
fetch_list=[pred, loss_mean])
end_time = time.time()
+ total_time += end_time - begin_time
if i % 100 == 0:
print("Model is saved to", args.save_weights_path)
save_model()
- print("step {:d}, loss: {:.6f}, step_time_cost: {:.3f}" .format(i,
- np.mean(retv[1]), end_time - prev_start_time))
+ print("step {:d}, loss: {:.6f}, step_time_cost: {:.3f}".format(
+ i, np.mean(retv[1]), end_time - prev_start_time))
+
+ # only for ce
+ train_loss = np.mean(retv[1])
+
+if args.enable_ce:
+ gpu_num = get_cards(args)
+ print("kpis\teach_pass_duration_card%s\t%s" %
+ (gpu_num, total_time / epoch_idx))
+ print("kpis\ttrain_loss_card%s\t%s" % (gpu_num, train_loss))
print("Training done. Model is saved to", args.save_weights_path)
save_model()
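
Two conventions recur across the continuous-evaluation (CE) changes in this diff: fixing the `random_seed` of both programs so CE runs are reproducible, and emitting tab-separated `kpis` lines that `_ce.py` parses from stdout. A minimal sketch with dummy values standing in for the real flag and statistics:

```python
# CE conventions: fixed seeds for reproducibility, "kpis" lines for _ce.py.
import paddle.fluid as fluid

enable_ce = True                                  # stands in for --enable_ce
gpu_num, total_time, epoch_idx, train_loss = 1, 12.7, 10, 0.532  # dummy stats

startup_prog = fluid.Program()
train_prog = fluid.Program()
if enable_ce:
    startup_prog.random_seed = 102
    train_prog.random_seed = 102

# ... build the programs and run the training loop here ...

if enable_ce:
    print("kpis\teach_pass_duration_card%s\t%s" % (gpu_num, total_time / epoch_idx))
    print("kpis\ttrain_loss_card%s\t%s" % (gpu_num, train_loss))
```
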
diff --git a/fluid/PaddleCV/face_detection/.run_ce.sh b/fluid/PaddleCV/face_detection/.run_ce.sh
new file mode 100755
index 0000000000000000000000000000000000000000..0b8632516b06b9ea48691a098b7ac25b171decd5
--- /dev/null
+++ b/fluid/PaddleCV/face_detection/.run_ce.sh
@@ -0,0 +1,17 @@
+#!/bin/bash
+
+export MKL_NUM_THREADS=1
+export OMP_NUM_THREADS=1
+
+
+cudaid=${face_detection:=0} # use 0-th card as default
+export CUDA_VISIBLE_DEVICES=$cudaid
+
+FLAGS_benchmark=true python train.py --batch_size=2 --epoc_num=1 --batch_num=200 --parallel=False --enable_ce | python _ce.py
+
+
+cudaid=${face_detection_m:=0,1,2,3} # use 0,1,2,3 card as default
+export CUDA_VISIBLE_DEVICES=$cudaid
+
+FLAGS_benchmark=true python train.py --batch_size=8 --epoc_num=1 --batch_num=200 --parallel=False --enable_ce | python _ce.py
+
diff --git a/fluid/PaddleCV/face_detection/README_cn.md b/fluid/PaddleCV/face_detection/README_cn.md
index 80485009d24e278a00b3d21001602fbe6ef9eef6..f63fbed02ab34520d79b2d2b000e31f5eb22e7f8 100644
--- a/fluid/PaddleCV/face_detection/README_cn.md
+++ b/fluid/PaddleCV/face_detection/README_cn.md
@@ -99,7 +99,7 @@ python -u train.py --batch_size=16 --pretrained_model=vgg_ilsvrc_16_fc_reduced
Data augmentation used for model training:
-**Data augmentation**: the data loading behavior is defined in `reader.py`, and all images are resized to 640x640. During training, images are further augmented with random distortion, flipping, cropping, and so on, similar to the augmentation used in the [SSD object detection algorithm](https://github.com/PaddlePaddle/models/blob/develop/fluid/object_detection/README_cn.md#%E8%AE%AD%E7%BB%83-pascal-voc-%E6%95%B0%E6%8D%AE%E9%9B%86); in addition, the Data-anchor-sampling mentioned above is applied:
+**Data augmentation**: the data loading behavior is defined in `reader.py`, and all images are resized to 640x640. During training, images are further augmented with random distortion, flipping, cropping, and so on, similar to the augmentation used in the [SSD object detection algorithm](https://github.com/PaddlePaddle/models/blob/develop/fluid/PaddleCV/object_detection/README.md); in addition, the Data-anchor-sampling mentioned above is applied:

**Scale transform (Data-anchor-sampling)**: randomly rescales images within a certain range of scales, greatly increasing the scale variation of faces. Concretely, for a randomly selected face with height and width, compute $v=\\sqrt{width * height}$ and determine which interval of the scale anchors $[16,32,64,128,256,512]$ the value $v$ falls into; for example, $v=45$ falls between the anchors 32 and 64.

diff --git a/fluid/PaddleCV/face_detection/reader.py b/fluid/PaddleCV/face_detection/reader.py
--- a/fluid/PaddleCV/face_detection/reader.py
+++ b/fluid/PaddleCV/face_detection/reader.py
bbox_sample = []
- temp_info_box = file_dict[index_image][index_box].split(' ')
+ temp_info_box = item[index_box].split(' ')
xmin = float(temp_info_box[0])
ymin = float(temp_info_box[1])
w = float(temp_info_box[2])
@@ -277,43 +277,25 @@ def train_generator(settings, file_list, batch_size, shuffle=True):
yield batch_out
batch_out = []
+ return reader
-def train(settings,
- file_list,
- batch_size,
- shuffle=True,
- use_multiprocessing=True,
- num_workers=8,
- max_queue=24):
- def reader():
- try:
- enqueuer = GeneratorEnqueuer(
- train_generator(settings, file_list, batch_size, shuffle),
- use_multiprocessing=use_multiprocessing)
- enqueuer.start(max_queue_size=max_queue, workers=num_workers)
- generator_output = None
- while True:
- while enqueuer.is_running():
- if not enqueuer.queue.empty():
- generator_output = enqueuer.queue.get()
- break
- else:
- time.sleep(0.01)
- yield generator_output
- generator_output = None
- finally:
- if enqueuer is not None:
- enqueuer.stop()
- return reader
+def train(settings, file_list, batch_size, shuffle=True, num_workers=8):
+ file_lists = load_file_list(file_list)
+ n = int(math.ceil(len(file_lists) // num_workers))
+ split_lists = [file_lists[i:i + n] for i in range(0, len(file_lists), n)]
+ readers = []
+ for iterm in split_lists:
+ readers.append(train_generator(settings, iterm, batch_size, shuffle))
+ return paddle.reader.multiprocess_reader(readers, False)
def test(settings, file_list):
- file_dict = load_file_list(file_list)
+ file_lists = load_file_list(file_list)
def reader():
- for index_image in file_dict.keys():
- image_name = file_dict[index_image][0]
+ for image in file_lists:
+ image_name = image[0]
image_path = os.path.join(settings.data_dir, image_name)
im = Image.open(image_path)
if im.mode == 'L':
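
The threaded `GeneratorEnqueuer` machinery is gone: `train` now splits the file list into `num_workers` shards, builds one reader per shard, and hands them to `paddle.reader.multiprocess_reader`, which runs each reader in its own process. A generic, self-contained sketch of that sharding pattern, with the ceiling division written out (`use_pipe=False` makes the workers communicate through a multiprocessing queue rather than a pipe):

```python
# Shard a sample list across worker processes with multiprocess_reader.
import math
import paddle

def make_reader(samples):
    def reader():
        for sample in samples:
            yield sample
    return reader

def sharded_reader(samples, num_workers=8):
    shard_size = int(math.ceil(len(samples) / float(num_workers)))
    shards = [samples[i:i + shard_size]
              for i in range(0, len(samples), shard_size)]
    readers = [make_reader(shard) for shard in shards]
    return paddle.reader.multiprocess_reader(readers, False)  # use_pipe=False
```
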
diff --git a/fluid/PaddleCV/face_detection/train.py b/fluid/PaddleCV/face_detection/train.py
index 67cec03b95ba5ffe1a5230c287bd12a49b90bb34..2108bcc32a378bbb0803032108ddafea4161e202 100644
--- a/fluid/PaddleCV/face_detection/train.py
+++ b/fluid/PaddleCV/face_detection/train.py
@@ -32,6 +32,9 @@ add_arg('mean_BGR', str, '104., 117., 123.', "Mean value for B,G,R cha
add_arg('with_mem_opt', bool, True, "Whether to use memory optimization or not.")
add_arg('pretrained_model', str, './vgg_ilsvrc_16_fc_reduced/', "The init model path.")
add_arg('data_dir', str, 'data', "The base dir of dataset")
+parser.add_argument('--enable_ce', action='store_true', help='If set, run the task with continuous evaluation logs.')
+parser.add_argument('--batch_num', type=int, help="batch num for ce")
+parser.add_argument('--num_devices', type=int, default=1, help='Number of GPU devices')
#yapf: enable
train_parameters = {
@@ -119,6 +122,16 @@ def train(args, config, train_params, train_file_list):
startup_prog = fluid.Program()
train_prog = fluid.Program()
+ #only for ce
+ if args.enable_ce:
+ SEED = 102
+ startup_prog.random_seed = SEED
+ train_prog.random_seed = SEED
+ num_workers = 1
+ pretrained_model = ""
+ if args.batch_num != None:
+ iters_per_epoc = args.batch_num
+
train_py_reader, fetches, loss = build_program(
train_params = train_params,
main_prog = train_prog,
@@ -150,9 +163,7 @@ def train(args, config, train_params, train_file_list):
train_file_list,
batch_size_per_device,
shuffle = is_shuffle,
- use_multiprocessing=True,
- num_workers = num_workers,
- max_queue=24)
+ num_workers = num_workers)
train_py_reader.decorate_paddle_reader(train_reader)
if args.parallel:
@@ -169,42 +180,69 @@ def train(args, config, train_params, train_file_list):
print('save models to %s' % (model_path))
fluid.io.save_persistables(exe, model_path, main_program=program)
- train_py_reader.start()
- try:
- for pass_id in range(start_epoc, epoc_num):
- start_time = time.time()
- prev_start_time = start_time
- end_time = 0
- batch_id = 0
- for batch_id in range(iters_per_epoc):
+ total_time = 0.0
+ epoch_idx = 0
+ face_loss = 0
+ head_loss = 0
+ for pass_id in range(start_epoc, epoc_num):
+ epoch_idx += 1
+ start_time = time.time()
+ prev_start_time = start_time
+ end_time = 0
+ batch_id = 0
+ train_py_reader.start()
+ while True:
+ try:
prev_start_time = start_time
start_time = time.time()
if args.parallel:
fetch_vars = train_exe.run(fetch_list=
[v.name for v in fetches])
else:
- fetch_vars = exe.run(train_prog,
- fetch_list=fetches)
+ fetch_vars = exe.run(train_prog, fetch_list=fetches)
end_time = time.time()
fetch_vars = [np.mean(np.array(v)) for v in fetch_vars]
+ face_loss = fetch_vars[0]
+ head_loss = fetch_vars[1]
if batch_id % 10 == 0:
if not args.use_pyramidbox:
print("Pass {:d}, batch {:d}, loss {:.6f}, time {:.5f}".format(
- pass_id, batch_id, fetch_vars[0],
+ pass_id, batch_id, face_loss,
start_time - prev_start_time))
else:
print("Pass {:d}, batch {:d}, face loss {:.6f}, " \
"head loss {:.6f}, " \
"time {:.5f}".format(pass_id,
- batch_id, fetch_vars[0], fetch_vars[1],
+ batch_id, face_loss, head_loss,
start_time - prev_start_time))
- if pass_id % 1 == 0 or pass_id == epoc_num - 1:
- save_model(str(pass_id), train_prog)
- except fluid.core.EOFException:
- train_py_reader.reset()
- except StopIteration:
- train_py_reader.reset()
- train_py_reader.reset()
+ batch_id += 1
+ except (fluid.core.EOFException, StopIteration):
+ train_py_reader.reset()
+ break
+ epoch_end_time = time.time()
+ total_time += epoch_end_time - start_time
+ save_model(str(pass_id), train_prog)
+
+ # only for ce
+ if args.enable_ce:
+ gpu_num = get_cards(args)
+ print("kpis\teach_pass_duration_card%s\t%s" %
+ (gpu_num, total_time / epoch_idx))
+ print("kpis\ttrain_face_loss_card%s\t%s" %
+ (gpu_num, face_loss))
+ print("kpis\ttrain_head_loss_card%s\t%s" %
+ (gpu_num, head_loss))
+
+
+
+def get_cards(args):
+ if args.enable_ce:
+ cards = os.environ.get('CUDA_VISIBLE_DEVICES')
+ num = len(cards.split(","))
+ return num
+ else:
+ return args.num_devices
+
if __name__ == '__main__':
args = parser.parse_args()
diff --git a/fluid/PaddleCV/faster_rcnn/.run_ce.sh b/fluid/PaddleCV/faster_rcnn/.run_ce.sh
new file mode 100755
index 0000000000000000000000000000000000000000..af2f21da73fa14589258b565a74f78e25dbd4e84
--- /dev/null
+++ b/fluid/PaddleCV/faster_rcnn/.run_ce.sh
@@ -0,0 +1,17 @@
+#!/bin/bash
+
+export MKL_NUM_THREADS=1
+export OMP_NUM_THREADS=1
+
+
+cudaid=${face_detection:=0} # use 0-th card as default
+export CUDA_VISIBLE_DEVICES=$cudaid
+
+FLAGS_benchmark=true python train.py --model_save_dir=output/ --data_dir=dataset/coco/ --max_iter=10 --enable_ce --pretrained_model=./imagenet_resnet50_fusebn | python _ce.py
+
+
+cudaid=${face_detection_m:=0,1,2,3} # use 0,1,2,3 card as default
+export CUDA_VISIBLE_DEVICES=$cudaid
+
+FLAGS_benchmark=true python train.py --model_save_dir=output/ --data_dir=dataset/coco/ --max_iter=10 --enable_ce --pretrained_model=./imagenet_resnet50_fusebn | python _ce.py
+
diff --git a/fluid/PaddleCV/faster_rcnn/README.md b/fluid/PaddleCV/faster_rcnn/README.md
index 0c5c15245c6aa03c5861853946379093701c7475..0a5f68c34adda54ba0e27f44f16c18cafe057830 100644
--- a/fluid/PaddleCV/faster_rcnn/README.md
+++ b/fluid/PaddleCV/faster_rcnn/README.md
@@ -38,18 +38,6 @@ Train the model on [MS-COCO dataset](http://cocodataset.org/#download), download
## Training
-After data preparation, one can start the training step by:
-
- python train.py \
- --model_save_dir=output/ \
- --pretrained_model=${path_to_pretrain_model}
- --data_dir=${path_to_data}
-
-- Set ```export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7``` to specify 8 GPUs to train.
-- For more help on arguments:
-
- python train.py --help
-
**download the pre-trained model:** This sample provides a Resnet-50 pre-trained model converted from Caffe, with the parameters in the batch normalization layers fused. One can download the pre-trained model as:
sh ./pretrained/download.sh
@@ -72,6 +60,18 @@ To train the model, [cocoapi](https://github.com/cocodataset/cocoapi) is needed.
# not to install the COCO API into global site-packages
python2 setup.py install --user
+After data preparation, one can start the training step by:
+
+ python train.py \
+ --model_save_dir=output/ \
+ --pretrained_model=${path_to_pretrain_model}
+ --data_dir=${path_to_data}
+
+- Set ```export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7``` to specify 8 GPUs to train.
+- For more help on arguments:
+
+ python train.py --help
+
**data reader introduction:**
* Data reader is defined in `reader.py`.
@@ -128,7 +128,7 @@ Inference is used to get prediction score or image features based on trained mod
python infer.py \
--dataset=coco2017 \
--pretrained_model=${path_to_pretrain_model} \
- --image_path=data/COCO17/val2017/ \
+ --image_path=dataset/coco/val2017/ \
--image_name=000000000139.jpg \
--draw_threshold=0.6
diff --git a/fluid/PaddleCV/faster_rcnn/README_cn.md b/fluid/PaddleCV/faster_rcnn/README_cn.md
index a0f9fac8e9ea60bb7b2a34fe032554aae8fbcf7b..29adfcfd274b82f2ddaba1894be6ad1c7ece1e6a 100644
--- a/fluid/PaddleCV/faster_rcnn/README_cn.md
+++ b/fluid/PaddleCV/faster_rcnn/README_cn.md
@@ -37,18 +37,6 @@ Faster RCNN 目标检测模型
## Model Training
-After the data is prepared, training can be started as follows:
-
- python train.py \
- --model_save_dir=output/ \
- --pretrained_model=${path_to_pretrain_model}
- --data_dir=${path_to_data}
-
-- Set export CUDA\_VISIBLE\_DEVICES=0,1,2,3,4,5,6,7 to train on 8 GPUs.
-- For the list of optional arguments:
-
- python train.py --help
-
**Download the pre-trained model:** This example provides a ResNet-50 pre-trained model converted from Caffe, with the parameters of the batch normalization layers fused. Download the pre-trained model with:
sh ./pretrained/download.sh
@@ -71,6 +59,18 @@ Faster RCNN 目标检测模型
# not to install the COCO API into global site-packages
python2 setup.py install --user
+After the data is prepared, training can be started as follows:
+
+ python train.py \
+ --model_save_dir=output/ \
+ --pretrained_model=${path_to_pretrain_model}
+ --data_dir=${path_to_data}
+
+- Set export CUDA\_VISIBLE\_DEVICES=0,1,2,3,4,5,6,7 to train on 8 GPUs.
+- For the list of optional arguments:
+
+ python train.py --help
+
**Data reader:** The data reader is defined in reader.py. The short side of every image is proportionally scaled to `scales`; if the long side then exceeds `max_size`, the long side is proportionally scaled down to `max_size`. Horizontal flipping is applied during training. Padding the images within one batch to the same size is supported.
**Model configuration:**
@@ -124,7 +124,7 @@ Faster RCNN 目标检测模型
python infer.py \
--dataset=coco2017 \
--pretrained_model=${path_to_pretrain_model} \
- --image_path=data/COCO17/val2017/ \
+ --image_path=dataset/coco/val2017/ \
--image_name=000000000139.jpg \
--draw_threshold=0.6
diff --git a/fluid/PaddleCV/faster_rcnn/__init__.py b/fluid/PaddleCV/faster_rcnn/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/fluid/PaddleCV/faster_rcnn/_ce.py b/fluid/PaddleCV/faster_rcnn/_ce.py
new file mode 100644
index 0000000000000000000000000000000000000000..9d5850fd22c3d023eb866fa474b6f6f586ca326e
--- /dev/null
+++ b/fluid/PaddleCV/faster_rcnn/_ce.py
@@ -0,0 +1,61 @@
+# this file is only used for continuous evaluation test!
+
+import os
+import sys
+sys.path.append(os.environ['ceroot'])
+from kpi import CostKpi
+from kpi import DurationKpi
+
+
+each_pass_duration_card1_kpi = DurationKpi('each_pass_duration_card1', 0.08, 0, actived=True)
+train_loss_card1_kpi = CostKpi('train_loss_card1', 0.08, 0)
+each_pass_duration_card4_kpi = DurationKpi('each_pass_duration_card4', 0.08, 0, actived=True)
+train_loss_card4_kpi = CostKpi('train_loss_card4', 0.08, 0)
+
+tracking_kpis = [
+ each_pass_duration_card1_kpi,
+ train_loss_card1_kpi,
+ each_pass_duration_card4_kpi,
+ train_loss_card4_kpi,
+ ]
+
+
+def parse_log(log):
+ '''
+ This method should be implemented by model developers.
+
+ The suggestion:
+
+ each line in the log should be key, value, for example:
+
+ "
+ train_cost\t1.0
+ test_cost\t1.0
+ train_cost\t1.0
+ train_cost\t1.0
+ train_acc\t1.2
+ "
+ '''
+ for line in log.split('\n'):
+ fs = line.strip().split('\t')
+ print(fs)
+ if len(fs) == 3 and fs[0] == 'kpis':
+ kpi_name = fs[1]
+ kpi_value = float(fs[2])
+ yield kpi_name, kpi_value
+
+
+def log_to_ce(log):
+ kpi_tracker = {}
+ for kpi in tracking_kpis:
+ kpi_tracker[kpi.name] = kpi
+
+ for (kpi_name, kpi_value) in parse_log(log):
+ print(kpi_name, kpi_value)
+ kpi_tracker[kpi_name].add_record(kpi_value)
+ kpi_tracker[kpi_name].persist()
+
+
+if __name__ == '__main__':
+ log = sys.stdin.read()
+ log_to_ce(log)
diff --git a/fluid/PaddleCV/faster_rcnn/data_utils.py b/fluid/PaddleCV/faster_rcnn/data_utils.py
index 12858f1b1037ed0f8a0e47d7f1b0c2490c767623..4d63b62646f199cfae6d151781aed40786fcad0d 100644
--- a/fluid/PaddleCV/faster_rcnn/data_utils.py
+++ b/fluid/PaddleCV/faster_rcnn/data_utils.py
@@ -28,6 +28,7 @@ from __future__ import unicode_literals
import cv2
import numpy as np
from config import cfg
+import os
def get_image_blob(roidb, mode):
@@ -43,8 +44,11 @@ def get_image_blob(roidb, mode):
target_size = cfg.TEST.scales[0]
max_size = cfg.TEST.max_size
im = cv2.imread(roidb['image'])
- assert im is not None, \
- 'Failed to read image \'{}\''.format(roidb['image'])
+ try:
+ assert im is not None
+ except AssertionError as e:
+ print('Failed to read image \'{}\''.format(roidb['image']))
+ os._exit(0)
if roidb['flipped']:
im = im[:, ::-1, :]
im, im_scale = prep_im_for_blob(im, cfg.pixel_means, target_size, max_size)
diff --git a/fluid/PaddleCV/faster_rcnn/train.py b/fluid/PaddleCV/faster_rcnn/train.py
index 1b18f85d8b18c74809bac5a8c7c2b4b0d5e0e232..b840d2855c09e1df91601d30df1503a6003aeef5 100644
--- a/fluid/PaddleCV/faster_rcnn/train.py
+++ b/fluid/PaddleCV/faster_rcnn/train.py
@@ -35,7 +35,7 @@ def train():
learning_rate = cfg.learning_rate
image_shape = [3, cfg.TRAIN.max_size, cfg.TRAIN.max_size]
- if cfg.debug:
+ if cfg.debug or cfg.enable_ce:
fluid.default_startup_program().random_seed = 1000
fluid.default_main_program().random_seed = 1000
import random
@@ -46,11 +46,14 @@ def train():
devices_num = len(devices.split(","))
total_batch_size = devices_num * cfg.TRAIN.im_per_batch
+ use_random = True
+ if cfg.enable_ce:
+ use_random = False
model = model_builder.FasterRCNN(
add_conv_body_func=resnet.add_ResNet50_conv4_body,
add_roi_box_head_func=resnet.add_ResNet_roi_conv5_head,
use_pyreader=cfg.use_pyreader,
- use_random=True)
+ use_random=use_random)
model.build_model(image_shape)
loss_cls, loss_bbox, rpn_cls_loss, rpn_reg_loss = model.loss()
loss_cls.persistable = True
@@ -92,16 +95,19 @@ def train():
train_exe = fluid.ParallelExecutor(
use_cuda=bool(cfg.use_gpu), loss_name=loss.name)
+ shuffle = True
+ if cfg.enable_ce:
+ shuffle = False
if cfg.use_pyreader:
train_reader = reader.train(
batch_size=cfg.TRAIN.im_per_batch,
total_batch_size=total_batch_size,
padding_total=cfg.TRAIN.padding_minibatch,
- shuffle=True)
+ shuffle=shuffle)
py_reader = model.py_reader
py_reader.decorate_paddle_reader(train_reader)
else:
- train_reader = reader.train(batch_size=total_batch_size, shuffle=True)
+ train_reader = reader.train(batch_size=total_batch_size, shuffle=shuffle)
feeder = fluid.DataFeeder(place=place, feed_list=model.feeds())
def save_model(postfix):
@@ -118,6 +124,8 @@ def train():
try:
start_time = time.time()
prev_start_time = start_time
+ total_time = 0
+ last_loss = 0
every_pass_loss = []
for iter_id in range(cfg.max_iter):
prev_start_time = start_time
@@ -131,9 +139,23 @@ def train():
iter_id, lr[0],
smoothed_loss.get_median_value(
), start_time - prev_start_time))
+ end_time = time.time()
+ total_time += end_time - start_time
+ last_loss = np.mean(np.array(losses[0]))
+
sys.stdout.flush()
if (iter_id + 1) % cfg.TRAIN.snapshot_iter == 0:
save_model("model_iter{}".format(iter_id))
+ # only for ce
+ if cfg.enable_ce:
+ gpu_num = devices_num
+ epoch_idx = iter_id + 1
+ loss = last_loss
+ print("kpis\teach_pass_duration_card%s\t%s" %
+ (gpu_num, total_time / epoch_idx))
+ print("kpis\ttrain_loss_card%s\t%s" %
+ (gpu_num, loss))
+
except fluid.core.EOFException:
py_reader.reset()
return np.mean(every_pass_loss)
@@ -142,6 +164,8 @@ def train():
start_time = time.time()
prev_start_time = start_time
start = start_time
+ total_time = 0
+ last_loss = 0
every_pass_loss = []
smoothed_loss = SmoothedValue(cfg.log_window)
for iter_id, data in enumerate(train_reader()):
@@ -154,6 +178,9 @@ def train():
smoothed_loss.add_value(loss_v)
lr = np.array(fluid.global_scope().find_var('learning_rate')
.get_tensor())
+ end_time = time.time()
+ total_time += end_time - start_time
+ last_loss = loss_v
print("Iter {:d}, lr {:.6f}, loss {:.6f}, time {:.5f}".format(
iter_id, lr[0],
smoothed_loss.get_median_value(), start_time - prev_start_time))
@@ -162,6 +189,16 @@ def train():
save_model("model_iter{}".format(iter_id))
if (iter_id + 1) == cfg.max_iter:
break
+ # only for ce
+ if cfg.enable_ce:
+ gpu_num = devices_num
+ epoch_idx = iter_id + 1
+ loss = last_loss
+ print("kpis\teach_pass_duration_card%s\t%s" %
+ (gpu_num, total_time / epoch_idx))
+ print("kpis\ttrain_loss_card%s\t%s" %
+ (gpu_num, loss))
+
return np.mean(every_pass_loss)
if cfg.use_pyreader:
diff --git a/fluid/PaddleCV/faster_rcnn/utility.py b/fluid/PaddleCV/faster_rcnn/utility.py
index 12a208823482a6904e4f0ee0dcae84fa38f7cf37..f428de4c17ac9a6bd1600f52267d6718426adc78 100644
--- a/fluid/PaddleCV/faster_rcnn/utility.py
+++ b/fluid/PaddleCV/faster_rcnn/utility.py
@@ -98,7 +98,7 @@ def parse_args():
add_arg('pretrained_model', str, 'imagenet_resnet50_fusebn', "The init model path.")
add_arg('dataset', str, 'coco2017', "coco2014, coco2017.")
add_arg('class_num', int, 81, "Class number.")
- add_arg('data_dir', str, 'data/COCO17', "The data root path.")
+ add_arg('data_dir', str, 'dataset/coco', "The data root path.")
add_arg('use_pyreader', bool, True, "Use pyreader.")
add_arg('use_profile', bool, False, "Whether use profiler.")
add_arg('padding_minibatch',bool, False,
@@ -127,8 +127,11 @@ def parse_args():
add_arg('debug', bool, False, "Debug mode")
# SINGLE EVAL AND DRAW
add_arg('draw_threshold', float, 0.8, "Confidence threshold to draw bbox.")
- add_arg('image_path', str, 'data/COCO17/val2017', "The image path used to inference and visualize.")
+ add_arg('image_path', str, 'dataset/coco/val2017', "The image path used to inference and visualize.")
add_arg('image_name', str, '', "The single image used to inference and visualize.")
+ # ce
+ parser.add_argument(
+ '--enable_ce', action='store_true', help='If set, run the task with continuous evaluation logs.')
# yapf: enable
args = parser.parse_args()
file_name = sys.argv[0]
diff --git a/fluid/PaddleCV/gan/c_gan/.run_ce.sh b/fluid/PaddleCV/gan/c_gan/.run_ce.sh
index 7dee419d90a9719f6c9790f0ffc0b50c69870815..eb43acc363e66e10f6ae4052e426dfa25b2d3e8f 100755
--- a/fluid/PaddleCV/gan/c_gan/.run_ce.sh
+++ b/fluid/PaddleCV/gan/c_gan/.run_ce.sh
@@ -3,7 +3,7 @@
# This file is only used for continuous evaluation.
export FLAGS_cudnn_deterministic=True
export ce_mode=1
-(CUDA_VISIBLE_DEVICES=6 python c_gan.py --batch_size=121 --epoch=1 --run_ce=True --use_gpu=True & \
-CUDA_VISIBLE_DEVICES=7 python dc_gan.py --batch_size=121 --epoch=1 --run_ce=True --use_gpu=True) | python _ce.py
+(CUDA_VISIBLE_DEVICES=2 python c_gan.py --batch_size=121 --epoch=1 --run_ce=True --use_gpu=True & \
+CUDA_VISIBLE_DEVICES=3 python dc_gan.py --batch_size=121 --epoch=1 --run_ce=True --use_gpu=True) | python _ce.py
diff --git a/fluid/PaddleCV/gan/c_gan/c_gan.py b/fluid/PaddleCV/gan/c_gan/c_gan.py
index 18c6e5df232d5077126001b0fe17ca098c8e6c4b..ebf5f87fda33375e045022aca860e81b752ffaf5 100644
--- a/fluid/PaddleCV/gan/c_gan/c_gan.py
+++ b/fluid/PaddleCV/gan/c_gan/c_gan.py
@@ -165,7 +165,8 @@ def train(args):
'conditions': conditions_data},
fetch_list={dg_loss})[0][0]
losses[1].append(dg_loss_n)
- t_time += (time.time() - s_time)
+ batch_time = time.time() - s_time
+ t_time += batch_time
@@ -180,8 +181,9 @@ def train(args):
fetch_list={g_img})[0]
total_images = np.concatenate([real_image, generated_images])
fig = plot(total_images)
- msg = "Epoch ID={0}\n Batch ID={1}\n D-Loss={2}\n DG-Loss={3}\n gen={4}".format(
- pass_id, batch_id, d_loss_n, dg_loss_n, check(generated_images))
+ msg = "Epoch ID={0}\n Batch ID={1}\n D-Loss={2}\n DG-Loss={3}\n gen={4}\n " \
+ "Batch_time_cost={5:.2f}".format(
+ pass_id, batch_id, d_loss_n, dg_loss_n, check(generated_images), batch_time)
print(msg)
plt.title(msg)
plt.savefig(
diff --git a/fluid/PaddleCV/gan/cycle_gan/train.py b/fluid/PaddleCV/gan/cycle_gan/train.py
index 1cc2fa090b3c35d61071f7ce1b7caedbd18226f9..ea7887570f8e4063bb036d67ecf58c45902fd3f2 100644
--- a/fluid/PaddleCV/gan/cycle_gan/train.py
+++ b/fluid/PaddleCV/gan/cycle_gan/train.py
@@ -187,10 +187,12 @@ def train(args):
fetch_list=[d_A_trainer.d_loss_A],
feed={"input_A": tensor_A,
"fake_pool_A": fake_pool_A})[0]
- t_time += (time.time() - s_time)
- print("epoch{}; batch{}; g_A_loss: {}; d_B_loss: {}; g_B_loss: {}; d_A_loss: {};".format(
+ batch_time = time.time() - s_time
+ t_time += batch_time
+ print("epoch{}; batch{}; g_A_loss: {}; d_B_loss: {}; g_B_loss: {}; d_A_loss: {}; "
+ "Batch_time_cost: {:.2f}".format(
epoch, batch_id, g_A_loss[0], d_B_loss[0], g_B_loss[0],
- d_A_loss[0]))
+ d_A_loss[0], batch_time))
losses[0].append(g_A_loss[0])
losses[1].append(d_A_loss[0])
sys.stdout.flush()
diff --git a/fluid/PaddleCV/image_classification/.run_ce.sh b/fluid/PaddleCV/image_classification/.run_ce.sh
index 9ba9a4c2c6779694f0e87e12ca85b59afa33f1c0..cc0d894a634bc0add12fd83840990eacf77382cc 100755
--- a/fluid/PaddleCV/image_classification/.run_ce.sh
+++ b/fluid/PaddleCV/image_classification/.run_ce.sh
@@ -7,6 +7,7 @@ cudaid=${object_detection_cudaid:=0}
export CUDA_VISIBLE_DEVICES=$cudaid
python train.py --batch_size=${BATCH_SIZE} --num_epochs=5 --enable_ce=True --lr_strategy=cosine_decay | python _ce.py
+BATCH_SIZE=224
cudaid=${object_detection_cudaid_m:=0, 1, 2, 3}
export CUDA_VISIBLE_DEVICES=$cudaid
python train.py --batch_size=${BATCH_SIZE} --num_epochs=5 --enable_ce=True --lr_strategy=cosine_decay | python _ce.py
diff --git a/fluid/PaddleCV/image_classification/README.md b/fluid/PaddleCV/image_classification/README.md
index 1de071c0de37d4f265a8dabab2e97b5cda3ecc4c..57dc26005334eff06528dcb22a99c17659a61d2c 100644
--- a/fluid/PaddleCV/image_classification/README.md
+++ b/fluid/PaddleCV/image_classification/README.md
@@ -6,6 +6,7 @@ Image classification, which is an important field of computer vision, is to clas
- [Installation](#installation)
- [Data preparation](#data-preparation)
- [Training a model with flexible parameters](#training-a-model)
+- [Using Mixed-Precision Training](#using-mixed-precision-training)
- [Finetuning](#finetuning)
- [Evaluation](#evaluation)
- [Inference](#inference)
@@ -112,6 +113,13 @@ The error rate curves of AlexNet, ResNet50 and SE-ResNeXt-50 are shown in the fi
Training and validation Curves
+
+## Using Mixed-Precision Training
+
+You may add `--fp16 1` to enable mixed-precision training, in which training uses float16 while the output model (the "master" parameters) is saved as float32. You may also need to pass `--scale_loss` to overcome accuracy issues; usually `--scale_loss 8.0` will do.
+
+Note that `--fp16` currently cannot be used together with `--with_mem_opt`, so pass `--with_mem_opt 0` to disable the memory optimization pass.
+
## Finetuning
Finetuning is to finetune model weights in a specific task by loading pretrained weights. After initializing ```path_to_pretrain_model```, one can finetune a model as:
@@ -196,10 +204,19 @@ Models are trained by starting with learning rate ```0.1``` and decaying it by `
|model | top-1/top-5 accuracy(PIL)| top-1/top-5 accuracy(CV2) |
|- |:-: |:-:|
|[AlexNet](http://paddle-imagenet-models-name.bj.bcebos.com/AlexNet_pretrained.zip) | 56.71%/79.18% | 55.88%/78.65% |
-|[VGG11](http://paddle-imagenet-models-name.bj.bcebos.com/VGG11_pretained.zip) | 68.92%/88.66% | 68.61%/88.60% |
+|[VGG11](https://paddle-imagenet-models-name.bj.bcebos.com/VGG11_pretrained.zip) | 69.22%/89.09% | 69.01%/88.90% |
+|[VGG13](https://paddle-imagenet-models-name.bj.bcebos.com/VGG13_pretrained.zip) | 70.14%/89.48% | 69.83%/89.13% |
+|[VGG16](https://paddle-imagenet-models-name.bj.bcebos.com/VGG16_pretrained.zip) | 72.08%/90.63% | 71.65%/90.57% |
+|[VGG19](https://paddle-imagenet-models-name.bj.bcebos.com/VGG19_pretrained.zip) | 72.56%/90.83% | 72.32%/90.98% |
|[MobileNetV1](http://paddle-imagenet-models-name.bj.bcebos.com/MobileNetV1_pretrained.zip) | 70.91%/89.54% | 70.51%/89.35% |
|[ResNet50](http://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_pretrained.zip) | 76.35%/92.80% | 76.22%/92.92% |
|[ResNet101](http://paddle-imagenet-models-name.bj.bcebos.com/ResNet101_pretrained.zip) | 77.49%/93.57% | 77.56%/93.64% |
+|[ResNet152](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet152_pretrained.zip) | 78.12%/93.93% | 77.92%/93.87% |
+|[SE_ResNeXt50_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/SE_ResNext50_32x4d_pretrained.zip) | 78.50%/94.01% | 78.44%/93.96% |
+|[SE_ResNeXt101_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/SE_ResNeXt101_32x4d_pretrained.zip) | 79.26%/94.22% | 79.12%/94.20% |
+
+
+
- Released models: not specify parameter names
diff --git a/fluid/PaddleCV/image_classification/README_cn.md b/fluid/PaddleCV/image_classification/README_cn.md
index 3fc3b934e9b3bf1088409b270e2756101c539f18..7fc35a643e95dae8c2197a96e1fab44b60e458a4 100644
--- a/fluid/PaddleCV/image_classification/README_cn.md
+++ b/fluid/PaddleCV/image_classification/README_cn.md
@@ -109,6 +109,11 @@ End pass 9, train_loss 3.3745200634, train_acc1 0.303871691227, train_acc5 0.545
Error rate curves on the training and validation sets
+## Mixed-Precision Training
+
+Mixed-precision training can be enabled by passing `--fp16 1`; training then uses float16 data while the model parameters are output as float32 (the "master" parameters). You may also need to pass `--scale_loss` to address fp16 accuracy issues; usually `--scale_loss 8.0` is enough.
+
+Note that mixed-precision training currently cannot be used together with the memory optimization feature, so pass `--with_mem_opt 0` to disable memory optimization.
## Finetuning
@@ -194,10 +199,16 @@ Models包括两种模型:带有参数名字的模型,和不带有参数名
|model | top-1/top-5 accuracy(PIL)| top-1/top-5 accuracy(CV2) |
|- |:-: |:-:|
|[AlexNet](http://paddle-imagenet-models-name.bj.bcebos.com/AlexNet_pretrained.zip) | 56.71%/79.18% | 55.88%/78.65% |
-|[VGG11](http://paddle-imagenet-models-name.bj.bcebos.com/VGG11_pretained.zip) | 68.92%/88.66% | 68.61%/88.60% |
+|[VGG11](https://paddle-imagenet-models-name.bj.bcebos.com/VGG11_pretrained.zip) | 69.22%/89.09% | 69.01%/88.90% |
+|[VGG13](https://paddle-imagenet-models-name.bj.bcebos.com/VGG13_pretrained.zip) | 70.14%/89.48% | 69.83%/89.13% |
+|[VGG16](https://paddle-imagenet-models-name.bj.bcebos.com/VGG16_pretrained.zip) | 72.08%/90.63% | 71.65%/90.57% |
+|[VGG19](https://paddle-imagenet-models-name.bj.bcebos.com/VGG19_pretrained.zip) | 72.56%/90.83% | 72.32%/90.98% |
|[MobileNetV1](http://paddle-imagenet-models-name.bj.bcebos.com/MobileNetV1_pretrained.zip) | 70.91%/89.54% | 70.51%/89.35% |
|[ResNet50](http://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_pretrained.zip) | 76.35%/92.80% | 76.22%/92.92% |
|[ResNet101](http://paddle-imagenet-models-name.bj.bcebos.com/ResNet101_pretrained.zip) | 77.49%/93.57% | 77.56%/93.64% |
+|[ResNet152](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet152_pretrained.zip) | 78.12%/93.93% | 77.92%/93.87% |
+|[SE_ResNeXt50_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/SE_ResNext50_32x4d_pretrained.zip) | 78.50%/94.01% | 78.44%/93.96% |
+|[SE_ResNeXt101_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/SE_ResNeXt101_32x4d_pretrained.zip) | 79.26%/94.22% | 79.12%/94.20% |
- Released models: not specify parameter names
diff --git a/fluid/PaddleCV/image_classification/dist_train/README.md b/fluid/PaddleCV/image_classification/dist_train/README.md
index a595a540adfa770253909e432e99a27228d5f062..0b2729cce4fa2e0780b8db5f87da49a8e221c665 100644
--- a/fluid/PaddleCV/image_classification/dist_train/README.md
+++ b/fluid/PaddleCV/image_classification/dist_train/README.md
@@ -7,13 +7,15 @@ large-scaled distributed training with two distributed mode: parameter server mo
Before getting started, please make sure you have gone through the imagenet [Data Preparation](../README.md#data-preparation).
-1. The entrypoint file is `dist_train.py`, some important flags are as follows:
+1. The entrypoint file is `dist_train.py`; the command-line arguments are almost the same as for the original `train.py`, with the following arguments specific to distributed training.
- - `model`, the model to run with, default is the fine tune model `DistResnet`.
- - `batch_size`, the batch_size per device.
- `update_method`, specify the update method, can choose from local, pserver or nccl2.
- - `device`, use CPU or GPU device.
- - `gpus`, the GPU device count that the process used.
+ - `multi_batch_repeat`, set this greater than 1 to merge batches before pushing gradients to pservers.
+ - `start_test_pass`, when to start running tests.
+ - `num_threads`, how many threads will be used for ParallelExecutor.
+ - `split_var`, in pserver mode, whether to split one parameter to several pservers, default True.
+ - `async_mode`, do async training, default False.
+ - `reduce_strategy`, choose from "reduce", "allreduce".
You can check out more details of these flags with `python dist_train.py --help`.
@@ -21,66 +23,27 @@ Before getting started, please make sure you have go throught the imagenet [Data
We use the environment variable to distinguish the different training role of a distributed training job.
- - `PADDLE_TRAINING_ROLE`, the current training role, should be in [PSERVER, TRAINER].
- - `PADDLE_TRAINERS`, the trainer count of a job.
- - `PADDLE_CURRENT_IP`, the current instance IP.
- - `PADDLE_PSERVER_IPS`, the parameter server IP list, separated by "," only be used with update_method is pserver.
- - `PADDLE_TRAINER_ID`, the unique trainer ID of a job, the ranging is [0, PADDLE_TRAINERS).
- - `PADDLE_PSERVER_PORT`, the port of the parameter pserver listened on.
- - `PADDLE_TRAINER_IPS`, the trainer IP list, separated by ",", only be used with upadte_method is nccl2.
-
-### Parameter Server Mode
-
-In this example, we launched 4 parameter server instances and 4 trainer instances in the cluster:
-
-1. launch parameter server process
-
- ``` bash
- PADDLE_TRAINING_ROLE=PSERVER \
- PADDLE_TRAINERS=4 \
- PADDLE_PSERVER_IPS=192.168.0.100,192.168.0.101,192.168.0.102,192.168.0.103 \
- PADDLE_CURRENT_IP=192.168.0.100 \
- PADDLE_PSERVER_PORT=7164 \
- python dist_train.py \
- --model=DistResnet \
- --batch_size=32 \
- --update_method=pserver \
- --device=CPU \
- --data_dir=../data/ILSVRC2012
- ```
-
-1. launch trainer process
-
- ``` bash
- PADDLE_TRAINING_ROLE=TRAINER \
- PADDLE_TRAINERS=4 \
- PADDLE_PSERVER_IPS=192.168.0.100,192.168.0.101,192.168.0.102,192.168.0.103 \
- PADDLE_TRAINER_ID=0 \
- PADDLE_PSERVER_PORT=7164 \
- python dist_train.py \
- --model=DistResnet \
- --batch_size=32 \
- --update_method=pserver \
- --device=GPU \
- --data_dir=../data/ILSVRC2012
- ```
-
-### NCCL2 Collective Mode
-
-1. launch trainer process
-
- ``` bash
- PADDLE_TRAINING_ROLE=TRAINER \
- PADDLE_TRAINERS=4 \
- PADDLE_TRAINER_IPS=192.168.0.100,192.168.0.101,192.168.0.102,192.168.0.103 \
- PADDLE_TRAINER_ID=0 \
- python dist_train.py \
- --model=DistResnet \
- --batch_size=32 \
- --update_method=nccl2 \
- --device=GPU \
- --data_dir=../data/ILSVRC2012
- ```
+ - General envs:
+    - `PADDLE_TRAINER_ID`, the unique trainer ID of a job, in the range [0, PADDLE_TRAINERS_NUM).
+ - `PADDLE_TRAINERS_NUM`, the trainer count of a distributed job.
+    - `PADDLE_CURRENT_ENDPOINT`, the endpoint of the current process.
+ - Pserver mode:
+ - `PADDLE_TRAINING_ROLE`, the current training role, should be in [PSERVER, TRAINER].
+ - `PADDLE_PSERVER_ENDPOINTS`, the parameter server endpoint list, separated by ",".
+ - NCCL2 mode:
+ - `PADDLE_TRAINER_ENDPOINTS`, endpoint list for each worker, separated by ",".
+
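+To make these variables concrete, the following sketch is a rough Python
+equivalent of the `run_nccl2_mode.sh` script described below: it only
+illustrates how the variables above compose into a two-trainer NCCL2 launch
+on a single machine, and is not part of this repository.
+
+``` python
+import os
+import subprocess
+
+endpoints = ["127.0.0.1:7160", "127.0.0.1:7161"]
+procs = []
+for i, ep in enumerate(endpoints):
+    env = dict(os.environ)
+    env.update({
+        "PADDLE_TRAINING_ROLE": "TRAINER",
+        "PADDLE_TRAINER_ENDPOINTS": ",".join(endpoints),
+        "PADDLE_CURRENT_ENDPOINT": ep,
+        "PADDLE_TRAINER_ID": str(i),
+        "PADDLE_TRAINERS_NUM": str(len(endpoints)),
+        "CUDA_VISIBLE_DEVICES": str(i),  # one GPU per trainer
+        "NCCL_P2P_DISABLE": "1",         # allow NCCL2 to run on one node
+    })
+    procs.append(subprocess.Popen(
+        ["python", "dist_train.py", "--update_method", "nccl2",
+         "--batch_size", "32"], env=env))
+for p in procs:
+    p.wait()
+```
+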
+### Try Out Different Distributed Training Modes
+
+You can test if distributed training works on a single node before deploying to the "real" cluster.
+
+***NOTE: for best performance, we recommend using the multi-process mode (see item 4 below), together with fp16 training.***
+
+1. Simply run `python dist_train.py` to start local training with the default configuration.
+2. For pserver mode, run `bash run_ps_mode.sh` to start 2 pservers and 2 trainers; the trainers
+   will use GPUs 0 and 1 to simulate 2 workers.
+3. For nccl2 mode, run `bash run_nccl2_mode.sh` to start 2 workers.
+4. For local/distributed multi-process mode, run `bash run_mp_mode.sh` (this test uses 4 GPUs).
### Visualize the Training Process
@@ -88,16 +51,10 @@ It's easy to draw the learning curve accroding to the training logs, for example
The logs of ResNet50 are as follows:
``` text
-Pass 0, batch 0, loss 7.0336914, accucacys: [0.0, 0.00390625]
-Pass 0, batch 1, loss 7.094781, accucacys: [0.0, 0.0]
-Pass 0, batch 2, loss 7.007068, accucacys: [0.0, 0.0078125]
-Pass 0, batch 3, loss 7.1056547, accucacys: [0.00390625, 0.00390625]
-Pass 0, batch 4, loss 7.133543, accucacys: [0.0, 0.0078125]
-Pass 0, batch 5, loss 7.3055463, accucacys: [0.0078125, 0.01171875]
-Pass 0, batch 6, loss 7.341838, accucacys: [0.0078125, 0.01171875]
-Pass 0, batch 7, loss 7.290557, accucacys: [0.0, 0.0]
-Pass 0, batch 8, loss 7.264951, accucacys: [0.0, 0.00390625]
-Pass 0, batch 9, loss 7.43522, accucacys: [0.00390625, 0.00390625]
+Pass 0, batch 30, loss 7.569439, acc1: 0.0125, acc5: 0.0125, avg batch time 0.1720
+Pass 0, batch 60, loss 7.027379, acc1: 0.0, acc5: 0.0, avg batch time 0.1551
+Pass 0, batch 90, loss 6.819984, acc1: 0.0, acc5: 0.0125, avg batch time 0.1492
+Pass 0, batch 120, loss 6.9076853, acc1: 0.0, acc5: 0.0125, avg batch time 0.1464
```
The below figure shows top 1 train accuracy for local training with 8 GPUs and distributed training
diff --git a/fluid/PaddleCV/image_classification/dist_train/batch_merge.py b/fluid/PaddleCV/image_classification/dist_train/batch_merge.py
new file mode 100644
index 0000000000000000000000000000000000000000..7215cd586cb8ecf95a11b19e43106ad4aaea8029
--- /dev/null
+++ b/fluid/PaddleCV/image_classification/dist_train/batch_merge.py
@@ -0,0 +1,42 @@
+import numpy as np
+import paddle.fluid as fluid
+
+def copyback_repeat_bn_params(main_prog):
+ repeat_vars = set()
+ for op in main_prog.global_block().ops:
+ if op.type == "batch_norm":
+ repeat_vars.add(op.input("Mean")[0])
+ repeat_vars.add(op.input("Variance")[0])
+ for vname in repeat_vars:
+ real_var = fluid.global_scope().find_var("%s.repeat.0" % vname).get_tensor()
+ orig_var = fluid.global_scope().find_var(vname).get_tensor()
+ orig_var.set(np.array(real_var), fluid.CUDAPlace(0)) # test on GPU0
+
+def append_bn_repeat_init_op(main_prog, startup_prog, num_repeats):
+ repeat_vars = set()
+ for op in main_prog.global_block().ops:
+ if op.type == "batch_norm":
+ repeat_vars.add(op.input("Mean")[0])
+ repeat_vars.add(op.input("Variance")[0])
+
+ for i in range(num_repeats):
+ for op in startup_prog.global_block().ops:
+ if op.type == "fill_constant":
+ for oname in op.output_arg_names:
+ if oname in repeat_vars:
+ var = startup_prog.global_block().var(oname)
+ repeat_var_name = "%s.repeat.%d" % (oname, i)
+ repeat_var = startup_prog.global_block().create_var(
+ name=repeat_var_name,
+ type=var.type,
+ dtype=var.dtype,
+ shape=var.shape,
+ persistable=var.persistable
+ )
+ main_prog.global_block()._clone_variable(repeat_var)
+ startup_prog.global_block().append_op(
+ type="fill_constant",
+ inputs={},
+ outputs={"Out": repeat_var},
+ attrs=op.all_attrs()
+ )
+
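+# A minimal smoke test for the repeat-variable bookkeeping above (an
+# illustrative sketch; a CPU build of PaddlePaddle is assumed to be enough):
+# build a tiny program containing one batch_norm op and verify that
+# append_bn_repeat_init_op creates the "<var>.repeat.<i>" copies that
+# multi_batch_merge_pass expects.
+if __name__ == "__main__":
+    main_prog = fluid.Program()
+    startup_prog = fluid.Program()
+    with fluid.program_guard(main_prog, startup_prog):
+        data = fluid.layers.data(name="x", shape=[3, 8, 8], dtype="float32")
+        conv = fluid.layers.conv2d(data, num_filters=4, filter_size=3)
+        fluid.layers.batch_norm(conv)
+    append_bn_repeat_init_op(main_prog, startup_prog, num_repeats=2)
+    print([v.name for v in startup_prog.list_vars() if ".repeat." in v.name])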
diff --git a/fluid/PaddleCV/image_classification/dist_train/dist_train.py b/fluid/PaddleCV/image_classification/dist_train/dist_train.py
index 11e08aa89ccee3960f9fdf4751f89b4fdb7a2e7b..8b7d5c569100a2b3769f584311e35569f61cd13c 100644
--- a/fluid/PaddleCV/image_classification/dist_train/dist_train.py
+++ b/fluid/PaddleCV/image_classification/dist_train/dist_train.py
@@ -16,6 +16,8 @@ import argparse
import time
import os
import traceback
+import functools
+import subprocess
import numpy as np
@@ -28,127 +30,121 @@ sys.path.append("..")
import models
import utils
from reader import train, val
+from utility import add_arguments, print_arguments
+from batch_merge import copyback_repeat_bn_params, append_bn_repeat_init_op
+from dist_utils import pserver_prepare, nccl2_prepare
+from env import dist_env
def parse_args():
- parser = argparse.ArgumentParser('Distributed Image Classification Training.')
- parser.add_argument(
- '--model',
- type=str,
- default='DistResNet',
- help='The model to run.')
- parser.add_argument(
- '--batch_size', type=int, default=32, help='The minibatch size per device.')
- parser.add_argument(
- '--multi_batch_repeat', type=int, default=1, help='Batch merge repeats.')
- parser.add_argument(
- '--learning_rate', type=float, default=0.1, help='The learning rate.')
- parser.add_argument(
- '--pass_num', type=int, default=90, help='The number of passes.')
- parser.add_argument(
- '--data_format',
- type=str,
- default='NCHW',
- choices=['NCHW', 'NHWC'],
- help='The data data_format, now only support NCHW.')
- parser.add_argument(
- '--device',
- type=str,
- default='GPU',
- choices=['CPU', 'GPU'],
- help='The device type.')
- parser.add_argument(
- '--gpus',
- type=int,
- default=1,
- help='If gpus > 1, will use ParallelExecutor to run, else use Executor.')
- parser.add_argument(
- '--cpus',
- type=int,
- default=1,
- help='If cpus > 1, will set ParallelExecutor to use multiple threads.')
- parser.add_argument(
- '--no_test',
- action='store_true',
- help='If set, do not test the testset during training.')
- parser.add_argument(
- '--memory_optimize',
- action='store_true',
- help='If set, optimize runtime memory before start.')
- parser.add_argument(
- '--update_method',
- type=str,
- default='local',
- choices=['local', 'pserver', 'nccl2'],
- help='Choose parameter update method, can be local, pserver, nccl2.')
- parser.add_argument(
- '--no_split_var',
- action='store_true',
- default=False,
- help='Whether split variables into blocks when update_method is pserver')
- parser.add_argument(
- '--async_mode',
- action='store_true',
- default=False,
- help='Whether start pserver in async mode to support ASGD')
- parser.add_argument(
- '--reduce_strategy',
- type=str,
- choices=['reduce', 'all_reduce'],
- default='all_reduce',
- help='Specify the reduce strategy, can be reduce, all_reduce')
- parser.add_argument(
- '--data_dir',
- type=str,
- default="../data/ILSVRC2012",
- help="The ImageNet dataset root dir."
- )
+ parser = argparse.ArgumentParser(description=__doc__)
+ add_arg = functools.partial(add_arguments, argparser=parser)
+ # yapf: disable
+ add_arg('batch_size', int, 256, "Minibatch size.")
+ add_arg('use_gpu', bool, True, "Whether to use GPU or not.")
+ add_arg('total_images', int, 1281167, "Training image number.")
+ add_arg('num_epochs', int, 120, "number of epochs.")
+ add_arg('class_dim', int, 1000, "Class number.")
+ add_arg('image_shape', str, "3,224,224", "input image size")
+ add_arg('model_save_dir', str, "output", "model save directory")
+ add_arg('with_mem_opt', bool, False, "Whether to use memory optimization or not.")
+ add_arg('pretrained_model', str, None, "Whether to use pretrained model.")
+ add_arg('checkpoint', str, None, "Whether to resume checkpoint.")
+ add_arg('lr', float, 0.1, "set learning rate.")
+ add_arg('lr_strategy', str, "piecewise_decay", "Set the learning rate decay strategy.")
+ add_arg('model', str, "DistResNet", "Set the network to use.")
+ add_arg('enable_ce', bool, False, "If set True, enable continuous evaluation job.")
+ add_arg('data_dir', str, "./data/ILSVRC2012", "The ImageNet dataset root dir.")
+ add_arg('model_category', str, "models", "Whether to use models_name or not, valid value:'models','models_name'" )
+ add_arg('fp16', bool, False, "Enable half precision training with fp16." )
+ add_arg('scale_loss', float, 1.0, "Scale loss for fp16." )
+ # for distributed
+ add_arg('update_method', str, "local", "Can be local, pserver, nccl2.")
+ add_arg('multi_batch_repeat', int, 1, "Batch merge repeats.")
+ add_arg('start_test_pass', int, 0, "Start test after x passes.")
+ add_arg('num_threads', int, 8, "Use num_threads to run the fluid program.")
+ add_arg('split_var', bool, True, "Split params on pserver.")
+ add_arg('async_mode', bool, False, "Async distributed training, only for pserver mode.")
+ add_arg('reduce_strategy', str, "allreduce", "Choose from reduce or allreduce.")
+    add_arg('skip_unbalanced_data', bool, False, "Skip remaining batches if the data is not evenly balanced across nodes.")
+ # yapf: enable
args = parser.parse_args()
return args
-def get_model(args, is_train, main_prog, startup_prog):
- pyreader = None
- class_dim = 1000
- if args.data_format == 'NCHW':
- dshape = [3, 224, 224]
+def get_device_num():
+ if os.getenv("CPU_NUM"):
+ return int(os.getenv("CPU_NUM"))
+ visible_device = os.getenv('CUDA_VISIBLE_DEVICES')
+ if visible_device:
+ device_num = len(visible_device.split(','))
else:
- dshape = [224, 224, 3]
+ device_num = subprocess.check_output(['nvidia-smi', '-L']).decode().count('\n')
+ return device_num
+
+def prepare_reader(is_train, pyreader, args, pass_id=0):
if is_train:
- reader = train(data_dir=args.data_dir)
+ reader = train(data_dir=args.data_dir, pass_id_as_seed=pass_id)
else:
reader = val(data_dir=args.data_dir)
+ if is_train:
+        bs = args.batch_size // get_device_num()
+ else:
+ bs = 16
+ pyreader.decorate_paddle_reader(
+ paddle.batch(
+ reader,
+ batch_size=bs))
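+    # NOTE: --batch_size is the batch size per trainer process; for training
+    # it is divided evenly across the visible devices, while evaluation uses
+    # a fixed batch size of 16.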
- trainer_count = int(os.getenv("PADDLE_TRAINERS", "1"))
+def build_program(is_train, main_prog, startup_prog, args):
+ pyreader = None
+ class_dim = args.class_dim
+ image_shape = [int(m) for m in args.image_shape.split(",")]
+
+ trainer_count = args.dist_env["num_trainers"]
+ device_num_per_worker = get_device_num()
with fluid.program_guard(main_prog, startup_prog):
+ pyreader = fluid.layers.py_reader(
+ capacity=16,
+ shapes=([-1] + image_shape, (-1, 1)),
+ dtypes=('float32', 'int64'),
+ name="train_reader" if is_train else "test_reader",
+ use_double_buffer=True)
with fluid.unique_name.guard():
- pyreader = fluid.layers.py_reader(
- capacity=args.batch_size * args.gpus,
- shapes=([-1] + dshape, (-1, 1)),
- dtypes=('float32', 'int64'),
- name="train_reader" if is_train else "test_reader",
- use_double_buffer=True)
- input, label = fluid.layers.read_file(pyreader)
+ image, label = fluid.layers.read_file(pyreader)
+ if args.fp16:
+ image = fluid.layers.cast(image, "float16")
model_def = models.__dict__[args.model](layers=50, is_train=is_train)
- predict = model_def.net(input, class_dim=class_dim)
-
- cost = fluid.layers.cross_entropy(input=predict, label=label)
- avg_cost = fluid.layers.mean(x=cost)
+ predict = model_def.net(image, class_dim=class_dim)
+ cost, pred = fluid.layers.softmax_with_cross_entropy(predict, label, return_softmax=True)
+ if args.scale_loss > 1:
+ avg_cost = fluid.layers.mean(x=cost) * float(args.scale_loss)
+ else:
+ avg_cost = fluid.layers.mean(x=cost)
- batch_acc1 = fluid.layers.accuracy(input=predict, label=label, k=1)
- batch_acc5 = fluid.layers.accuracy(input=predict, label=label, k=5)
+ batch_acc1 = fluid.layers.accuracy(input=pred, label=label, k=1)
+ batch_acc5 = fluid.layers.accuracy(input=pred, label=label, k=5)
optimizer = None
if is_train:
- start_lr = args.learning_rate
- # n * worker * repeat
- end_lr = args.learning_rate * trainer_count * args.multi_batch_repeat
- total_images = 1281167 / trainer_count
- step = int(total_images / (args.batch_size * args.gpus * args.multi_batch_repeat) + 1)
+ start_lr = args.lr
+ end_lr = args.lr * trainer_count * args.multi_batch_repeat
+ if os.getenv("FLAGS_selected_gpus"):
+            # in multi-process mode, "trainer_count" is the total device
+            # count of the whole cluster, so additionally scale the end
+            # learning rate by the per-worker device count.
+ end_lr *= device_num_per_worker
+
+ total_images = args.total_images / trainer_count
+ step = int(total_images / (args.batch_size * args.multi_batch_repeat) + 1)
warmup_steps = step * 5 # warmup 5 passes
epochs = [30, 60, 80]
bd = [step * e for e in epochs]
base_lr = end_lr
lr = []
lr = [base_lr * (0.1**i) for i in range(len(bd) + 1)]
+ print("start lr: %s, end lr: %s, decay boundaries: %s" % (
+ start_lr,
+ end_lr,
+ bd
+ ))
# NOTE: we put weight decay in layers config, and remove
# weight decay on bn layers, so don't add weight decay in
@@ -159,151 +155,77 @@ def get_model(args, is_train, main_prog, startup_prog):
boundaries=bd, values=lr),
warmup_steps, start_lr, end_lr),
momentum=0.9)
- optimizer.minimize(avg_cost)
+ if args.fp16:
+ params_grads = optimizer.backward(avg_cost)
+ master_params_grads = utils.create_master_params_grads(
+ params_grads, main_prog, startup_prog, args.scale_loss)
+ optimizer.apply_gradients(master_params_grads)
+ utils.master_param_to_train_param(master_params_grads, params_grads, main_prog)
+ else:
+ optimizer.minimize(avg_cost)
- batched_reader = None
- pyreader.decorate_paddle_reader(
- paddle.batch(
- reader,
- batch_size=args.batch_size))
-
- return avg_cost, optimizer, [batch_acc1,
- batch_acc5], batched_reader, pyreader
-
-def append_nccl2_prepare(trainer_id, startup_prog):
- trainer_id = int(os.getenv("PADDLE_TRAINER_ID"))
- port = os.getenv("PADDLE_PSERVER_PORT")
- worker_ips = os.getenv("PADDLE_TRAINER_IPS")
- worker_endpoints = []
- for ip in worker_ips.split(","):
- worker_endpoints.append(':'.join([ip, port]))
- current_endpoint = os.getenv("PADDLE_CURRENT_IP") + ":" + port
- num_trainers = len(worker_endpoints)
-
- config = fluid.DistributeTranspilerConfig()
- config.mode = "nccl2"
- t = fluid.DistributeTranspiler(config=config)
- t.transpile(trainer_id, trainers=','.join(worker_endpoints),
- current_endpoint=current_endpoint,
- startup_program=startup_prog)
- return num_trainers, trainer_id
-
-
-def dist_transpile(trainer_id, args, train_prog, startup_prog):
- port = os.getenv("PADDLE_PSERVER_PORT", "6174")
- pserver_ips = os.getenv("PADDLE_PSERVER_IPS", "")
- eplist = []
- for ip in pserver_ips.split(","):
- eplist.append(':'.join([ip, port]))
- pserver_endpoints = ",".join(eplist)
- trainers = int(os.getenv("PADDLE_TRAINERS"))
- current_endpoint = os.getenv("PADDLE_CURRENT_IP", "") + ":" + port
- training_role = os.getenv("PADDLE_TRAINING_ROLE")
-
- config = fluid.DistributeTranspilerConfig()
- config.slice_var_up = not args.no_split_var
- t = fluid.DistributeTranspiler(config=config)
- t.transpile(
- trainer_id,
- program=train_prog,
- pservers=pserver_endpoints,
- trainers=trainers,
- sync_mode=not args.async_mode,
- startup_program=startup_prog)
- if training_role == "PSERVER":
- pserver_program = t.get_pserver_program(current_endpoint)
- pserver_startup_program = t.get_startup_program(
- current_endpoint, pserver_program, startup_program=startup_prog)
- return pserver_program, pserver_startup_program
- elif training_role == "TRAINER":
- train_program = t.get_trainer_program()
- return train_program, startup_prog
- else:
- raise ValueError(
- 'PADDLE_TRAINING_ROLE environment variable must be either TRAINER or PSERVER'
- )
-
-def append_bn_repeat_init_op(main_prog, startup_prog, num_repeats):
- repeat_vars = set()
- for op in main_prog.global_block().ops:
- if op.type == "batch_norm":
- repeat_vars.add(op.input("Mean")[0])
- repeat_vars.add(op.input("Variance")[0])
-
- for i in range(num_repeats):
- for op in startup_prog.global_block().ops:
- if op.type == "fill_constant":
- for oname in op.output_arg_names:
- if oname in repeat_vars:
- var = startup_prog.global_block().var(oname)
- repeat_var_name = "%s.repeat.%d" % (oname, i)
- repeat_var = startup_prog.global_block().create_var(
- name=repeat_var_name,
- type=var.type,
- dtype=var.dtype,
- shape=var.shape,
- persistable=var.persistable
- )
- main_prog.global_block()._clone_variable(repeat_var)
- startup_prog.global_block().append_op(
- type="fill_constant",
- inputs={},
- outputs={"Out": repeat_var},
- attrs=op.all_attrs()
- )
-
-
-def copyback_repeat_bn_params(main_prog):
- repeat_vars = set()
- for op in main_prog.global_block().ops:
- if op.type == "batch_norm":
- repeat_vars.add(op.input("Mean")[0])
- repeat_vars.add(op.input("Variance")[0])
- for vname in repeat_vars:
- real_var = fluid.global_scope().find_var("%s.repeat.0" % vname).get_tensor()
- orig_var = fluid.global_scope().find_var(vname).get_tensor()
- orig_var.set(np.array(real_var), fluid.CUDAPlace(0)) # test on GPU0
-
-
-def test_single(exe, test_args, args, test_prog):
- acc_evaluators = []
- for i in xrange(len(test_args[2])):
- acc_evaluators.append(fluid.metrics.Accuracy())
-
- to_fetch = [v.name for v in test_args[2]]
- test_args[4].start()
+ # prepare reader for current program
+ prepare_reader(is_train, pyreader, args)
+
+ return pyreader, avg_cost, batch_acc1, batch_acc5
+
+
+def test_single(exe, test_prog, args, pyreader, fetch_list):
+ acc1 = fluid.metrics.Accuracy()
+ acc5 = fluid.metrics.Accuracy()
+ test_losses = []
+ pyreader.start()
while True:
try:
- acc_rets = exe.run(program=test_prog, fetch_list=to_fetch)
- for i, e in enumerate(acc_evaluators):
- e.update(
- value=np.array(acc_rets[i]), weight=args.batch_size)
- except fluid.core.EOFException as eof:
- test_args[4].reset()
+ acc_rets = exe.run(program=test_prog, fetch_list=fetch_list)
+ test_losses.append(acc_rets[0])
+ acc1.update(value=np.array(acc_rets[1]), weight=args.batch_size)
+ acc5.update(value=np.array(acc_rets[2]), weight=args.batch_size)
+ except fluid.core.EOFException:
+ pyreader.reset()
break
+ test_avg_loss = np.mean(np.array(test_losses))
+ return test_avg_loss, np.mean(acc1.eval()), np.mean(acc5.eval())
+
+def run_pserver(train_prog, startup_prog):
+ server_exe = fluid.Executor(fluid.CPUPlace())
+ server_exe.run(startup_prog)
+ server_exe.run(train_prog)
- return [e.eval() for e in acc_evaluators]
+def train_parallel(args):
+ train_prog = fluid.Program()
+ test_prog = fluid.Program()
+ startup_prog = fluid.Program()
+ train_pyreader, train_cost, train_acc1, train_acc5 = build_program(True, train_prog, startup_prog, args)
+ test_pyreader, test_cost, test_acc1, test_acc5 = build_program(False, test_prog, startup_prog, args)
-def train_parallel(train_args, test_args, args, train_prog, test_prog,
- startup_prog, num_trainers, trainer_id):
- over_all_start = time.time()
- place = core.CPUPlace() if args.device == 'CPU' else core.CUDAPlace(0)
+ if args.update_method == "pserver":
+ train_prog, startup_prog = pserver_prepare(args, train_prog, startup_prog)
+ elif args.update_method == "nccl2":
+ nccl2_prepare(args, startup_prog)
- if args.update_method == "nccl2" and trainer_id == 0:
- #FIXME(typhoonzero): wait other trainer to start listening
- time.sleep(30)
+ if args.dist_env["training_role"] == "PSERVER":
+ run_pserver(train_prog, startup_prog)
+ exit(0)
+
+ if args.use_gpu:
+ # NOTE: for multi process mode: one process per GPU device.
+ gpu_id = 0
+ if os.getenv("FLAGS_selected_gpus"):
+ gpu_id = int(os.getenv("FLAGS_selected_gpus"))
+ place = core.CUDAPlace(gpu_id) if args.use_gpu else core.CPUPlace()
startup_exe = fluid.Executor(place)
if args.multi_batch_repeat > 1:
append_bn_repeat_init_op(train_prog, startup_prog, args.multi_batch_repeat)
startup_exe.run(startup_prog)
+
strategy = fluid.ExecutionStrategy()
- strategy.num_threads = args.cpus
- strategy.allow_op_delay = False
+ strategy.num_threads = args.num_threads
build_strategy = fluid.BuildStrategy()
if args.multi_batch_repeat > 1:
- pass_builder = build_strategy._create_passes_from_strategy()
+ pass_builder = build_strategy._finalize_strategy_and_create_passes()
mypass = pass_builder.insert_pass(
len(pass_builder.all_passes()) - 2, "multi_batch_merge_pass")
mypass.set_int("num_repeats", args.multi_batch_repeat)
@@ -314,73 +236,70 @@ def train_parallel(train_args, test_args, args, train_prog, test_prog,
build_strategy.reduce_strategy = fluid.BuildStrategy(
).ReduceStrategy.AllReduce
- avg_loss = train_args[0]
-
- if args.update_method == "pserver":
+ if args.update_method == "pserver" or args.update_method == "local":
# parameter server mode distributed training, merge
# gradients on local server, do not initialize
# ParallelExecutor with multi server all-reduce mode.
num_trainers = 1
trainer_id = 0
+ else:
+ num_trainers = args.dist_env["num_trainers"]
+ trainer_id = args.dist_env["trainer_id"]
exe = fluid.ParallelExecutor(
True,
- avg_loss.name,
+ train_cost.name,
main_program=train_prog,
exec_strategy=strategy,
build_strategy=build_strategy,
num_trainers=num_trainers,
trainer_id=trainer_id)
- pyreader = train_args[4]
- for pass_id in range(args.pass_num):
+ over_all_start = time.time()
+ fetch_list = [train_cost.name, train_acc1.name, train_acc5.name]
+ steps_per_pass = args.total_images / args.batch_size / args.dist_env["num_trainers"]
+ for pass_id in range(args.num_epochs):
num_samples = 0
start_time = time.time()
- batch_id = 0
- pyreader.start()
+ batch_id = 1
+        # use pass_id + 1 as the shuffle seed so every trainer shares the same global shuffle in each pass
+ prepare_reader(True, train_pyreader, args, pass_id + 1)
+ train_pyreader.start()
while True:
- fetch_list = [avg_loss.name]
- acc_name_list = [v.name for v in train_args[2]]
- fetch_list.extend(acc_name_list)
try:
if batch_id % 30 == 0:
fetch_ret = exe.run(fetch_list)
+ fetched_data = [np.mean(np.array(d)) for d in fetch_ret]
+ print("Pass %d, batch %d, loss %s, acc1: %s, acc5: %s, avg batch time %.4f" %
+ (pass_id, batch_id, fetched_data[0], fetched_data[1],
+ fetched_data[2], (time.time()-start_time) / batch_id))
else:
fetch_ret = exe.run([])
- except fluid.core.EOFException as eof:
+ except fluid.core.EOFException:
break
- except fluid.core.EnforceNotMet as ex:
+ except fluid.core.EnforceNotMet:
traceback.print_exc()
break
- num_samples += args.batch_size * args.gpus
-
- if batch_id % 30 == 0:
- fetched_data = [np.mean(np.array(d)) for d in fetch_ret]
- print("Pass %d, batch %d, loss %s, accucacys: %s" %
- (pass_id, batch_id, fetched_data[0], fetched_data[1:]))
+ num_samples += args.batch_size
batch_id += 1
+ if args.skip_unbalanced_data and batch_id >= steps_per_pass:
+ break
print_train_time(start_time, time.time(), num_samples)
- pyreader.reset()
+ train_pyreader.reset()
- if not args.no_test and test_args[2]:
+ if pass_id > args.start_test_pass:
if args.multi_batch_repeat > 1:
copyback_repeat_bn_params(train_prog)
- test_ret = test_single(startup_exe, test_args, args, test_prog)
- print("Pass: %d, Test Accuracy: %s\n" %
- (pass_id, [np.mean(np.array(v)) for v in test_ret]))
+ test_fetch_list = [test_cost.name, test_acc1.name, test_acc5.name]
+            test_ret = test_single(startup_exe, test_prog, args, test_pyreader, test_fetch_list)
+ print("Pass: %d, Test Loss %s, test acc1: %s, test acc5: %s\n" %
+ (pass_id, test_ret[0], test_ret[1], test_ret[2]))
startup_exe.close()
print("total train time: ", time.time() - over_all_start)
-def print_arguments(args):
- print('----------- Configuration Arguments -----------')
- for arg, value in sorted(six.iteritems(vars(args))):
- print('%s: %s' % (arg, value))
- print('------------------------------------------------')
-
-
def print_train_time(start_time, end_time, num_samples):
train_elapsed = end_time - start_time
examples_per_sec = num_samples / train_elapsed
@@ -400,47 +319,8 @@ def main():
args = parse_args()
print_arguments(args)
print_paddle_envs()
-
- # the unique trainer id, starting from 0, needed by trainer
- # only
- num_trainers, trainer_id = (
- 1, int(os.getenv("PADDLE_TRAINER_ID", "0")))
-
- train_prog = fluid.Program()
- test_prog = fluid.Program()
- startup_prog = fluid.Program()
-
- train_args = list(get_model(args, True, train_prog, startup_prog))
- test_args = list(get_model(args, False, test_prog, startup_prog))
-
- all_args = [train_args, test_args, args]
-
- if args.update_method == "pserver":
- train_prog, startup_prog = dist_transpile(trainer_id, args, train_prog,
- startup_prog)
- if not train_prog:
- raise Exception(
- "Must configure correct environments to run dist train.")
- all_args.extend([train_prog, test_prog, startup_prog])
- if os.getenv("PADDLE_TRAINING_ROLE") == "TRAINER":
- all_args.extend([num_trainers, trainer_id])
- train_parallel(*all_args)
- elif os.getenv("PADDLE_TRAINING_ROLE") == "PSERVER":
- # start pserver with Executor
- server_exe = fluid.Executor(fluid.CPUPlace())
- server_exe.run(startup_prog)
- server_exe.run(train_prog)
- exit(0)
-
- # for other update methods, use default programs
- all_args.extend([train_prog, test_prog, startup_prog])
-
- if args.update_method == "nccl2":
- num_trainers, trainer_id = append_nccl2_prepare(
- trainer_id, startup_prog)
-
- all_args.extend([num_trainers, trainer_id])
- train_parallel(*all_args)
+ args.dist_env = dist_env()
+ train_parallel(args)
if __name__ == "__main__":
main()
diff --git a/fluid/PaddleCV/image_classification/dist_train/dist_utils.py b/fluid/PaddleCV/image_classification/dist_train/dist_utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..51007273f717fe815d684aaae7c02b3d7245c4e7
--- /dev/null
+++ b/fluid/PaddleCV/image_classification/dist_train/dist_utils.py
@@ -0,0 +1,43 @@
+import os
+import paddle.fluid as fluid
+
+
+def nccl2_prepare(args, startup_prog):
+ config = fluid.DistributeTranspilerConfig()
+ config.mode = "nccl2"
+ t = fluid.DistributeTranspiler(config=config)
+
+ envs = args.dist_env
+
+ t.transpile(envs["trainer_id"],
+ trainers=','.join(envs["trainer_endpoints"]),
+ current_endpoint=envs["current_endpoint"],
+ startup_program=startup_prog)
+
+
+def pserver_prepare(args, train_prog, startup_prog):
+ config = fluid.DistributeTranspilerConfig()
+ config.slice_var_up = args.split_var
+ t = fluid.DistributeTranspiler(config=config)
+ envs = args.dist_env
+ training_role = envs["training_role"]
+
+ t.transpile(
+ envs["trainer_id"],
+ program=train_prog,
+ pservers=envs["pserver_endpoints"],
+ trainers=envs["num_trainers"],
+ sync_mode=not args.async_mode,
+ startup_program=startup_prog)
+ if training_role == "PSERVER":
+ pserver_program = t.get_pserver_program(envs["current_endpoint"])
+ pserver_startup_program = t.get_startup_program(
+ envs["current_endpoint"], pserver_program, startup_program=startup_prog)
+ return pserver_program, pserver_startup_program
+ elif training_role == "TRAINER":
+ train_program = t.get_trainer_program()
+ return train_program, startup_prog
+ else:
+ raise ValueError(
+ 'PADDLE_TRAINING_ROLE environment variable must be either TRAINER or PSERVER'
+ )
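+
+# Typical call sites, mirroring dist_train.py: in pserver mode,
+#     train_prog, startup_prog = pserver_prepare(args, train_prog, startup_prog)
+# and in nccl2 mode,
+#     nccl2_prepare(args, startup_prog)
+# where args.dist_env is the dict returned by env.dist_env().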
diff --git a/fluid/PaddleCV/image_classification/dist_train/env.py b/fluid/PaddleCV/image_classification/dist_train/env.py
new file mode 100644
index 0000000000000000000000000000000000000000..f85297e4d3e24322176ad25ee34366f446e18896
--- /dev/null
+++ b/fluid/PaddleCV/image_classification/dist_train/env.py
@@ -0,0 +1,33 @@
+import os
+
+
+def dist_env():
+ """
+    Return a dict of all variables that distributed training may use.
+ NOTE: you may rewrite this function to suit your cluster environments.
+ """
+ trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
+ num_trainers = 1
+ training_role = os.getenv("PADDLE_TRAINING_ROLE", "TRAINER")
+    assert training_role in ("PSERVER", "TRAINER")
+
+ # - PADDLE_TRAINER_ENDPOINTS means nccl2 mode.
+ # - PADDLE_PSERVER_ENDPOINTS means pserver mode.
+ # - PADDLE_CURRENT_ENDPOINT means current process endpoint.
+ trainer_endpoints = os.getenv("PADDLE_TRAINER_ENDPOINTS")
+ pserver_endpoints = os.getenv("PADDLE_PSERVER_ENDPOINTS")
+ current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT")
+ if trainer_endpoints:
+ trainer_endpoints = trainer_endpoints.split(",")
+ num_trainers = len(trainer_endpoints)
+ elif pserver_endpoints:
+ num_trainers = int(os.getenv("PADDLE_TRAINERS_NUM"))
+
+ return {
+ "trainer_id": trainer_id,
+ "num_trainers": num_trainers,
+ "current_endpoint": current_endpoint,
+ "training_role": training_role,
+ "pserver_endpoints": pserver_endpoints,
+ "trainer_endpoints": trainer_endpoints
+ }
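+
+
+if __name__ == "__main__":
+    # Quick sanity check: with no PADDLE_* variables set, this prints a
+    # single-trainer local configuration (trainer_id 0, num_trainers 1).
+    print(dist_env())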
diff --git a/fluid/PaddleCV/image_classification/dist_train/run_mp_mode.sh b/fluid/PaddleCV/image_classification/dist_train/run_mp_mode.sh
new file mode 100755
index 0000000000000000000000000000000000000000..bf04e078284f02be0774209a599b839d0bbf20f5
--- /dev/null
+++ b/fluid/PaddleCV/image_classification/dist_train/run_mp_mode.sh
@@ -0,0 +1,19 @@
+#!/bin/bash
+
+# Test using 4 GPUs
+export CUDA_VISIBLE_DEVICES="0,1,2,3"
+export MODEL="DistResNet"
+export PADDLE_TRAINER_ENDPOINTS="127.0.0.1:7160,127.0.0.1:7161,127.0.0.1:7162,127.0.0.1:7163"
+# PADDLE_TRAINERS_NUM is used only by the reader in nccl2 mode
+export PADDLE_TRAINERS_NUM="4"
+
+mkdir -p logs
+
+for i in {0..3}
+do
+PADDLE_TRAINING_ROLE="TRAINER" \
+PADDLE_CURRENT_ENDPOINT="127.0.0.1:716${i}" \
+PADDLE_TRAINER_ID="${i}" \
+FLAGS_selected_gpus="${i}" \
+python dist_train.py --model $MODEL --update_method nccl2 --batch_size 32 --fp16 1 --scale_loss 8 &> logs/tr$i.log &
+done
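+
+# Each process above drives exactly one GPU, selected via FLAGS_selected_gpus;
+# dist_train.py reads this flag to pick CUDAPlace(i) and to scale the end
+# learning rate by the per-worker device count.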
diff --git a/fluid/PaddleCV/image_classification/dist_train/run_nccl2_mode.sh b/fluid/PaddleCV/image_classification/dist_train/run_nccl2_mode.sh
new file mode 100755
index 0000000000000000000000000000000000000000..120a96647e093de6af362bd51d8e6942249db56f
--- /dev/null
+++ b/fluid/PaddleCV/image_classification/dist_train/run_nccl2_mode.sh
@@ -0,0 +1,24 @@
+#!/bin/bash
+
+export MODEL="DistResNet"
+export PADDLE_TRAINER_ENDPOINTS="127.0.0.1:7160,127.0.0.1:7161"
+# PADDLE_TRAINERS_NUM is used only by the reader in nccl2 mode
+export PADDLE_TRAINERS_NUM="2"
+
+mkdir -p logs
+
+# NOTE: set NCCL_P2P_DISABLE so that nccl2 distributed training can run on a single node.
+
+PADDLE_TRAINING_ROLE="TRAINER" \
+PADDLE_CURRENT_ENDPOINT="127.0.0.1:7160" \
+PADDLE_TRAINER_ID="0" \
+CUDA_VISIBLE_DEVICES="0" \
+NCCL_P2P_DISABLE="1" \
+python dist_train.py --model $MODEL --update_method nccl2 --batch_size 32 &> logs/tr0.log &
+
+PADDLE_TRAINING_ROLE="TRAINER" \
+PADDLE_CURRENT_ENDPOINT="127.0.0.1:7161" \
+PADDLE_TRAINER_ID="1" \
+CUDA_VISIBLE_DEVICES="1" \
+NCCL_P2P_DISABLE="1" \
+python dist_train.py --model $MODEL --update_method nccl2 --batch_size 32 &> logs/tr1.log &
diff --git a/fluid/PaddleCV/image_classification/dist_train/run_ps_mode.sh b/fluid/PaddleCV/image_classification/dist_train/run_ps_mode.sh
new file mode 100755
index 0000000000000000000000000000000000000000..99926afbb04e0bc2795a4fd7fd8b4ff58aefec31
--- /dev/null
+++ b/fluid/PaddleCV/image_classification/dist_train/run_ps_mode.sh
@@ -0,0 +1,27 @@
+#!/bin/bash
+
+export MODEL="DistResNet"
+export PADDLE_PSERVER_ENDPOINTS="127.0.0.1:7160,127.0.0.1:7161"
+export PADDLE_TRAINERS_NUM="2"
+
+mkdir -p logs
+
+PADDLE_TRAINING_ROLE="PSERVER" \
+PADDLE_CURRENT_ENDPOINT="127.0.0.1:7160" \
+python dist_train.py --model $MODEL --update_method pserver --batch_size 32 &> logs/ps0.log &
+
+PADDLE_TRAINING_ROLE="PSERVER" \
+PADDLE_CURRENT_ENDPOINT="127.0.0.1:7161" \
+python dist_train.py --model $MODEL --update_method pserver --batch_size 32 &> logs/ps1.log &
+
+PADDLE_TRAINING_ROLE="TRAINER" \
+PADDLE_CURRENT_ENDPOINT="127.0.0.1:7160" \
+PADDLE_TRAINER_ID="0" \
+CUDA_VISIBLE_DEVICES="0" \
+python dist_train.py --model $MODEL --update_method pserver --batch_size 32 &> logs/tr0.log &
+
+PADDLE_TRAINING_ROLE="TRAINER" \
+PADDLE_CURRENT_ENDPOINT="127.0.0.1:7161" \
+PADDLE_TRAINER_ID="1" \
+CUDA_VISIBLE_DEVICES="1" \
+python dist_train.py --model $MODEL --update_method pserver --batch_size 32 &> logs/tr1.log &
diff --git a/fluid/PaddleCV/image_classification/eval.py b/fluid/PaddleCV/image_classification/eval.py
index e117047d4c28254a2f2ee811a720d8ea1111dffe..ddce243fe1fcae81ee6064c7ff185fb8a045a402 100644
--- a/fluid/PaddleCV/image_classification/eval.py
+++ b/fluid/PaddleCV/image_classification/eval.py
@@ -7,12 +7,13 @@ import time
import sys
import paddle
import paddle.fluid as fluid
-import models
+#import models
+import models_name as models
#import reader_cv2 as reader
import reader as reader
import argparse
import functools
-from models.learning_rate import cosine_decay
+from utils.learning_rate import cosine_decay
from utility import add_arguments, print_arguments
import math
@@ -48,7 +49,7 @@ def eval(args):
# model definition
model = models.__dict__[model_name]()
- if model_name is "GoogleNet":
+ if model_name == "GoogleNet":
out0, out1, out2 = model.net(input=image, class_dim=class_dim)
cost0 = fluid.layers.cross_entropy(input=out0, label=label)
cost1 = fluid.layers.cross_entropy(input=out1, label=label)
@@ -70,8 +71,10 @@ def eval(args):
test_program = fluid.default_main_program().clone(for_test=True)
+ fetch_list = [avg_cost.name, acc_top1.name, acc_top5.name]
if with_memory_optimization:
- fluid.memory_optimize(fluid.default_main_program())
+ fluid.memory_optimize(
+ fluid.default_main_program(), skip_opt_set=set(fetch_list))
place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace()
exe = fluid.Executor(place)
@@ -84,11 +87,9 @@ def eval(args):
fluid.io.load_vars(exe, pretrained_model, predicate=if_exist)
- val_reader = paddle.batch(reader.val(""), batch_size=args.batch_size)
+ val_reader = paddle.batch(reader.val(), batch_size=args.batch_size)
feeder = fluid.DataFeeder(place=place, feed_list=[image, label])
- fetch_list = [avg_cost.name, acc_top1.name, acc_top5.name]
-
test_info = [[], [], []]
cnt = 0
for batch_id, data in enumerate(val_reader()):
diff --git a/fluid/PaddleCV/image_classification/infer.py b/fluid/PaddleCV/image_classification/infer.py
index 19d204a1a21fde57f4c9b28e0c61cb9fd02edc3c..e89c08d923cdc37596c76dc7146a2666b719844d 100644
--- a/fluid/PaddleCV/image_classification/infer.py
+++ b/fluid/PaddleCV/image_classification/infer.py
@@ -11,7 +11,6 @@ import models
import reader
import argparse
import functools
-from models.learning_rate import cosine_decay
from utility import add_arguments, print_arguments
import math
@@ -44,7 +43,6 @@ def infer(args):
# model definition
model = models.__dict__[model_name]()
-
if model_name is "GoogleNet":
out, _, _ = model.net(input=image, class_dim=class_dim)
else:
@@ -52,8 +50,10 @@ def infer(args):
test_program = fluid.default_main_program().clone(for_test=True)
+ fetch_list = [out.name]
if with_memory_optimization:
- fluid.memory_optimize(fluid.default_main_program())
+ fluid.memory_optimize(
+ fluid.default_main_program(), skip_opt_set=set(fetch_list))
place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace()
exe = fluid.Executor(place)
@@ -70,8 +70,6 @@ def infer(args):
test_reader = paddle.batch(reader.test(), batch_size=test_batch_size)
feeder = fluid.DataFeeder(place=place, feed_list=[image])
- fetch_list = [out.name]
-
TOPK = 1
for batch_id, data in enumerate(test_reader()):
result = exe.run(test_program,
diff --git a/fluid/PaddleCV/image_classification/models/alexnet.py b/fluid/PaddleCV/image_classification/models/alexnet.py
index 3e0eab2dee1d2f2e8d3cb2e8c12a3504a1e7c0e5..abe3b92965b1c16312e4ddf68809f6a4c93183fa 100644
--- a/fluid/PaddleCV/image_classification/models/alexnet.py
+++ b/fluid/PaddleCV/image_classification/models/alexnet.py
@@ -142,7 +142,6 @@ class AlexNet():
out = fluid.layers.fc(
input=fc7,
size=class_dim,
- act='softmax',
bias_attr=fluid.param_attr.ParamAttr(
initializer=fluid.initializer.Uniform(-stdv, stdv)),
param_attr=fluid.param_attr.ParamAttr(
diff --git a/fluid/PaddleCV/image_classification/models/dpn.py b/fluid/PaddleCV/image_classification/models/dpn.py
index ca49898fb76d10344e0b847b3631b194192d7e5e..316e96ac2cebd2dec4a60bf1748ed321fa651590 100644
--- a/fluid/PaddleCV/image_classification/models/dpn.py
+++ b/fluid/PaddleCV/image_classification/models/dpn.py
@@ -94,7 +94,6 @@ class DPN(object):
initializer=fluid.initializer.Uniform(-stdv, stdv))
fc6 = fluid.layers.fc(input=pool5,
size=class_dim,
- act='softmax',
param_attr=param_attr)
return fc6
diff --git a/fluid/PaddleCV/image_classification/models/inception_v4.py b/fluid/PaddleCV/image_classification/models/inception_v4.py
index d3a80a20500f365166d50a0cf222613d0427354f..1520375477ade6e61f0a5584278b13e40ab541eb 100644
--- a/fluid/PaddleCV/image_classification/models/inception_v4.py
+++ b/fluid/PaddleCV/image_classification/models/inception_v4.py
@@ -47,7 +47,6 @@ class InceptionV4():
out = fluid.layers.fc(
input=drop,
size=class_dim,
- act='softmax',
param_attr=fluid.param_attr.ParamAttr(
initializer=fluid.initializer.Uniform(-stdv, stdv)))
return out
diff --git a/fluid/PaddleCV/image_classification/models/mobilenet.py b/fluid/PaddleCV/image_classification/models/mobilenet.py
index f3554734768d5bbec96dac2443b48389d235da91..d0b419e8b4083104ba529c9f886284aa724953e6 100644
--- a/fluid/PaddleCV/image_classification/models/mobilenet.py
+++ b/fluid/PaddleCV/image_classification/models/mobilenet.py
@@ -120,7 +120,6 @@ class MobileNet():
output = fluid.layers.fc(input=input,
size=class_dim,
- act='softmax',
param_attr=ParamAttr(initializer=MSRA()))
return output
diff --git a/fluid/PaddleCV/image_classification/models/mobilenet_v2.py b/fluid/PaddleCV/image_classification/models/mobilenet_v2.py
index 9ec118f2605e5aa4815668add1d1e2b3dae7cf7c..c219b1bf5a7260fbb07627bc3fa039f4b2833092 100644
--- a/fluid/PaddleCV/image_classification/models/mobilenet_v2.py
+++ b/fluid/PaddleCV/image_classification/models/mobilenet_v2.py
@@ -73,7 +73,6 @@ class MobileNetV2():
output = fluid.layers.fc(input=input,
size=class_dim,
- act='softmax',
param_attr=ParamAttr(initializer=MSRA()))
return output
diff --git a/fluid/PaddleCV/image_classification/models/resnet.py b/fluid/PaddleCV/image_classification/models/resnet.py
index 75c7b750541c60821a624f95a2ab56d2890fcb25..def99db6d84673b77582cf93374f4cb2f00e9ac5 100644
--- a/fluid/PaddleCV/image_classification/models/resnet.py
+++ b/fluid/PaddleCV/image_classification/models/resnet.py
@@ -60,7 +60,6 @@ class ResNet():
stdv = 1.0 / math.sqrt(pool.shape[1] * 1.0)
out = fluid.layers.fc(input=pool,
size=class_dim,
- act='softmax',
param_attr=fluid.param_attr.ParamAttr(
initializer=fluid.initializer.Uniform(-stdv,
stdv)))
diff --git a/fluid/PaddleCV/image_classification/models/resnet_dist.py b/fluid/PaddleCV/image_classification/models/resnet_dist.py
index 2dab3e6111d8d02df44233c0440a8ea9ce74faa5..3420d790c25534b4a73ea660b2d880ff899ee62f 100644
--- a/fluid/PaddleCV/image_classification/models/resnet_dist.py
+++ b/fluid/PaddleCV/image_classification/models/resnet_dist.py
@@ -14,8 +14,9 @@ train_parameters = {
"learning_strategy": {
"name": "piecewise_decay",
"batch_size": 256,
- "epochs": [30, 60, 90],
- "steps": [0.1, 0.01, 0.001, 0.0001]
+ "epochs": [30, 60, 80],
+ "steps": [0.1, 0.01, 0.001, 0.0001],
+ "warmup_passes": 5
}
}
@@ -62,7 +63,6 @@ class DistResNet():
stdv = 1.0 / math.sqrt(pool.shape[1] * 1.0)
out = fluid.layers.fc(input=pool,
size=class_dim,
- act='softmax',
param_attr=fluid.param_attr.ParamAttr(
initializer=fluid.initializer.Uniform(-stdv,
stdv),
@@ -119,3 +119,4 @@ class DistResNet():
short = self.shortcut(input, num_filters * 4, stride)
return fluid.layers.elementwise_add(x=short, y=conv2, act='relu')
+
diff --git a/fluid/PaddleCV/image_classification/models/se_resnext.py b/fluid/PaddleCV/image_classification/models/se_resnext.py
index 023046a5db506d0111e246b03810b02c70f3e1b8..ac50bd87b5070000a018949e777a897427c3e5a5 100644
--- a/fluid/PaddleCV/image_classification/models/se_resnext.py
+++ b/fluid/PaddleCV/image_classification/models/se_resnext.py
@@ -110,7 +110,6 @@ class SE_ResNeXt():
stdv = 1.0 / math.sqrt(drop.shape[1] * 1.0)
out = fluid.layers.fc(input=drop,
size=class_dim,
- act='softmax',
param_attr=fluid.param_attr.ParamAttr(
initializer=fluid.initializer.Uniform(-stdv,
stdv)))
diff --git a/fluid/PaddleCV/image_classification/models/shufflenet_v2.py b/fluid/PaddleCV/image_classification/models/shufflenet_v2.py
index bc61508e57fe0e8aec743b4a389c75bac5b3bd44..6db88aa769dd6b3b3e2987fcac6d8054319a2a56 100644
--- a/fluid/PaddleCV/image_classification/models/shufflenet_v2.py
+++ b/fluid/PaddleCV/image_classification/models/shufflenet_v2.py
@@ -93,7 +93,6 @@ class ShuffleNetV2():
output = fluid.layers.fc(input=pool_last,
size=class_dim,
- act='softmax',
param_attr=ParamAttr(initializer=MSRA()))
return output
diff --git a/fluid/PaddleCV/image_classification/models/vgg.py b/fluid/PaddleCV/image_classification/models/vgg.py
index 1af664fdb554f05ba8a556abbba36cd1c3141a40..7f559982334575c7c2bc778e1be8a4ebf69549fc 100644
--- a/fluid/PaddleCV/image_classification/models/vgg.py
+++ b/fluid/PaddleCV/image_classification/models/vgg.py
@@ -64,7 +64,6 @@ class VGGNet():
out = fluid.layers.fc(
input=fc2,
size=class_dim,
- act='softmax',
param_attr=fluid.param_attr.ParamAttr(
initializer=fluid.initializer.Normal(scale=0.005)),
bias_attr=fluid.param_attr.ParamAttr(
diff --git a/fluid/PaddleCV/image_classification/models_name/alexnet.py b/fluid/PaddleCV/image_classification/models_name/alexnet.py
index 3dfaa25ff790b3952acb304118e6489bb2b58844..f063c4d6deb88905aaa5f8a5eba59903f58293e8 100644
--- a/fluid/PaddleCV/image_classification/models_name/alexnet.py
+++ b/fluid/PaddleCV/image_classification/models_name/alexnet.py
@@ -159,7 +159,6 @@ class AlexNet():
out = fluid.layers.fc(
input=fc7,
size=class_dim,
- act='softmax',
bias_attr=fluid.param_attr.ParamAttr(
initializer=fluid.initializer.Uniform(-stdv, stdv),
name=layer_name[7] + "_offset"),
diff --git a/fluid/PaddleCV/image_classification/models_name/dpn.py b/fluid/PaddleCV/image_classification/models_name/dpn.py
index a39d907205cedf86848df9c950202c640fdd9dcb..7f759b3bb6bfa9c866e129ac93ab2c6a9cf4168c 100644
--- a/fluid/PaddleCV/image_classification/models_name/dpn.py
+++ b/fluid/PaddleCV/image_classification/models_name/dpn.py
@@ -122,7 +122,6 @@ class DPN(object):
initializer=fluid.initializer.Uniform(-stdv, stdv))
fc6 = fluid.layers.fc(input=pool5,
size=class_dim,
- act='softmax',
param_attr=param_attr,
name="fc6")
diff --git a/fluid/PaddleCV/image_classification/models_name/inception_v4.py b/fluid/PaddleCV/image_classification/models_name/inception_v4.py
index 7b857d456bad99b98d2520653050659ef6c58983..8c6c0dbb129f903b4f0b849f930a520b5f17e5db 100644
--- a/fluid/PaddleCV/image_classification/models_name/inception_v4.py
+++ b/fluid/PaddleCV/image_classification/models_name/inception_v4.py
@@ -48,7 +48,6 @@ class InceptionV4():
out = fluid.layers.fc(
input=drop,
size=class_dim,
- act='softmax',
param_attr=ParamAttr(
initializer=fluid.initializer.Uniform(-stdv, stdv),
name="final_fc_weights"),
diff --git a/fluid/PaddleCV/image_classification/models_name/mobilenet.py b/fluid/PaddleCV/image_classification/models_name/mobilenet.py
index 2ac6b46abdad13ad5e5507605cd7e1ce9861f620..d242bc946a7b4bec9c9d2e34da2496c0901ba870 100644
--- a/fluid/PaddleCV/image_classification/models_name/mobilenet.py
+++ b/fluid/PaddleCV/image_classification/models_name/mobilenet.py
@@ -130,7 +130,6 @@ class MobileNet():
output = fluid.layers.fc(input=input,
size=class_dim,
- act='softmax',
param_attr=ParamAttr(
initializer=MSRA(), name="fc7_weights"),
bias_attr=ParamAttr(name="fc7_offset"))
diff --git a/fluid/PaddleCV/image_classification/models_name/mobilenet_v2.py b/fluid/PaddleCV/image_classification/models_name/mobilenet_v2.py
index 442bd67f212d7a3b6c1a0b900ba1489250d670af..77d88c7da625c0c953c75d229148868f0481f2a2 100644
--- a/fluid/PaddleCV/image_classification/models_name/mobilenet_v2.py
+++ b/fluid/PaddleCV/image_classification/models_name/mobilenet_v2.py
@@ -80,7 +80,6 @@ class MobileNetV2():
output = fluid.layers.fc(input=input,
size=class_dim,
- act='softmax',
param_attr=ParamAttr(name='fc10_weights'),
bias_attr=ParamAttr(name='fc10_offset'))
return output
diff --git a/fluid/PaddleCV/image_classification/models_name/resnet.py b/fluid/PaddleCV/image_classification/models_name/resnet.py
index 095bd155b2c2b329e4ce89e94ae410c104a40fd1..19fa4ff2c4c30d0fa11b592c21f3db5e51663159 100644
--- a/fluid/PaddleCV/image_classification/models_name/resnet.py
+++ b/fluid/PaddleCV/image_classification/models_name/resnet.py
@@ -74,7 +74,6 @@ class ResNet():
stdv = 1.0 / math.sqrt(pool.shape[1] * 1.0)
out = fluid.layers.fc(input=pool,
size=class_dim,
- act='softmax',
param_attr=fluid.param_attr.ParamAttr(
initializer=fluid.initializer.Uniform(-stdv,
stdv)))
diff --git a/fluid/PaddleCV/image_classification/models_name/se_resnext.py b/fluid/PaddleCV/image_classification/models_name/se_resnext.py
index e083c524f8cefaa7d2483725994ebabae91101e4..0ae3d66fddbe2d1b9da5e2f52fe80d15931d256d 100644
--- a/fluid/PaddleCV/image_classification/models_name/se_resnext.py
+++ b/fluid/PaddleCV/image_classification/models_name/se_resnext.py
@@ -123,7 +123,6 @@ class SE_ResNeXt():
out = fluid.layers.fc(
input=drop,
size=class_dim,
- act='softmax',
param_attr=ParamAttr(
initializer=fluid.initializer.Uniform(-stdv, stdv),
name='fc6_weights'),
diff --git a/fluid/PaddleCV/image_classification/models_name/shufflenet_v2.py b/fluid/PaddleCV/image_classification/models_name/shufflenet_v2.py
index 1cd4767ea40b31dd57f4ad3142f2a4533eba1aa9..595debf2199be9609100cb686aad65eb9cb55416 100644
--- a/fluid/PaddleCV/image_classification/models_name/shufflenet_v2.py
+++ b/fluid/PaddleCV/image_classification/models_name/shufflenet_v2.py
@@ -97,7 +97,6 @@ class ShuffleNetV2():
output = fluid.layers.fc(input=pool_last,
size=class_dim,
- act='softmax',
param_attr=ParamAttr(
initializer=MSRA(), name='fc6_weights'),
bias_attr=ParamAttr(name='fc6_offset'))
diff --git a/fluid/PaddleCV/image_classification/models_name/vgg.py b/fluid/PaddleCV/image_classification/models_name/vgg.py
index ac26791a6a3ec2aba8fe35ef46bbfd05a30c4777..8fcd2d9f1c397a428685cfb7bd264f18c0d0a7e7 100644
--- a/fluid/PaddleCV/image_classification/models_name/vgg.py
+++ b/fluid/PaddleCV/image_classification/models_name/vgg.py
@@ -61,7 +61,6 @@ class VGGNet():
out = fluid.layers.fc(
input=fc2,
size=class_dim,
- act='softmax',
param_attr=fluid.param_attr.ParamAttr(name=fc_name[2] + "_weights"),
bias_attr=fluid.param_attr.ParamAttr(name=fc_name[2] + "_offset"))
diff --git a/fluid/PaddleCV/image_classification/reader.py b/fluid/PaddleCV/image_classification/reader.py
index 316b956a0788e593f63e4cf7592c16eec1b1aba8..d9559df09ba34f3a6512f1c4628d454cd33c9ee2 100644
--- a/fluid/PaddleCV/image_classification/reader.py
+++ b/fluid/PaddleCV/image_classification/reader.py
@@ -130,16 +130,19 @@ def _reader_creator(file_list,
shuffle=False,
color_jitter=False,
rotate=False,
- data_dir=DATA_DIR):
+ data_dir=DATA_DIR,
+ pass_id_as_seed=0):
def reader():
with open(file_list) as flist:
full_lines = [line.strip() for line in flist]
if shuffle:
+ if pass_id_as_seed:
+ np.random.seed(pass_id_as_seed)
np.random.shuffle(full_lines)
if mode == 'train' and os.getenv('PADDLE_TRAINING_ROLE'):
            # distributed mode if the env var `PADDLE_TRAINING_ROLE` exists
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
- trainer_count = int(os.getenv("PADDLE_TRAINERS", "1"))
+ trainer_count = int(os.getenv("PADDLE_TRAINERS_NUM", "1"))
per_node_lines = len(full_lines) // trainer_count
lines = full_lines[trainer_id * per_node_lines:(trainer_id + 1)
* per_node_lines]
@@ -166,7 +169,7 @@ def _reader_creator(file_list,
return paddle.reader.xmap_readers(mapper, reader, THREAD, BUF_SIZE)
-def train(data_dir=DATA_DIR):
+def train(data_dir=DATA_DIR, pass_id_as_seed=0):
file_list = os.path.join(data_dir, 'train_list.txt')
return _reader_creator(
file_list,
@@ -174,7 +177,8 @@ def train(data_dir=DATA_DIR):
shuffle=True,
color_jitter=False,
rotate=False,
- data_dir=data_dir)
+ data_dir=data_dir,
+ pass_id_as_seed=pass_id_as_seed)
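+
+# NOTE: seeding the shuffle with the pass id keeps the global order identical
+# on every trainer, so the per-trainer slices taken in _reader_creator form a
+# disjoint partition of the dataset. dist_train.py passes pass_id + 1 because
+# a seed of 0 is treated as "no seeding" above.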
def val(data_dir=DATA_DIR):
diff --git a/fluid/PaddleCV/image_classification/reader_cv2.py b/fluid/PaddleCV/image_classification/reader_cv2.py
index 2b3291c6c590529aa0e889bb8f7a9866cfa81f08..dd9462c5c7076d57f7e137153a93534152cbb64b 100644
--- a/fluid/PaddleCV/image_classification/reader_cv2.py
+++ b/fluid/PaddleCV/image_classification/reader_cv2.py
@@ -101,8 +101,6 @@ def process_image(sample,
std = [0.229, 0.224, 0.225] if std is None else std
img_path = sample[0]
- print('&' * 80)
- print(img_path)
img = cv2.imread(img_path)
if mode == 'train':
diff --git a/fluid/PaddleCV/image_classification/run.sh b/fluid/PaddleCV/image_classification/run.sh
index da50b49df7796ff28f995c33c9a39b42645c9a98..fbdacb87633a70b60ecdedf9a6f74e7287d2b2d0 100644
--- a/fluid/PaddleCV/image_classification/run.sh
+++ b/fluid/PaddleCV/image_classification/run.sh
@@ -78,3 +78,58 @@ python train.py \
# --num_epochs=120 \
# --lr=0.1
+#ResNet152:
+#python train.py \
+# --model=ResNet152 \
+# --batch_size=256 \
+# --total_images=1281167 \
+# --image_shape=3,224,224 \
+# --lr_strategy=piecewise_decay \
+# --lr=0.1 \
+# --num_epochs=120 \
+# --l2_decay=1e-4 \(TODO)
+
+
+#SE_ResNeXt50:
+#python train.py \
+# --model=SE_ResNeXt50 \
+# --batch_size=400 \
+# --total_images=1281167 \
+# --image_shape=3,224,224 \
+# --lr_strategy=cosine_decay \
+# --lr=0.1 \
+# --num_epochs=200 \
+# --l2_decay=12e-5 \(TODO)
+
+#SE_ResNeXt101:
+#python train.py \
+# --model=SE_ResNeXt101 \
+# --batch_size=400 \
+# --total_images=1281167 \
+# --image_shape=3,224,224 \
+# --lr_strategy=cosine_decay \
+# --lr=0.1 \
+# --num_epochs=200 \
+# --l2_decay=15e-5 \(TODO)
+
+#VGG11:
+#python train.py \
+# --model=VGG11 \
+# --batch_size=512 \
+# --total_images=1281167 \
+# --image_shape=3,224,224 \
+# --lr_strategy=cosine_decay \
+# --lr=0.1 \
+# --num_epochs=90 \
+# --l2_decay=2e-4 \(TODO)
+
+#VGG13:
+#python train.py \
+# --model=VGG13 \
+# --batch_size=256 \
+# --total_images=1281167 \
+# --image_shape=3,224,224 \
+# --lr_strategy=cosine_decay \
+# --lr=0.01 \
+# --num_epochs=90 \
+# --l2_decay=3e-4 \(TODO)
diff --git a/fluid/PaddleCV/image_classification/train.py b/fluid/PaddleCV/image_classification/train.py
index e658a3114c91b719c7824503e89e34b9200160f3..6830773b91f2fa07c2b6f530a6370cedde82ffd7 100644
--- a/fluid/PaddleCV/image_classification/train.py
+++ b/fluid/PaddleCV/image_classification/train.py
@@ -17,6 +17,7 @@ import functools
import subprocess
import utils
from utils.learning_rate import cosine_decay
+from utils.fp16_utils import create_master_params_grads, master_param_to_train_param
from utility import add_arguments, print_arguments
import models
import models_name
@@ -40,7 +41,9 @@ add_arg('model', str, "SE_ResNeXt50_32x4d", "Set the network to use
add_arg('enable_ce', bool, False, "If set True, enable continuous evaluation job.")
add_arg('data_dir', str, "./data/ILSVRC2012", "The ImageNet dataset root dir.")
add_arg('model_category', str, "models", "Whether to use models_name or not, valid value:'models','models_name'" )
-# yapf: enabl
+add_arg('fp16', bool, False, "Enable half precision training with fp16." )
+add_arg('scale_loss', float, 1.0, "Scale loss for fp16." )
+# yapf: enable
def set_models(model):
@@ -145,12 +148,15 @@ def net_config(image, label, model, args):
acc_top1 = fluid.layers.accuracy(input=out0, label=label, k=1)
acc_top5 = fluid.layers.accuracy(input=out0, label=label, k=5)
else:
- out = model.net(input=image, class_dim=class_dim)
- cost = fluid.layers.cross_entropy(input=out, label=label)
+ out = model.net(input=image, class_dim=class_dim)
+ cost, pred = fluid.layers.softmax_with_cross_entropy(out, label, return_softmax=True)
+ if args.scale_loss > 1:
+ avg_cost = fluid.layers.mean(x=cost) * float(args.scale_loss)
+ else:
+ avg_cost = fluid.layers.mean(x=cost)
- avg_cost = fluid.layers.mean(x=cost)
- acc_top1 = fluid.layers.accuracy(input=out, label=label, k=1)
- acc_top5 = fluid.layers.accuracy(input=out, label=label, k=5)
+ acc_top1 = fluid.layers.accuracy(input=pred, label=label, k=1)
+ acc_top5 = fluid.layers.accuracy(input=pred, label=label, k=5)
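+        # NOTE: the models' final fc layers no longer apply act='softmax'
+        # (see the accompanying model diffs); softmax_with_cross_entropy
+        # fuses softmax with the loss for numerical stability, and
+        # return_softmax=True recovers the probabilities consumed by the
+        # accuracy layers above.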
return avg_cost, acc_top1, acc_top5
@@ -171,6 +177,8 @@ def build_program(is_train, main_prog, startup_prog, args):
use_double_buffer=True)
with fluid.unique_name.guard():
image, label = fluid.layers.read_file(py_reader)
+ if args.fp16:
+ image = fluid.layers.cast(image, "float16")
avg_cost, acc_top1, acc_top5 = net_config(image, label, model, args)
avg_cost.persistable = True
acc_top1.persistable = True
@@ -184,7 +192,15 @@ def build_program(is_train, main_prog, startup_prog, args):
params["learning_strategy"]["name"] = args.lr_strategy
optimizer = optimizer_setting(params)
- optimizer.minimize(avg_cost)
+
+ if args.fp16:
+ params_grads = optimizer.backward(avg_cost)
+ master_params_grads = create_master_params_grads(
+ params_grads, main_prog, startup_prog, args.scale_loss)
+ optimizer.apply_gradients(master_params_grads)
+ master_param_to_train_param(master_params_grads, params_grads, main_prog)
+ else:
+ optimizer.minimize(avg_cost)
return py_reader, avg_cost, acc_top1, acc_top5
@@ -200,7 +216,6 @@ def train(args):
startup_prog = fluid.Program()
train_prog = fluid.Program()
test_prog = fluid.Program()
-
if args.enable_ce:
startup_prog.random_seed = 1000
train_prog.random_seed = 1000
@@ -240,10 +255,10 @@ def train(args):
if visible_device:
device_num = len(visible_device.split(','))
else:
- device_num = subprocess.check_output(['nvidia-smi', '-L']).count('\n')
+ device_num = subprocess.check_output(['nvidia-smi', '-L']).decode().count('\n')
train_batch_size = args.batch_size / device_num
- test_batch_size = 8
+ test_batch_size = 16
if not args.enable_ce:
train_reader = paddle.batch(
reader.train(), batch_size=train_batch_size, drop_last=True)
@@ -307,7 +322,7 @@ def train(args):
train_loss = np.array(train_info[0]).mean()
train_acc1 = np.array(train_info[1]).mean()
train_acc5 = np.array(train_info[2]).mean()
- train_speed = np.array(train_time).mean() / train_batch_size
+ train_speed = np.array(train_time).mean() / (train_batch_size * device_num)
test_py_reader.start()
diff --git a/fluid/PaddleCV/image_classification/utils/__init__.py b/fluid/PaddleCV/image_classification/utils/__init__.py
index f59e4baf93aa095f393441d2cd766ff8d3b28801..4751caceeb14f0dddc937d90b4c953a870ffc3f8 100644
--- a/fluid/PaddleCV/image_classification/utils/__init__.py
+++ b/fluid/PaddleCV/image_classification/utils/__init__.py
@@ -1 +1,2 @@
from .learning_rate import cosine_decay, lr_warmup
+from .fp16_utils import create_master_params_grads, master_param_to_train_param
diff --git a/fluid/PaddleCV/image_classification/utils/fp16_utils.py b/fluid/PaddleCV/image_classification/utils/fp16_utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..cf6e081e7f9b433e4097f190b012c533064f5cca
--- /dev/null
+++ b/fluid/PaddleCV/image_classification/utils/fp16_utils.py
@@ -0,0 +1,78 @@
+from __future__ import print_function
+import paddle
+import paddle.fluid as fluid
+
+def cast_fp16_to_fp32(i, o, prog):
+ prog.global_block().append_op(
+ type="cast",
+ inputs={"X": i},
+ outputs={"Out": o},
+ attrs={
+ "in_dtype": fluid.core.VarDesc.VarType.FP16,
+ "out_dtype": fluid.core.VarDesc.VarType.FP32
+ }
+ )
+
+def cast_fp32_to_fp16(i, o, prog):
+ prog.global_block().append_op(
+ type="cast",
+ inputs={"X": i},
+ outputs={"Out": o},
+ attrs={
+ "in_dtype": fluid.core.VarDesc.VarType.FP32,
+ "out_dtype": fluid.core.VarDesc.VarType.FP16
+ }
+ )
+
+def copy_to_master_param(p, block):
+ v = block.vars.get(p.name, None)
+ if v is None:
+ raise ValueError("no param name %s found!" % p.name)
+ new_p = fluid.framework.Parameter(
+ block=block,
+ shape=v.shape,
+ dtype=fluid.core.VarDesc.VarType.FP32,
+ type=v.type,
+ lod_level=v.lod_level,
+ stop_gradient=p.stop_gradient,
+ trainable=p.trainable,
+ optimize_attr=p.optimize_attr,
+ regularizer=p.regularizer,
+ gradient_clip_attr=p.gradient_clip_attr,
+ error_clip=p.error_clip,
+ name=v.name + ".master")
+ return new_p
+
+def create_master_params_grads(params_grads, main_prog, startup_prog, scale_loss):
+ master_params_grads = []
+ tmp_role = main_prog._current_role
+ OpRole = fluid.core.op_proto_and_checker_maker.OpRole
+ main_prog._current_role = OpRole.Backward
+ for p, g in params_grads:
+ # create master parameters
+ master_param = copy_to_master_param(p, main_prog.global_block())
+ startup_master_param = startup_prog.global_block()._clone_variable(master_param)
+ startup_p = startup_prog.global_block().var(p.name)
+ cast_fp16_to_fp32(startup_p, startup_master_param, startup_prog)
+ # cast FP16 gradients to FP32 before applying them; batch_norm
+ # parameters are already FP32, so they only need the loss un-scaling
+ if g.name.startswith("batch_norm"):
+ if scale_loss > 1:
+ scaled_g = g / float(scale_loss)
+ else:
+ scaled_g = g
+ master_params_grads.append([p, scaled_g])
+ continue
+ master_grad = fluid.layers.cast(g, "float32")
+ if scale_loss > 1:
+ master_grad = master_grad / float(scale_loss)
+ master_params_grads.append([master_param, master_grad])
+ main_prog._current_role = tmp_role
+ return master_params_grads
+
+def master_param_to_train_param(master_params_grads, params_grads, main_prog):
+ for idx, m_p_g in enumerate(master_params_grads):
+ train_p, _ = params_grads[idx]
+ if train_p.name.startswith("batch_norm"):
+ continue
+ with main_prog._optimized_guard([m_p_g[0], m_p_g[1]]):
+ cast_fp32_to_fp16(m_p_g[0], train_p, main_prog)
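The two helpers above implement the usual FP32 master-weights recipe for FP16 training: scale the loss so small gradients survive FP16, cast the gradients to FP32 (un-scaling on the way), update FP32 master copies, and cast the updated weights back into the FP16 parameters. A minimal sketch of the intended wiring, assuming `avg_cost`, `main_prog` and `startup_prog` come from the surrounding training script; the optimizer choice is illustrative and the `fp16` branch mirrors the `build_program` change earlier in this diff:

```python
import paddle.fluid as fluid
from utils import create_master_params_grads, master_param_to_train_param

def add_optimizer(avg_cost, main_prog, startup_prog, fp16=False, scale_loss=1.0):
    optimizer = fluid.optimizer.Momentum(learning_rate=0.1, momentum=0.9)
    if fp16:
        # scale the loss so small FP16 gradients do not underflow
        scaled_cost = avg_cost * scale_loss if scale_loss > 1 else avg_cost
        params_grads = optimizer.backward(scaled_cost)
        # build FP32 master copies; gradients are cast (and un-scaled) to FP32
        master_params_grads = create_master_params_grads(
            params_grads, main_prog, startup_prog, scale_loss)
        optimizer.apply_gradients(master_params_grads)
        # write the updated FP32 master weights back into the FP16 params
        master_param_to_train_param(master_params_grads, params_grads, main_prog)
    else:
        optimizer.minimize(avg_cost)
```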
diff --git a/fluid/PaddleCV/metric_learning/README.md b/fluid/PaddleCV/metric_learning/README.md
index c961bf4842727dab8e808ef016d10efa219bf2f6..71ecb5cf1cc10506abf2ce8c225a57b630b56d1a 100644
--- a/fluid/PaddleCV/metric_learning/README.md
+++ b/fluid/PaddleCV/metric_learning/README.md
@@ -1,15 +1,15 @@
# Deep Metric Learning
-Metric learning is a kind of methods to learn discriminative features for each sample, with the purpose that intra-class samples have smaller distances while inter-class samples have larger distances in the learned space. With the develop of deep learning technique, metric learning methods are combined with deep neural networks to boost the performance of traditional tasks, such as face recognition/verification, human re-identification, image retrieval and so on. In this page, we introduce the way to implement deep metric learning using PaddlePaddle Fluid, including [data preparation](#data-preparation), [training](#training-a-model), [finetuning](#finetuning), [evaluation](#evaluation) and [inference](#inference).
+Metric learning is a family of methods that learn discriminative features for each sample, so that intra-class samples have smaller distances and inter-class samples have larger distances in the learned space. With the development of deep learning techniques, metric learning methods have been combined with deep neural networks to boost the performance of traditional tasks such as face recognition/verification, person re-identification and image retrieval. In this page, we introduce how to implement deep metric learning using PaddlePaddle Fluid, including [data preparation](#data-preparation), [training](#training-metric-learning-models), [finetuning](#finetuning), [evaluation](#evaluation), [inference](#inference) and [Performances](#performances).
---
## Table of Contents
- [Installation](#installation)
- [Data preparation](#data-preparation)
-- [Training metric learning models](#training-a-model)
+- [Training metric learning models](#training-metric-learning-models)
- [Finetuning](#finetuning)
- [Evaluation](#evaluation)
- [Inference](#inference)
-- [Performances](#supported-models)
+- [Performances](#performances)
## Installation
@@ -17,7 +17,7 @@ Running sample code in this directory requires PaddelPaddle Fluid v0.14.0 and la
## Data preparation
-Stanford Online Product(SOP) dataset contains 120,053 images of 22,634 products downloaded from eBay.com. We use it to conduct the metric learning experiments. For training, 59,5511 out of 11,318 classes are used, and 11,316 classes(60,502 images) are held out for testing. First of all, preparation of SOP data can be done as:
+The Stanford Online Products (SOP) dataset contains 120,053 images of 22,634 products downloaded from eBay.com. We use it to conduct the metric learning experiments. For training, 59,551 images in 11,318 classes are used, and 11,316 classes (60,502 images) are held out for testing. First of all, the SOP data can be prepared as follows:
```
cd data/
sh download_sop.sh
@@ -25,7 +25,7 @@ sh download_sop.sh
## Training metric learning models
-To train a metric learning model, one need to set the neural network as backbone and the metric loss function to optimize. We train meiric learning model using softmax or [arcmargin](https://arxiv.org/abs/1801.07698) loss firstly, and then fine-turned the model using other metric learning loss, such as triplet, [quadruplet](https://arxiv.org/abs/1710.00478) and [eml](https://arxiv.org/abs/1212.6094) loss. One example of training using arcmargin loss is shown below:
+To train a metric learning model, one needs to choose a neural network as the backbone and a metric loss function to optimize. We first train the model using softmax or arcmargin loss, and then finetune it using other metric learning losses, such as triplet, quadruplet and eml loss. One example of training using arcmargin loss is shown below:
```
@@ -52,7 +52,7 @@ python train_elem.py \
* **use_gpu**: whether to use GPU or not. Default: True.
* **pretrained_model**: model path for pretraining. Default: None.
* **model_save_dir**: the directory to save trained model. Default: "output".
-* **loss_name**: loss fortraining model. Default: "softmax".
+* **loss_name**: loss for training the model. Default: "softmax".
* **arc_scale**: parameter of arcmargin loss. Default: 80.0.
* **arc_margin**: parameter of arcmargin loss. Default: 0.15.
* **arc_easy_margin**: parameter of arcmargin loss. Default: False.
@@ -103,3 +103,9 @@ For comparation, many metric learning models with different neural networks and
|fine-tuned with triplet | 78.37% | 79.21%
|fine-tuned with quadruplet | 78.10% | 79.59%
|fine-tuned with eml | 79.32% | 80.11%
+
+## Reference
+
+- ArcFace: Additive Angular Margin Loss for Deep Face Recognition [link](https://arxiv.org/abs/1801.07698)
+- Margin Sample Mining Loss: A Deep Learning Based Method for Person Re-identification [link](https://arxiv.org/abs/1710.00478)
+- Large Scale Strongly Supervised Ensemble Metric Learning, with Applications to Face Verification and Retrieval [link](https://arxiv.org/abs/1212.6094)
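For intuition about what `arc_scale` and `arc_margin` control, the sketch below shows the ArcFace-style logit computation in plain NumPy; the function name and shapes are illustrative, not this repository's implementation. The margin is added to the angle of the ground-truth class before everything is rescaled:

```python
import numpy as np

def arcmargin_logits(embeddings, weights, labels, scale=80.0, margin=0.15):
    # embeddings: (N, D) features, weights: (D, C) class weights,
    # labels: (N,) integer ids; scale/margin map to --arc_scale/--arc_margin
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos_theta = np.clip(x.dot(w), -1.0, 1.0)  # cosine similarity, shape (N, C)
    theta = np.arccos(cos_theta)
    one_hot = np.eye(w.shape[1])[labels]
    # add the angular margin only on the ground-truth class, then rescale
    return scale * np.where(one_hot > 0, np.cos(theta + margin), cos_theta)
```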
diff --git a/fluid/PaddleCV/metric_learning/README_cn.md b/fluid/PaddleCV/metric_learning/README_cn.md
new file mode 100644
index 0000000000000000000000000000000000000000..c155200c64e8e21549a6d642f89d95fdcb0acd11
--- /dev/null
+++ b/fluid/PaddleCV/metric_learning/README_cn.md
@@ -0,0 +1,111 @@
+# Deep Metric Learning
+Metric learning learns discriminative features for samples so that, in the learned feature space, samples of the same class are close together while samples of different classes are far apart. With the development of deep learning, metric learning methods based on deep neural networks have greatly improved performance on many vision tasks, such as face recognition, face verification, person re-identification and image retrieval. This section introduces the metric learning methods implemented in PaddlePaddle Fluid and how to use them, covering [data preparation](#data-preparation), [training](#training), [finetuning](#finetuning), [evaluation](#evaluation) and [inference](#inference).
+
+---
+## Table of Contents
+- [Installation](#installation)
+- [Data preparation](#data-preparation)
+- [Training](#training)
+- [Finetuning](#finetuning)
+- [Evaluation](#evaluation)
+- [Inference](#inference)
+- [Performances](#performances)
+
+## Installation
+
+Running the code in this section requires PaddlePaddle Fluid v0.14.0 or later. If the PaddlePaddle version on your device is lower than v0.14.0, please follow the [installation guide](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html) to install or update it.
+
+## Data preparation
+
+The Stanford Online Products (SOP) dataset, downloaded from eBay, contains 120,053 product images in 22,634 classes. We use this dataset for the experiments. For training, 59,551 images in 11,318 classes are used; for testing, 60,502 images in 11,316 classes are held out. First, the SOP dataset can be downloaded with the following script:
+```
+cd data/
+sh download_sop.sh
+```
+
+## Training
+
+To train a metric learning model, we need a neural network as the backbone (e.g. ResNet50) and a metric loss function to optimize. We first train with softmax or arcmargin loss, and then finetune the model with another loss such as triplet, quadruplet or eml. Below is an example of training with arcmargin loss:
+
+
+```
+python train_elem.py \
+ --model=ResNet50 \
+ --train_batch_size=256 \
+ --test_batch_size=50 \
+ --lr=0.01 \
+ --total_iter_num=30000 \
+ --use_gpu=True \
+ --pretrained_model=${path_to_pretrain_imagenet_model} \
+ --model_save_dir=${output_model_path} \
+ --loss_name=arcmargin \
+ --arc_scale=80.0 \
+ --arc_margin=0.15 \
+ --arc_easy_margin=False
+```
+**Arguments:**
+* **model**: name of the model to use. Default: "ResNet50".
+* **train_batch_size**: mini-batch size for training. Default: 256.
+* **test_batch_size**: mini-batch size for testing. Default: 50.
+* **lr**: initial learning rate. Default: 0.01.
+* **total_iter_num**: total number of training iterations. Default: 30000.
+* **use_gpu**: whether to use GPU or not. Default: True.
+* **pretrained_model**: path of the pretrained model. Default: None.
+* **model_save_dir**: directory to save the trained model. Default: "output".
+* **loss_name**: loss function to optimize. Default: "softmax".
+* **arc_scale**: parameter of arcmargin loss. Default: 80.0.
+* **arc_margin**: parameter of arcmargin loss. Default: 0.15.
+* **arc_easy_margin**: parameter of arcmargin loss. Default: False.
+
+## Finetuning
+
+Finetuning loads an existing model and continues training it on a given task. After training the network with softmax or arcmargin loss, you can continue finetuning it with triplet, quadruplet or eml loss. Below is an example of finetuning with eml loss:
+
+```
+python train_pair.py \
+ --model=ResNet50 \
+ --train_batch_size=160 \
+ --test_batch_size=50 \
+ --lr=0.0001 \
+ --total_iter_num=100000 \
+ --use_gpu=True \
+ --pretrained_model=${path_to_pretrain_arcmargin_model} \
+ --model_save_dir=${output_model_path} \
+ --loss_name=eml \
+ --samples_each_class=2
+```
+
+## Evaluation
+Evaluation measures the retrieval performance of the trained model. Set ```path_to_pretrain_model``` accordingly; Recall@Rank-1 can then be computed with the command below.
+```
+python eval.py \
+ --model=ResNet50 \
+ --batch_size=50 \
+ --pretrained_model=${path_to_pretrain_model}
+```
+
+## Inference
+Inference extracts features for image data with the trained network. Below is an example:
+```
+python infer.py \
+ --model=ResNet50 \
+ --batch_size=1 \
+ --pretrained_model=${path_to_pretrain_model}
+```
+
+## Performances
+
+The table below lists the retrieval performance (Recall@Rank-1) of several metric learning losses on the SOP dataset.
+
+|pretrain model | softmax | arcmargin
+|- | - | -:
+|without finetuning | 77.42% | 78.11%
+|fine-tuned with triplet | 78.37% | 79.21%
+|fine-tuned with quadruplet | 78.10% | 79.59%
+|fine-tuned with eml | 79.32% | 80.11%
+
+## Reference
+
+- ArcFace: Additive Angular Margin Loss for Deep Face Recognition [link](https://arxiv.org/abs/1801.07698)
+- Margin Sample Mining Loss: A Deep Learning Based Method for Person Re-identification [link](https://arxiv.org/abs/1710.00478)
+- Large Scale Strongly Supervised Ensemble Metric Learning, with Applications to Face Verification and Retrieval [link](https://arxiv.org/abs/1212.6094)
diff --git a/fluid/PaddleCV/metric_learning/reader.py b/fluid/PaddleCV/metric_learning/reader.py
index 9c5aaf396d7b535bc4729a31834e2ef0f8151a28..ac8f257ecbadbb08454ffc616b1c587455cc92b6 100644
--- a/fluid/PaddleCV/metric_learning/reader.py
+++ b/fluid/PaddleCV/metric_learning/reader.py
@@ -63,6 +63,7 @@ def common_iterator(data, settings):
assert (batch_size % samples_each_class == 0)
class_num = batch_size // samples_each_class
def train_iterator():
+ count = 0
labs = list(data.keys())
lab_num = len(labs)
ind = list(range(0, lab_num))
@@ -79,6 +80,9 @@ def common_iterator(data, settings):
for anchor_ind_i in anchor_ind:
anchor_path = DATA_DIR + data_list[anchor_ind_i]
yield anchor_path, lab
+ count += 1
+ if count >= settings.total_iter_num + 1:
+ return
return train_iterator
@@ -86,6 +90,8 @@ def triplet_iterator(data, settings):
batch_size = settings.train_batch_size
assert (batch_size % 3 == 0)
def train_iterator():
+ total_count = settings.train_batch_size * (settings.total_iter_num + 1)
+ count = 0
labs = list(data.keys())
lab_num = len(labs)
ind = list(range(0, lab_num))
@@ -108,16 +114,24 @@ def triplet_iterator(data, settings):
yield pos_path, lab_pos
neg_path = DATA_DIR + neg_data_list[neg_ind]
yield neg_path, lab_neg
+ count += 3
+ if count >= total_count:
+ return
return train_iterator
def arcmargin_iterator(data, settings):
def train_iterator():
+ total_count = settings.train_batch_size * (settings.total_iter_num + 1)
+ count = 0
while True:
for items in data:
path, label = items
path = DATA_DIR + path
yield path, label
+ count += 1
+ if count >= total_count:
+ return
return train_iterator
def image_iterator(data, mode):
diff --git a/fluid/PaddleCV/object_detection/README.md b/fluid/PaddleCV/object_detection/README.md
index 651016cdffa7fe6c4fa1dc5e886b9b18e8e40b04..2466ba96577c7cb1e2bb335a0b8b5c74edbb92fd 100644
--- a/fluid/PaddleCV/object_detection/README.md
+++ b/fluid/PaddleCV/object_detection/README.md
@@ -21,9 +21,7 @@ SSD is readily pluggable into a wide variant standard convolutional network, suc
### Data Preparation
-You can use [PASCAL VOC dataset](http://host.robots.ox.ac.uk/pascal/VOC/) or [MS-COCO dataset](http://cocodataset.org/#download).
-
-If you want to train a model on PASCAL VOC dataset, please download dataset at first, skip this step if you already have one.
+Please download the [PASCAL VOC dataset](http://host.robots.ox.ac.uk/pascal/VOC/) first; skip this step if you already have it.
```bash
cd data/pascalvoc
@@ -32,30 +30,18 @@ cd data/pascalvoc
The `download.sh` command will also create the training and testing file lists.
-If you want to train a model on MS-COCO dataset, please download dataset at first, skip this step if you already have one.
-
-```
-cd data/coco
-./download.sh
-```
-
### Train
#### Download the Pre-trained Model.
-We provide two pre-trained models. The one is MobileNet-v1 SSD trained on COCO dataset, but removed the convolutional predictors for COCO dataset. This model can be used to initialize the models when training other datasets, like PASCAL VOC. The other pre-trained model is MobileNet-v1 trained on ImageNet 2012 dataset but removed the last weights and bias in the Fully-Connected layer.
-
-Declaration: the MobileNet-v1 SSD model is converted by [TensorFlow model](https://github.com/tensorflow/models/blob/f87a58cd96d45de73c9a8330a06b2ab56749a7fa/research/object_detection/g3doc/detection_model_zoo.md). The MobileNet-v1 model is converted from [Caffe](https://github.com/shicai/MobileNet-Caffe).
-We will release the pre-trained models by ourself in the upcoming soon.
+We provide two pre-trained models. One is a MobileNet-v1 SSD trained on the COCO dataset, with the COCO-specific convolutional predictors removed; it can be used to initialize models for training on other datasets, such as PASCAL VOC. The other is a MobileNet-v1 trained on the ImageNet 2012 dataset, with the weights and bias of the last fully-connected layer removed. Download MobileNet-v1 SSD:
- - Download MobileNet-v1 SSD:
```bash
./pretrained/download_coco.sh
```
- - Download MobileNet-v1:
- ```bash
- ./pretrained/download_imagenet.sh
- ```
+
+Declaration: the MobileNet-v1 SSD model is converted from the [TensorFlow model](https://github.com/tensorflow/models/blob/f87a58cd96d45de73c9a8330a06b2ab56749a7fa/research/object_detection/g3doc/detection_model_zoo.md).
+
#### Train on PASCAL VOC
@@ -64,7 +50,6 @@ We will release the pre-trained models by ourself in the upcoming soon.
python -u train.py --batch_size=64 --dataset='pascalvoc' --pretrained_model='pretrained/ssd_mobilenet_v1_coco/'
```
- Set ```export CUDA_VISIBLE_DEVICES=0,1``` to specify the GPUs you want to use.
- - Set ```--dataset='coco2014'``` or ```--dataset='coco2017'``` to train model on MS COCO dataset.
- For more help on arguments:
```bash
@@ -88,19 +73,6 @@ You can evaluate your trained model in different metrics like 11point, integral
python eval.py --dataset='pascalvoc' --model_dir='train_pascal_model/best_model' --data_dir='data/pascalvoc' --test_list='test.txt' --ap_version='11point' --nms_threshold=0.45
```
-You can set ```--dataset``` to ```coco2014``` or ```coco2017``` to evaluate COCO dataset. Moreover, we provide `eval_coco_map.py` which uses a COCO-specific mAP metric defined by [COCO committee](http://cocodataset.org/#detections-eval). To use this eval_coco_map.py, [cocoapi](https://github.com/cocodataset/cocoapi) is needed.
-Install the cocoapi:
-```
-# COCOAPI=/path/to/clone/cocoapi
-git clone https://github.com/cocodataset/cocoapi.git $COCOAPI
-cd $COCOAPI/PythonAPI
-# Install into global site-packages
-make install
-# Alternatively, if you do not have permissions or prefer
-# not to install the COCO API into global site-packages
-python2 setup.py install --user
-```
-
### Infer and Visualize
`infer.py` is the main caller of the inferring module. Examples of usage are shown below.
```bash
diff --git a/fluid/PaddleCV/object_detection/README_cn.md b/fluid/PaddleCV/object_detection/README_cn.md
index 99603953a9dad956bcd13e7af68c59a9ae45c9cd..8c4cecab28e49c10820e092d3a521facf4be68ea 100644
--- a/fluid/PaddleCV/object_detection/README_cn.md
+++ b/fluid/PaddleCV/object_detection/README_cn.md
@@ -21,9 +21,8 @@ SSD 可以方便地插入到任何一种标准卷积网络中,比如 VGG、Res
### Data preparation
-You can use the [PASCAL VOC dataset](http://host.robots.ox.ac.uk/pascal/VOC/) or the [MS-COCO dataset](http://cocodataset.org/#download).
-To train on the PASCAL VOC dataset, please download it first with the command below.
+Please download the [PASCAL VOC dataset](http://host.robots.ox.ac.uk/pascal/VOC/) first with the command below:
```bash
cd data/pascalvoc
@@ -32,29 +31,19 @@ cd data/pascalvoc
The `download.sh` command will also create the training and testing file lists.
-To train on the MS-COCO dataset, please download it first with the command below.
-
-```
-cd data/coco
-./download.sh
-```
### Train
#### Download the pre-trained model
-We provide two pre-trained models. The first is a MobileNet-v1 SSD pre-trained on the COCO dataset, with its prediction heads removed so that it can be trained on datasets other than COCO. The second is a MobileNet-v1 pre-trained on the ImageNet 2012 dataset, with the final fully-connected layer removed for object detection training.
-
-Declaration: the MobileNet-v1 SSD model is converted from the [TensorFlow model](https://github.com/tensorflow/models/blob/f87a58cd96d45de73c9a8330a06b2ab56749a7fa/research/object_detection/g3doc/detection_model_zoo.md). The MobileNet-v1 model is converted from [Caffe](https://github.com/shicai/MobileNet-Caffe). We will also release our own pre-trained models soon.
+We provide two pre-trained models. The first is a MobileNet-v1 SSD pre-trained on the COCO dataset, with its prediction heads removed so that it can be trained on datasets other than COCO. The second is a MobileNet-v1 pre-trained on the ImageNet 2012 dataset, with the final fully-connected layer removed for object detection training. Download MobileNet-v1 SSD:
- - Download MobileNet-v1 SSD:
```bash
./pretrained/download_coco.sh
```
- - Download MobileNet-v1:
- ```bash
- ./pretrained/download_imagenet.sh
- ```
+
+Declaration: the MobileNet-v1 SSD model is converted from the [TensorFlow model](https://github.com/tensorflow/models/blob/f87a58cd96d45de73c9a8330a06b2ab56749a7fa/research/object_detection/g3doc/detection_model_zoo.md). The MobileNet-v1 model is converted from [Caffe](https://github.com/shicai/MobileNet-Caffe).
+
#### Train
@@ -63,7 +52,6 @@ cd data/coco
python -u train.py --batch_size=64 --dataset='pascalvoc' --pretrained_model='pretrained/ssd_mobilenet_v1_coco/'
```
- Set ```export CUDA_VISIBLE_DEVICES=0,1``` to specify the GPUs to use.
- - Set ```--dataset='coco2014'``` or ```--dataset='coco2017'``` to train on the MS-COCO dataset.
- For more options, see:
```bash
@@ -80,25 +68,13 @@ cd data/coco
### Evaluate
-You can evaluate the trained model on the PASCAL VOC and COCO datasets with metrics such as 11point and integral. Without loss of generality, the sample code uses each dataset's test list by default; you can also specify your own test list via ```--test_list```.
+You can evaluate the trained model on the PASCAL VOC dataset with metrics such as 11point and integral. Without loss of generality, the sample code uses the dataset's test list by default; you can also specify your own test list via ```--test_list```.
`eval.py` is the main entry of the evaluation module; example usage:
```bash
python eval.py --dataset='pascalvoc' --model_dir='train_pascal_model/best_model' --data_dir='data/pascalvoc' --test_list='test.txt' --ap_version='11point' --nms_threshold=0.45
```
-You can set ```--dataset``` to ```coco2014``` or ```coco2017``` to evaluate on the COCO dataset. We also provide `eval_coco_map.py` for the [official COCO evaluation](http://cocodataset.org/#detections-eval). To use eval_coco_map.py, first download [cocoapi](https://github.com/cocodataset/cocoapi):
-```
-# COCOAPI=/path/to/clone/cocoapi
-git clone https://github.com/cocodataset/cocoapi.git $COCOAPI
-cd $COCOAPI/PythonAPI
-# Install into global site-packages
-make install
-# Alternatively, if you do not have permissions or prefer
-# not to install the COCO API into global site-packages
-python2 setup.py install --user
-```
-
### Infer and visualize
`infer.py` is the main entry of the inference and visualization module; example usage:
diff --git a/fluid/PaddleCV/object_detection/README_quant.md b/fluid/PaddleCV/object_detection/README_quant.md
index 6723a48832d1b5210436eb2001234c6fe9149736..7ea7f7bd79d21ba34c84d1a1b48a5298837939ac 100644
--- a/fluid/PaddleCV/object_detection/README_quant.md
+++ b/fluid/PaddleCV/object_detection/README_quant.md
@@ -2,7 +2,7 @@
### Introduction
-The quantization-aware training used in this experiments is introduced in [fixed-point quantization desigin](https://gthub.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/quantization/fixed_point_quantization.md). Since quantization-aware training is still an active area of research and experimentation,
+The quantization-aware training used in these experiments is introduced in the [fixed-point quantization design](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/quantization/fixed_point_quantization.md). Since quantization-aware training is still an active area of research and experimentation,
here we just give a simple quantization-training example in Fluid based on the MobileNet-SSD model; more experiments are still needed, for example quantization-aware training that fuses batch normalization with convolution/fully-connected layers, channel-wise quantization of weights, and so on.
@@ -130,6 +130,9 @@ A Python transpiler is used to rewrite Fluid training program or evaluation prog
```
See 002271.jpg for the visualized image with bounding boxes.
+
+ **Note**: to convert the model to 8-bit, call `fluid.contrib.QuantizeTranspiler.convert_to_int8`. However, Paddle currently cannot load an 8-bit model for inference.
+
### Results
Results of MobileNet-v1-SSD 300x300 model on PascalVOC dataset.
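To make the note above concrete, here is a hedged sketch of where the conversion call sits relative to saving the evaluation program. Only `convert_to_int8(test_prog, place)` appears verbatim in `main_quant.py` later in this diff; the transpiler construction is an assumption about the `fluid.contrib` API of this release:

```python
import paddle.fluid as fluid

# `test_prog`, `exe`, `image`, `nmsed_out` and `model_save_dir` are assumed
# to come from the quantization-aware training script (see main_quant.py).
place = fluid.CUDAPlace(0)
transpiler = fluid.contrib.QuantizeTranspiler()
# ... quantization-aware training, then build the frozen test program ...
transpiler.convert_to_int8(test_prog, place)  # weights become 8-bit
# Paddle cannot yet load the resulting 8-bit model for inference.
fluid.io.save_inference_model(model_save_dir, [image.name], [nmsed_out],
                              exe, test_prog)
```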
diff --git a/fluid/PaddleCV/object_detection/_ce.py b/fluid/PaddleCV/object_detection/_ce.py
index 5f5d3e013a1bfca1ca0a0d1b6fb93a76a242496e..6f300e162b1c1940a2c8f1463953f0bcbeaa0a78 100644
--- a/fluid/PaddleCV/object_detection/_ce.py
+++ b/fluid/PaddleCV/object_detection/_ce.py
@@ -9,10 +9,10 @@ from kpi import CostKpi, DurationKpi, AccKpi
train_cost_kpi = CostKpi('train_cost', 0.02, 0, actived=True)
test_acc_kpi = AccKpi('test_acc', 0.01, 0, actived=False)
-train_speed_kpi = AccKpi('train_speed', 0.1, 0, actived=True)
+train_speed_kpi = DurationKpi('train_speed', 0.1, 0, actived=True, unit_repr="s/epoch")
train_cost_card4_kpi = CostKpi('train_cost_card4', 0.02, 0, actived=True)
test_acc_card4_kpi = AccKpi('test_acc_card4', 0.01, 0, actived=False)
-train_speed_card4_kpi = AccKpi('train_speed_card4', 0.1, 0, actived=True)
+train_speed_card4_kpi = DurationKpi('train_speed_card4', 0.1, 0, actived=True, unit_repr="s/epoch")
tracking_kpis = [
train_cost_kpi,
diff --git a/fluid/PaddleCV/object_detection/data_util.py b/fluid/PaddleCV/object_detection/data_util.py
deleted file mode 100644
index ac022593119e0008c3f7f3858303cbf5bc717650..0000000000000000000000000000000000000000
--- a/fluid/PaddleCV/object_detection/data_util.py
+++ /dev/null
@@ -1,151 +0,0 @@
-"""
-This code is based on https://github.com/fchollet/keras/blob/master/keras/utils/data_utils.py
-"""
-
-import time
-import numpy as np
-import threading
-import multiprocessing
-try:
- import queue
-except ImportError:
- import Queue as queue
-
-
-class GeneratorEnqueuer(object):
- """
- Builds a queue out of a data generator.
-
- Args:
- generator: a generator function which endlessly yields data
- use_multiprocessing (bool): use multiprocessing if True,
- otherwise use threading.
- wait_time (float): time to sleep in-between calls to `put()`.
- random_seed (int): Initial seed for workers,
- will be incremented by one for each workers.
- """
-
- def __init__(self,
- generator,
- use_multiprocessing=False,
- wait_time=0.05,
- random_seed=None):
- self.wait_time = wait_time
- self._generator = generator
- self._use_multiprocessing = use_multiprocessing
- self._threads = []
- self._stop_event = None
- self.queue = None
- self._manager = None
- self.seed = random_seed
-
- def start(self, workers=1, max_queue_size=10):
- """
- Start worker threads which add data from the generator into the queue.
-
- Args:
- workers (int): number of worker threads
- max_queue_size (int): queue size
- (when full, threads could block on `put()`)
- """
-
- def data_generator_task():
- """
- Data generator task.
- """
-
- def task():
- if (self.queue is not None and
- self.queue.qsize() < max_queue_size):
- generator_output = next(self._generator)
- self.queue.put((generator_output))
- else:
- time.sleep(self.wait_time)
-
- if not self._use_multiprocessing:
- while not self._stop_event.is_set():
- with self.genlock:
- try:
- task()
- except Exception:
- self._stop_event.set()
- break
- else:
- while not self._stop_event.is_set():
- try:
- task()
- except Exception:
- self._stop_event.set()
- break
-
- try:
- if self._use_multiprocessing:
- self._manager = multiprocessing.Manager()
- self.queue = self._manager.Queue(maxsize=max_queue_size)
- self._stop_event = multiprocessing.Event()
- else:
- self.genlock = threading.Lock()
- self.queue = queue.Queue()
- self._stop_event = threading.Event()
- for _ in range(workers):
- if self._use_multiprocessing:
- # Reset random seed else all children processes
- # share the same seed
- np.random.seed(self.seed)
- thread = multiprocessing.Process(target=data_generator_task)
- thread.daemon = True
- if self.seed is not None:
- self.seed += 1
- else:
- thread = threading.Thread(target=data_generator_task)
- self._threads.append(thread)
- thread.start()
- except:
- self.stop()
- raise
-
- def is_running(self):
- """
- Returns:
- bool: Whether the worker theads are running.
- """
- return self._stop_event is not None and not self._stop_event.is_set()
-
- def stop(self, timeout=None):
- """
- Stops running threads and wait for them to exit, if necessary.
- Should be called by the same thread which called `start()`.
-
- Args:
- timeout(int|None): maximum time to wait on `thread.join()`.
- """
- if self.is_running():
- self._stop_event.set()
- for thread in self._threads:
- if self._use_multiprocessing:
- if thread.is_alive():
- thread.terminate()
- else:
- thread.join(timeout)
- if self._manager:
- self._manager.shutdown()
-
- self._threads = []
- self._stop_event = None
- self.queue = None
-
- def get(self):
- """
- Creates a generator to extract data from the queue.
- Skip the data if it is `None`.
-
- # Yields
- tuple of data in the queue.
- """
- while self.is_running():
- if not self.queue.empty():
- inputs = self.queue.get()
- if inputs is not None:
- yield inputs
- else:
- time.sleep(self.wait_time)
diff --git a/fluid/PaddleCV/object_detection/eval.py b/fluid/PaddleCV/object_detection/eval.py
index 106fb67e073648f94934e7b17f02b964d276e5ec..157384b04f40ab2e3023fa57269267219b16d62d 100644
--- a/fluid/PaddleCV/object_detection/eval.py
+++ b/fluid/PaddleCV/object_detection/eval.py
@@ -52,7 +52,7 @@ def build_program(main_prog, startup_prog, args, data_args):
nmsed_out = fluid.layers.detection_output(
locs, confs, box, box_var, nms_threshold=args.nms_threshold)
with fluid.program_guard(main_prog):
- map = fluid.evaluator.DetectionMAP(
+ map = fluid.metrics.DetectionMAP(
nmsed_out,
gt_label,
gt_box,
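`DetectionMAP` moves from `fluid.evaluator` to `fluid.metrics`, but the usage pattern is unchanged. A short sketch consistent with how train.py uses it later in this diff; arguments beyond those visible here are assumptions:

```python
map_eval = fluid.metrics.DetectionMAP(
    nmsed_out, gt_label, gt_box,        # predictions and ground truth
    class_num=num_classes,              # assumed: dataset-dependent
    ap_version=args.ap_version)         # e.g. '11point' or 'integral'
cur_map, accum_map = map_eval.get_map_var()  # per-batch and accumulated mAP
map_eval.reset(exe)                     # clear accumulated state before a pass
```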
diff --git a/fluid/PaddleCV/object_detection/eval_coco_map.py b/fluid/PaddleCV/object_detection/eval_coco_map.py
index 0837f42ad89cda1e6a81825bc0545a11b48c4b3c..3e4d4ab8b3460263221b90d0dce787439f014f5b 100644
--- a/fluid/PaddleCV/object_detection/eval_coco_map.py
+++ b/fluid/PaddleCV/object_detection/eval_coco_map.py
@@ -47,7 +47,7 @@ def eval(args, data_args, test_list, batch_size, model_dir=None):
gt_iscrowd = fluid.layers.data(
name='gt_iscrowd', shape=[1], dtype='int32', lod_level=1)
gt_image_info = fluid.layers.data(
- name='gt_image_id', shape=[3], dtype='int32', lod_level=1)
+ name='gt_image_id', shape=[3], dtype='int32')
locs, confs, box, box_var = mobile_net(num_classes, image, image_shape)
nmsed_out = fluid.layers.detection_output(
@@ -57,14 +57,14 @@ def eval(args, data_args, test_list, batch_size, model_dir=None):
place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace()
exe = fluid.Executor(place)
+ exe.run(fluid.default_startup_program())
# yapf: disable
if model_dir:
def if_exist(var):
return os.path.exists(os.path.join(model_dir, var.name))
fluid.io.load_vars(exe, model_dir, predicate=if_exist)
# yapf: enable
- test_reader = paddle.batch(
- reader.test(data_args, test_list), batch_size=batch_size)
+ test_reader = reader.test(data_args, test_list, batch_size)
feeder = fluid.DataFeeder(
place=place,
feed_list=[image, gt_box, gt_label, gt_iscrowd, gt_image_info])
@@ -146,8 +146,7 @@ if __name__ == '__main__':
mean_value=[args.mean_value_B, args.mean_value_G, args.mean_value_R],
apply_distort=False,
apply_expand=False,
- ap_version=args.ap_version,
- toy=0)
+ ap_version=args.ap_version)
eval(
args,
data_args=data_args,
diff --git a/fluid/PaddleCV/object_detection/main_quant.py b/fluid/PaddleCV/object_detection/main_quant.py
index bd7d377e69e95dcaf066a40941cf48091583e7ab..2927858a1ea34fbc3adc6eb996ac596c50846fa4 100644
--- a/fluid/PaddleCV/object_detection/main_quant.py
+++ b/fluid/PaddleCV/object_detection/main_quant.py
@@ -85,7 +85,6 @@ def train(args,
batch_size = train_params['batch_size']
batch_size_per_device = batch_size // devices_num
- iters_per_epoc = train_params["train_images"] // batch_size
num_workers = 4
startup_prog = fluid.Program()
@@ -134,22 +133,22 @@ def train(args,
train_file_list,
batch_size_per_device,
shuffle=is_shuffle,
- use_multiprocessing=True,
- num_workers=num_workers,
- max_queue=24)
+ num_workers=num_workers)
test_reader = reader.test(data_args, val_file_list, batch_size)
train_py_reader.decorate_paddle_reader(train_reader)
test_py_reader.decorate_paddle_reader(test_reader)
- train_py_reader.start()
best_map = 0.
- try:
- for epoc in range(epoc_num):
- if epoc == 0:
- # test quantized model without quantization-aware training.
- test_map = test(exe, test_prog, map_eval, test_py_reader)
- # train
- for batch in range(iters_per_epoc):
+ for epoc in range(epoc_num):
+ if epoc == 0:
+ # test quantized model without quantization-aware training.
+ test_map = test(exe, test_prog, map_eval, test_py_reader)
+ batch = 0
+ train_py_reader.start()
+ while True:
+ try:
+ # train
start_time = time.time()
if parallel:
outs = train_exe.run(fetch_list=[loss.name])
@@ -157,18 +156,19 @@ def train(args,
outs = exe.run(train_prog, fetch_list=[loss])
end_time = time.time()
avg_loss = np.mean(np.array(outs[0]))
- if batch % 20 == 0:
+ if batch % 10 == 0:
 print("Epoc {:d}, batch {:d}, loss {:.6f}, time {:.5f}".format(
 epoc , batch, avg_loss, end_time - start_time))
+ batch += 1
- end_time = time.time()
- test_map = test(exe, test_prog, map_eval, test_py_reader)
- save_model(exe, train_prog, model_save_dir, str(epoc))
- if test_map > best_map:
- best_map = test_map
- save_model(exe, train_prog, model_save_dir, 'best_map')
- print("Best test map {0}".format(best_map))
- except (fluid.core.EOFException, StopIteration):
- train_py_reader.reset()
+ except (fluid.core.EOFException, StopIteration):
+ train_reader().close()
+ train_py_reader.reset()
+ break
+ test_map = test(exe, test_prog, map_eval, test_py_reader)
+ save_model(exe, train_prog, model_save_dir, str(epoc))
+ if test_map > best_map:
+ best_map = test_map
+ save_model(exe, train_prog, model_save_dir, 'best_map')
+ print("Best test map {0}".format(best_map))
def eval(args, data_args, configs, val_file_list):
@@ -212,6 +212,9 @@ def eval(args, data_args, configs, val_file_list):
test_map = test(exe, test_prog, map_eval, test_py_reader)
print("Test model {0}, map {1}".format(init_model, test_map))
+ # Convert the model to 8-bit before saving; note that Paddle cannot
+ # yet load an 8-bit model for inference.
+ # transpiler.convert_to_int8(test_prog, place)
fluid.io.save_inference_model(model_save_dir, [image.name],
[nmsed_out], exe, test_prog)
diff --git a/fluid/PaddleCV/object_detection/reader.py b/fluid/PaddleCV/object_detection/reader.py
index 59da1b38fb2e9cce8bb99a2773e7fc222ee33bd8..3559591c4ed5741d52f44bd92f4398d133b2e104 100644
--- a/fluid/PaddleCV/object_detection/reader.py
+++ b/fluid/PaddleCV/object_detection/reader.py
@@ -12,17 +12,17 @@
# See the License for the specific language governing permissions and
# limitations under the License.
-import image_util
-from paddle.utils.image_util import *
-from PIL import Image
-from PIL import ImageDraw
-import numpy as np
import xml.etree.ElementTree
import os
import time
import copy
import six
-from data_util import GeneratorEnqueuer
+import math
+import numpy as np
+from PIL import Image
+from PIL import ImageDraw
+import image_util
+import paddle
class Settings(object):
@@ -162,24 +162,19 @@ def preprocess(img, bbox_labels, mode, settings):
return img, sampled_labels
-def coco(settings, file_list, mode, batch_size, shuffle):
- # cocoapi
+def coco(settings, coco_api, file_list, mode, batch_size, shuffle, data_dir):
from pycocotools.coco import COCO
- from pycocotools.cocoeval import COCOeval
-
- coco = COCO(file_list)
- image_ids = coco.getImgIds()
- images = coco.loadImgs(image_ids)
- print("{} on {} with {} images".format(mode, settings.dataset, len(images)))
def reader():
if mode == 'train' and shuffle:
- np.random.shuffle(images)
+ np.random.shuffle(file_list)
batch_out = []
- for image in images:
+ for image in file_list:
image_name = image['file_name']
- image_path = os.path.join(settings.data_dir, image_name)
-
+ image_path = os.path.join(data_dir, image_name)
+ if not os.path.exists(image_path):
+ raise ValueError("%s does not exist, please specify "
+ "the data path correctly." % image_path)
im = Image.open(image_path)
if im.mode == 'L':
im = im.convert('RGB')
@@ -188,8 +183,8 @@ def coco(settings, file_list, mode, batch_size, shuffle):
# layout: category_id | xmin | ymin | xmax | ymax | iscrowd
bbox_labels = []
- annIds = coco.getAnnIds(imgIds=image['id'])
- anns = coco.loadAnns(annIds)
+ annIds = coco_api.getAnnIds(imgIds=image['id'])
+ anns = coco_api.loadAnns(annIds)
for ann in anns:
bbox_sample = []
# start from 1, leave 0 to background
@@ -229,20 +224,18 @@ def coco(settings, file_list, mode, batch_size, shuffle):
def pascalvoc(settings, file_list, mode, batch_size, shuffle):
- flist = open(file_list)
- images = [line.strip() for line in flist]
- print("{} on {} with {} images".format(mode, settings.dataset, len(images)))
-
def reader():
if mode == 'train' and shuffle:
- np.random.shuffle(images)
+ np.random.shuffle(file_list)
batch_out = []
cnt = 0
- for image in images:
+ for image in file_list:
image_path, label_path = image.split()
image_path = os.path.join(settings.data_dir, image_path)
label_path = os.path.join(settings.data_dir, label_path)
-
+ if not os.path.exists(image_path):
+ raise ValueError("%s does not exist, please specify "
+ "the data path correctly." % image_path)
im = Image.open(image_path)
if im.mode == 'L':
im = im.convert('RGB')
@@ -290,57 +283,62 @@ def train(settings,
file_list,
batch_size,
shuffle=True,
- use_multiprocessing=True,
num_workers=8,
- max_queue=24,
enable_ce=False):
- file_list = os.path.join(settings.data_dir, file_list)
-
+ file_path = os.path.join(settings.data_dir, file_list)
+ readers = []
if 'coco' in settings.dataset:
- generator = coco(settings, file_list, "train", batch_size, shuffle)
+ # cocoapi
+ from pycocotools.coco import COCO
+ coco_api = COCO(file_path)
+ image_ids = coco_api.getImgIds()
+ images = coco_api.loadImgs(image_ids)
+ n = int(math.ceil(len(images) / float(num_workers)))
+ image_lists = [images[i:i + n] for i in range(0, len(images), n)]
+
+ if '2014' in file_list:
+ sub_dir = "train2014"
+ elif '2017' in file_list:
+ sub_dir = "train2017"
+ data_dir = os.path.join(settings.data_dir, sub_dir)
+ for l in image_lists:
+ readers.append(
+ coco(settings, coco_api, l, 'train', batch_size, shuffle,
+ data_dir))
else:
- generator = pascalvoc(settings, file_list, "train", batch_size, shuffle)
-
- def infinite_reader():
- while True:
- for data in generator():
- yield data
+ images = [line.strip() for line in open(file_path)]
+ n = int(math.ceil(len(images) / float(num_workers)))
+ image_lists = [images[i:i + n] for i in range(0, len(images), n)]
+ for l in image_lists:
+ readers.append(pascalvoc(settings, l, 'train', batch_size, shuffle))
- def reader():
- try:
- enqueuer = GeneratorEnqueuer(
- infinite_reader(), use_multiprocessing=use_multiprocessing)
- enqueuer.start(max_queue_size=max_queue, workers=num_workers)
- generator_output = None
- while True:
- while enqueuer.is_running():
- if not enqueuer.queue.empty():
- generator_output = enqueuer.queue.get()
- break
- else:
- time.sleep(0.02)
- yield generator_output
- generator_output = None
- finally:
- if enqueuer is not None:
- enqueuer.stop()
-
- if enable_ce:
- return infinite_reader
- else:
- return reader
+ return paddle.reader.multiprocess_reader(readers, False)
def test(settings, file_list, batch_size):
file_list = os.path.join(settings.data_dir, file_list)
if 'coco' in settings.dataset:
- return coco(settings, file_list, 'test', batch_size, False)
+ from pycocotools.coco import COCO
+ coco_api = COCO(file_list)
+ image_ids = coco_api.getImgIds()
+ images = coco_api.loadImgs(image_ids)
+ if '2014' in file_list:
+ sub_dir = "val2014"
+ elif '2017' in file_list:
+ sub_dir = "val2017"
+ data_dir = os.path.join(settings.data_dir, sub_dir)
+ return coco(settings, coco_api, images, 'test', batch_size, False,
+ data_dir)
else:
- return pascalvoc(settings, file_list, 'test', batch_size, False)
+ image_list = [line.strip() for line in open(file_list)]
+ return pascalvoc(settings, image_list, 'test', batch_size, False)
def infer(settings, image_path):
def reader():
+ if not os.path.exists(image_path):
+ raise ValueError("%s does not exist, please specify "
+ "the data path correctly." % image_path)
img = Image.open(image_path)
if img.mode == 'L':
img = img.convert('RGB')
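The rewritten reader drops the custom `GeneratorEnqueuer` (data_util.py is deleted above) in favor of a simpler scheme: split the image list into `num_workers` shards, build one ordinary reader per shard, and combine them with `paddle.reader.multiprocess_reader`. A stripped-down sketch, with `make_reader` standing in for `coco()`/`pascalvoc()`:

```python
import math
import paddle

def sharded_train_reader(images, num_workers, make_reader):
    # ceil division so that every image lands in exactly one shard
    n = int(math.ceil(len(images) / float(num_workers)))
    shards = [images[i:i + n] for i in range(0, len(images), n)]
    readers = [make_reader(shard) for shard in shards]
    # use_pipe=False: exchange samples through a multiprocessing queue
    return paddle.reader.multiprocess_reader(readers, use_pipe=False)
```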
diff --git a/fluid/PaddleCV/object_detection/train.py b/fluid/PaddleCV/object_detection/train.py
index 2d830bcdf1d7900ca2f27055a9ec7568f75b6211..6fb7ce0236dc63f39de597788bc425f8dfa5ae6d 100644
--- a/fluid/PaddleCV/object_detection/train.py
+++ b/fluid/PaddleCV/object_detection/train.py
@@ -105,7 +105,7 @@ def build_program(main_prog, startup_prog, train_params, is_train):
with fluid.unique_name.guard("inference"):
nmsed_out = fluid.layers.detection_output(
locs, confs, box, box_var, nms_threshold=0.45)
- map_eval = fluid.evaluator.DetectionMAP(
+ map_eval = fluid.metrics.DetectionMAP(
nmsed_out,
gt_label,
gt_box,
@@ -141,7 +141,6 @@ def train(args,
batch_size = train_params['batch_size']
epoc_num = train_params['epoc_num']
batch_size_per_device = batch_size // devices_num
- iters_per_epoc = train_params["train_images"] // batch_size
num_workers = 8
startup_prog = fluid.Program()
@@ -186,9 +185,7 @@ def train(args,
train_file_list,
batch_size_per_device,
shuffle=is_shuffle,
- use_multiprocessing=True,
num_workers=num_workers,
- max_queue=24,
enable_ce=enable_ce)
test_reader = reader.test(data_args, val_file_list, batch_size)
train_py_reader.decorate_paddle_reader(train_reader)
@@ -205,7 +202,7 @@ def train(args,
def test(epoc_id, best_map):
_, accum_map = map_eval.get_map_var()
map_eval.reset(exe)
- every_epoc_map=[]
+ every_epoc_map = [] # for CE
test_py_reader.start()
try:
batch_id = 0
@@ -218,22 +215,23 @@ def train(args,
except fluid.core.EOFException:
test_py_reader.reset()
mean_map = np.mean(every_epoc_map)
- print("Epoc {0}, test map {1}".format(epoc_id, test_map))
+ print("Epoc {0}, test map {1}".format(epoc_id, test_map[0]))
if test_map[0] > best_map:
best_map = test_map[0]
save_model('best_model', test_prog)
return best_map, mean_map
- train_py_reader.start()
total_time = 0.0
- try:
- for epoc_id in range(epoc_num):
- epoch_idx = epoc_id + 1
- start_time = time.time()
- prev_start_time = start_time
- every_epoc_loss = []
- for batch_id in range(iters_per_epoc):
+ for epoc_id in range(epoc_num):
+ epoch_idx = epoc_id + 1
+ start_time = time.time()
+ prev_start_time = start_time
+ every_epoc_loss = []
+ batch_id = 0
+ train_py_reader.start()
+ while True:
+ try:
prev_start_time = start_time
start_time = time.time()
if parallel:
@@ -242,34 +240,35 @@ def train(args,
loss_v, = exe.run(train_prog, fetch_list=[loss])
loss_v = np.mean(np.array(loss_v))
every_epoc_loss.append(loss_v)
- if batch_id % 20 == 0:
+ if batch_id % 10 == 0:
print("Epoc {:d}, batch {:d}, loss {:.6f}, time {:.5f}".format(
epoc_id, batch_id, loss_v, start_time - prev_start_time))
- end_time = time.time()
- total_time += end_time - start_time
-
- best_map, mean_map = test(epoc_id, best_map)
- print("Best test map {0}".format(best_map))
- if epoc_id % 10 == 0 or epoc_id == epoc_num - 1:
- save_model(str(epoc_id), train_prog)
-
- if enable_ce and epoc_id == epoc_num - 1:
- train_avg_loss = np.mean(every_epoc_loss)
- if devices_num == 1:
- print("kpis train_cost %s" % train_avg_loss)
- print("kpis test_acc %s" % mean_map)
- print("kpis train_speed %s" % (total_time / epoch_idx))
- else:
- print("kpis train_cost_card%s %s" %
- (devices_num, train_avg_loss))
- print("kpis test_acc_card%s %s" %
- (devices_num, mean_map))
- print("kpis train_speed_card%s %f" %
- (devices_num, total_time / epoch_idx))
-
- except (fluid.core.EOFException, StopIteration):
- train_reader().close()
- train_py_reader.reset()
+ batch_id += 1
+ except (fluid.core.EOFException, StopIteration):
+ train_reader().close()
+ train_py_reader.reset()
+ break
+
+ end_time = time.time()
+ total_time += end_time - start_time
+ best_map, mean_map = test(epoc_id, best_map)
+ print("Best test map {0}".format(best_map))
+ if epoc_id % 10 == 0 or epoc_id == epoc_num - 1:
+ save_model(str(epoc_id), train_prog)
+
+ if enable_ce:
+ train_avg_loss = np.mean(every_epoc_loss)
+ if devices_num == 1:
+ print("kpis train_cost %s" % train_avg_loss)
+ print("kpis test_acc %s" % mean_map)
+ print("kpis train_speed %s" % (total_time / epoch_idx))
+ else:
+ print("kpis train_cost_card%s %s" %
+ (devices_num, train_avg_loss))
+ print("kpis test_acc_card%s %s" %
+ (devices_num, mean_map))
+ print("kpis train_speed_card%s %f" %
+ (devices_num, total_time / epoch_idx))
if __name__ == '__main__':
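The restructured epoch loop above replaces the precomputed `iters_per_epoc` count with the standard `py_reader` idiom: start the reader at the beginning of each epoch, run until it signals exhaustion, then reset it. Condensed, assuming the surrounding script's `train_py_reader`, `exe`, `train_prog` and `loss`:

```python
for epoc_id in range(epoc_num):
    train_py_reader.start()              # begin feeding this epoch
    batch_id = 0
    while True:
        try:
            loss_v, = exe.run(train_prog, fetch_list=[loss])
            batch_id += 1
        except (fluid.core.EOFException, StopIteration):
            train_py_reader.reset()      # rearm the reader for the next epoch
            break
```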
diff --git a/fluid/PaddleCV/ocr_recognition/README.md b/fluid/PaddleCV/ocr_recognition/README.md
index 5f050766d175db399dcab77c13fa470bc0368cd9..8b2d95694631e46d541d46c3f4950fd9a99ce0e3 100644
--- a/fluid/PaddleCV/ocr_recognition/README.md
+++ b/fluid/PaddleCV/ocr_recognition/README.md
@@ -80,7 +80,7 @@
During training, the options `--train_images` and `--train_list` point to the prepared `train_images` and `train_list`, respectively.
->**Note:** If `--train_images` and `--train_list` are both unset or set to None, ctc_reader.py automatically downloads the [sample data](http://paddle-ocr-data.bj.bcebos.com/data.tar.gz) and caches it under the `$HOME/.cache/paddle/dataset/ctc_data/data/` path.
+>**Note:** If `--train_images` and `--train_list` are both unset or set to None, reader.py automatically downloads the [sample data](http://paddle-ocr-data.bj.bcebos.com/data.tar.gz) and caches it under the `$HOME/.cache/paddle/dataset/ctc_data/data/` path.
**B. Test set and evaluation set**
@@ -119,17 +119,17 @@ data/test_images/00003.jpg
Train on a single GPU with the default data:
```
-env CUDA_VISIBLE_DEVICES=0 python ctc_train.py
+env CUDA_VISIBLE_DEVICES=0 python train.py
```
Train on CPU with the default data:
```
-env OMP_NUM_THREADS= python ctc_train.py --use_gpu False --parallel=False
+env OMP_NUM_THREADS= python train.py --use_gpu False --parallel=False
```
Train on multiple GPUs with the default data:
```
-env CUDA_VISIBLE_DEVICES=0,1,2,3 python ctc_train.py --parallel=True
+env CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --parallel=True
```
The `CTC model` is used by default; you can switch to the `attention model` with the option `--model="attention"`.
diff --git a/fluid/PaddleNLP/chinese_ner/README.md b/fluid/PaddleNLP/chinese_ner/README.md
index c5155b181b45eddcab3cd8d02d2036bd8c1e93ad..a458c83b5f1ad9c007d35ddfb7a6578fb14bbf2a 100644
--- a/fluid/PaddleNLP/chinese_ner/README.md
+++ b/fluid/PaddleNLP/chinese_ner/README.md
@@ -15,7 +15,14 @@
Under the data directory there are two folders: train_files holds the training data and test_files holds the test data. As an example, two files are placed in each directory; for actual training, put your data in the corresponding directory as needed and adapt the data-reading functions in reader.py to your data format.
## Training
-Edit the `main` function of [train.py](./train.py) to specify the data paths, then run `python train.py` to start training.
+
+Run
+
+```
+python train.py --help
+```
+
+to get help on the command-line arguments; after setting the data paths and other arguments correctly, run `train.py` to start training.
Training logs look like
```txt
@@ -31,7 +38,7 @@ pass_id:2, time_cost:0.740842103958s
```
## Prediction
-Edit the `infer` function of [infer.py](./infer.py) to specify the path of the model to test, the test data and the prediction tag file, then run `python infer.py` to start prediction.
+Similar to training, specify the path of the model to test, the test data and the prediction tag file, then run `infer.py` to start prediction.
Prediction results look like
```txt
diff --git a/fluid/PaddleNLP/chinese_ner/infer.py b/fluid/PaddleNLP/chinese_ner/infer.py
index e22832d38bc5308444201bd302798cf18cae7d99..a15fdb53d89f2f7845e6bb54aa32fe922bb64682 100644
--- a/fluid/PaddleNLP/chinese_ner/infer.py
+++ b/fluid/PaddleNLP/chinese_ner/infer.py
@@ -52,7 +52,7 @@ def parse_args():
def print_arguments(args):
print('----------- Configuration Arguments -----------')
- for arg, value in sorted(vars(args).iteritems()):
+ for arg, value in sorted(vars(args).items()):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
@@ -61,6 +61,7 @@ def load_reverse_dict(dict_path):
return dict((idx, line.strip().split("\t")[0])
for idx, line in enumerate(open(dict_path, "r").readlines()))
+
def to_lodtensor(data, place):
seq_lens = [len(seq) for seq in data]
cur_len = 0
@@ -76,7 +77,6 @@ def to_lodtensor(data, place):
return res
-
def infer(args):
word = fluid.layers.data(name='word', shape=[1], dtype='int64', lod_level=1)
mention = fluid.layers.data(
@@ -108,8 +108,8 @@ def infer(args):
profiler.reset_profiler()
iters = 0
for data in test_data():
- word = to_lodtensor(map(lambda x: x[0], data), place)
- mention = to_lodtensor(map(lambda x: x[1], data), place)
+ word = to_lodtensor(list(map(lambda x: x[0], data)), place)
+ mention = to_lodtensor(list(map(lambda x: x[1], data)), place)
start = time.time()
crf_decode = exe.run(inference_program,
@@ -122,12 +122,12 @@ def infer(args):
np_data = np.array(crf_decode[0])
word_count = 0
assert len(data) == len(lod_info) - 1
- for sen_index in xrange(len(data)):
+ for sen_index in range(len(data)):
assert len(data[sen_index][0]) == lod_info[
sen_index + 1] - lod_info[sen_index]
word_index = 0
- for tag_index in xrange(lod_info[sen_index],
- lod_info[sen_index + 1]):
+ for tag_index in range(lod_info[sen_index],
+ lod_info[sen_index + 1]):
word = str(data[sen_index][0][word_index])
gold_tag = label_reverse_dict[data[sen_index][2][
word_index]]
diff --git a/fluid/PaddleNLP/chinese_ner/train.py b/fluid/PaddleNLP/chinese_ner/train.py
index 7e59d2ed0793ae9499fc2a6618e762a9ac426800..fc65528cd34706ea905025702cbea0307bef0686 100644
--- a/fluid/PaddleNLP/chinese_ner/train.py
+++ b/fluid/PaddleNLP/chinese_ner/train.py
@@ -12,7 +12,7 @@ import reader
def parse_args():
- parser = argparse.ArgumentParser("Run inference.")
+ parser = argparse.ArgumentParser("Run training.")
parser.add_argument(
'--batch_size',
type=int,
@@ -65,7 +65,7 @@ def parse_args():
def print_arguments(args):
print('----------- Configuration Arguments -----------')
- for arg, value in sorted(vars(args).iteritems()):
+ for arg, value in sorted(vars(args).items()):
print('%s: %s' % (arg, value))
print('------------------------------------------------')
@@ -220,9 +220,9 @@ def test2(exe, chunk_evaluator, inference_program, test_data, place,
cur_fetch_list):
chunk_evaluator.reset()
for data in test_data():
- word = to_lodtensor(map(lambda x: x[0], data), place)
- mention = to_lodtensor(map(lambda x: x[1], data), place)
- target = to_lodtensor(map(lambda x: x[2], data), place)
+ word = to_lodtensor(list(map(lambda x: x[0], data)), place)
+ mention = to_lodtensor(list(map(lambda x: x[1], data)), place)
+ target = to_lodtensor(list(map(lambda x: x[2], data)), place)
result_list = exe.run(
inference_program,
feed={"word": word,
@@ -232,8 +232,9 @@ def test2(exe, chunk_evaluator, inference_program, test_data, place,
number_infer = np.array(result_list[0])
number_label = np.array(result_list[1])
number_correct = np.array(result_list[2])
- chunk_evaluator.update(number_infer[0], number_label[0],
- number_correct[0])
+ chunk_evaluator.update(number_infer[0].astype('int64'),
+ number_label[0].astype('int64'),
+ number_correct[0].astype('int64'))
return chunk_evaluator.eval()
@@ -241,9 +242,9 @@ def test(test_exe, chunk_evaluator, inference_program, test_data, place,
cur_fetch_list):
chunk_evaluator.reset()
for data in test_data():
- word = to_lodtensor(map(lambda x: x[0], data), place)
- mention = to_lodtensor(map(lambda x: x[1], data), place)
- target = to_lodtensor(map(lambda x: x[2], data), place)
+ word = to_lodtensor(list(map(lambda x: x[0], data)), place)
+ mention = to_lodtensor(list(map(lambda x: x[1], data)), place)
+ target = to_lodtensor(list(map(lambda x: x[2], data)), place)
result_list = test_exe.run(
fetch_list=cur_fetch_list,
feed={"word": word,
@@ -252,8 +253,9 @@ def test(test_exe, chunk_evaluator, inference_program, test_data, place,
number_infer = np.array(result_list[0])
number_label = np.array(result_list[1])
number_correct = np.array(result_list[2])
- chunk_evaluator.update(number_infer.sum(),
- number_label.sum(), number_correct.sum())
+ chunk_evaluator.update(number_infer.sum().astype('int64'),
+ number_label.sum().astype('int64'),
+ number_correct.sum().astype('int64'))
return chunk_evaluator.eval()
@@ -270,11 +272,6 @@ def main(args):
crf_decode = fluid.layers.crf_decoding(
input=feature_out, param_attr=fluid.ParamAttr(name='crfw'))
- inference_program = fluid.default_main_program().clone(for_test=True)
-
- sgd_optimizer = fluid.optimizer.SGD(learning_rate=1e-3)
- sgd_optimizer.minimize(avg_cost)
-
(precision, recall, f1_score, num_infer_chunks, num_label_chunks,
num_correct_chunks) = fluid.layers.chunk_eval(
input=crf_decode,
@@ -282,6 +279,11 @@ def main(args):
chunk_scheme="IOB",
num_chunk_types=int(math.ceil((args.label_dict_len - 1) / 2.0)))
+ inference_program = fluid.default_main_program().clone(for_test=True)
+
+ sgd_optimizer = fluid.optimizer.SGD(learning_rate=1e-3)
+ sgd_optimizer.minimize(avg_cost)
+
chunk_evaluator = fluid.metrics.ChunkEvaluator()
train_reader = paddle.batch(
@@ -312,7 +314,7 @@ def main(args):
test_exe = exe
batch_id = 0
- for pass_id in xrange(args.num_passes):
+ for pass_id in range(args.num_passes):
chunk_evaluator.reset()
train_reader_iter = train_reader()
start_time = time.time()
@@ -326,9 +328,9 @@ def main(args):
],
feed=feeder.feed(cur_batch))
chunk_evaluator.update(
- np.array(nums_infer).sum(),
- np.array(nums_label).sum(),
- np.array(nums_correct).sum())
+ np.array(nums_infer).sum().astype("int64"),
+ np.array(nums_label).sum().astype("int64"),
+ np.array(nums_correct).sum().astype("int64"))
cost_list = np.array(cost)
batch_id += 1
except StopIteration:
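The reordering in `main` above encodes a general Fluid rule: `clone(for_test=True)` must run after every layer the test program needs (here `crf_decoding` and `chunk_eval`) and before the optimizer appends backward ops; otherwise the clone either misses the metric ops or drags optimization ops into inference. Condensed, with `feature_out`, `target`, `avg_cost` and `args.label_dict_len` assumed from the surrounding script:

```python
import math
import paddle.fluid as fluid

crf_decode = fluid.layers.crf_decoding(
    input=feature_out, param_attr=fluid.ParamAttr(name='crfw'))
# metric layers first, so the cloned test program can fetch them
(precision, recall, f1_score, num_infer_chunks, num_label_chunks,
 num_correct_chunks) = fluid.layers.chunk_eval(
     input=crf_decode, label=target, chunk_scheme="IOB",
     num_chunk_types=int(math.ceil((args.label_dict_len - 1) / 2.0)))
inference_program = fluid.default_main_program().clone(for_test=True)
# only now add the optimizer; its ops must not leak into the clone
fluid.optimizer.SGD(learning_rate=1e-3).minimize(avg_cost)
```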
diff --git a/fluid/PaddleNLP/deep_attention_matching_net/_ce.py b/fluid/PaddleNLP/deep_attention_matching_net/_ce.py
index 0c38c0a3d1b0fc0a240a7bae928d9c07f8b95886..7ad30288074da3124c33fad6c96fd369a812c77c 100644
--- a/fluid/PaddleNLP/deep_attention_matching_net/_ce.py
+++ b/fluid/PaddleNLP/deep_attention_matching_net/_ce.py
@@ -7,8 +7,8 @@ from kpi import CostKpi, DurationKpi, AccKpi
#### NOTE kpi.py should shared in models in some way!!!!
-train_cost_kpi = CostKpi('train_cost', 0.02, actived=True)
-train_duration_kpi = DurationKpi('train_duration', 0.05, actived=True)
+train_cost_kpi = CostKpi('train_cost', 0.02, 0, actived=True)
+train_duration_kpi = DurationKpi('train_duration', 0.05, 0, actived=True)
tracking_kpis = [
train_cost_kpi,
diff --git a/fluid/PaddleNLP/deep_attention_matching_net/train_and_evaluate.py b/fluid/PaddleNLP/deep_attention_matching_net/train_and_evaluate.py
index f240615b59376e8d86ce2ebaddd8eae8ee15fe30..823d13de9facb2537fbb6358e16c339f2ea5354b 100644
--- a/fluid/PaddleNLP/deep_attention_matching_net/train_and_evaluate.py
+++ b/fluid/PaddleNLP/deep_attention_matching_net/train_and_evaluate.py
@@ -248,8 +248,9 @@ def train(args):
print("device count %d" % dev_count)
print("theoretical memory usage: ")
- print(fluid.contrib.memory_usage(
- program=train_program, batch_size=args.batch_size))
+ print(
+ fluid.contrib.memory_usage(
+ program=train_program, batch_size=args.batch_size))
exe = fluid.Executor(place)
exe.run(train_startup)
@@ -318,8 +319,9 @@ def train(args):
if (args.save_path is not None) and (step % save_step == 0):
save_path = os.path.join(args.save_path, "step_" + str(step))
print("Save model at step %d ... " % step)
- print(time.strftime('%Y-%m-%d %H:%M:%S',
- time.localtime(time.time())))
+ print(
+ time.strftime('%Y-%m-%d %H:%M:%S',
+ time.localtime(time.time())))
fluid.io.save_persistables(exe, save_path, train_program)
score_path = os.path.join(args.save_path, 'score.' + str(step))
@@ -358,8 +360,9 @@ def train(args):
save_path = os.path.join(args.save_path,
"step_" + str(step))
print("Save model at step %d ... " % step)
- print(time.strftime('%Y-%m-%d %H:%M:%S',
- time.localtime(time.time())))
+ print(
+ time.strftime('%Y-%m-%d %H:%M:%S',
+ time.localtime(time.time())))
fluid.io.save_persistables(exe, save_path, train_program)
score_path = os.path.join(args.save_path,
@@ -389,7 +392,11 @@ def train(args):
global_step, last_cost = train_with_pyreader(global_step)
else:
global_step, last_cost = train_with_feed(global_step)
- train_time += time.time() - begin_time
+
+ pass_time_cost = time.time() - begin_time
+ train_time += pass_time_cost
+ print("Pass {0}, pass_time_cost {1}"
+ .format(epoch, "%2.2f sec" % pass_time_cost))
# For internal continuous evaluation
if "CE_MODE_X" in os.environ:
print("kpis train_cost %f" % last_cost)
diff --git a/fluid/PaddleNLP/machine_reading_comprehension/_ce.py b/fluid/PaddleNLP/machine_reading_comprehension/_ce.py
index cff13c8722007987a3cd82f1298206248963e45a..a425fe951fb587749f31b18959917cdeed76a41d 100644
--- a/fluid/PaddleNLP/machine_reading_comprehension/_ce.py
+++ b/fluid/PaddleNLP/machine_reading_comprehension/_ce.py
@@ -3,6 +3,7 @@
import os
import sys
#sys.path.insert(0, os.environ['ceroot'])
+sys.path.append(os.environ['ceroot'])
from kpi import CostKpi, DurationKpi, AccKpi
#### NOTE kpi.py should shared in models in some way!!!!
diff --git a/fluid/PaddleNLP/machine_reading_comprehension/dataset.py b/fluid/PaddleNLP/machine_reading_comprehension/dataset.py
index 3aaf87be9a7b0659fa9e79eb8329911cbea73c55..c732ce041c5e82ea5e1471ba422f5b056a7cba8f 100644
--- a/fluid/PaddleNLP/machine_reading_comprehension/dataset.py
+++ b/fluid/PaddleNLP/machine_reading_comprehension/dataset.py
@@ -23,6 +23,7 @@ import json
import logging
import numpy as np
from collections import Counter
+import io
class BRCDataset(object):
@@ -67,7 +68,7 @@ class BRCDataset(object):
Args:
data_path: the data file to load
"""
- with open(data_path) as fin:
+ with io.open(data_path, 'r', encoding='utf-8') as fin:
data_set = []
for lidx, line in enumerate(fin):
sample = json.loads(line.strip())
diff --git a/fluid/PaddleNLP/machine_reading_comprehension/run.py b/fluid/PaddleNLP/machine_reading_comprehension/run.py
index dbe3a4b9a296fdaf089d55be3f0c9845422f0ce5..884549d106af7f44789728fb488b5e60e149e118 100644
--- a/fluid/PaddleNLP/machine_reading_comprehension/run.py
+++ b/fluid/PaddleNLP/machine_reading_comprehension/run.py
@@ -22,6 +22,7 @@ import os
import random
import json
import six
+import multiprocessing
import paddle
import paddle.fluid as fluid
@@ -445,7 +446,9 @@ def train(logger, args):
logger.info('Dev eval result: {}'.format(
bleu_rouge))
pass_end_time = time.time()
-
+ time_consumed = pass_end_time - pass_start_time
+ logger.info('epoch: {0}, epoch_time_cost: {1:.2f}'.format(
+ pass_id, time_consumed))
logger.info('Evaluating the model after epoch {}'.format(
pass_id))
if brc_data.dev_set is not None:
@@ -458,7 +461,7 @@ def train(logger, args):
else:
logger.warning(
'No dev set is loaded for evaluation in the dataset!')
- time_consumed = pass_end_time - pass_start_time
+
logger.info('Average train loss for epoch {} is {}'.format(
pass_id, "%.10f" % (1.0 * total_loss / total_num)))
diff --git a/fluid/PaddleNLP/neural_machine_translation/transformer/train.py b/fluid/PaddleNLP/neural_machine_translation/transformer/train.py
index 0e9c18416f62c85e76dd060f1fad44073e5841fc..16d48238941a03309cc9ba269cd619bd21e0f561 100644
--- a/fluid/PaddleNLP/neural_machine_translation/transformer/train.py
+++ b/fluid/PaddleNLP/neural_machine_translation/transformer/train.py
@@ -408,10 +408,19 @@ def test_context(exe, train_exe, dev_count):
test_data = prepare_data_generator(
args, is_test=True, count=dev_count, pyreader=pyreader)
- exe.run(startup_prog)
+ exe.run(startup_prog) # to init pyreader for testing
+ if TrainTaskConfig.ckpt_path:
+ fluid.io.load_persistables(
+ exe, TrainTaskConfig.ckpt_path, main_program=test_prog)
+
+ exec_strategy = fluid.ExecutionStrategy()
+ exec_strategy.use_experimental_executor = True
+ build_strategy = fluid.BuildStrategy()
test_exe = fluid.ParallelExecutor(
use_cuda=TrainTaskConfig.use_gpu,
main_program=test_prog,
+ build_strategy=build_strategy,
+ exec_strategy=exec_strategy,
share_vars_from=train_exe)
def test(exe=test_exe, pyreader=pyreader):
@@ -457,7 +466,11 @@ def train_loop(exe,
nccl2_trainer_id=0):
# Initialize the parameters.
if TrainTaskConfig.ckpt_path:
- fluid.io.load_persistables(exe, TrainTaskConfig.ckpt_path)
+ exe.run(startup_prog) # to init pyreader for training
+ logging.info("load checkpoint from {}".format(
+ TrainTaskConfig.ckpt_path))
+ fluid.io.load_persistables(
+ exe, TrainTaskConfig.ckpt_path, main_program=train_prog)
else:
logging.info("init fluid.framework.default_startup_program")
exe.run(startup_prog)
@@ -469,7 +482,7 @@ def train_loop(exe,
# For faster executor
exec_strategy = fluid.ExecutionStrategy()
exec_strategy.use_experimental_executor = True
- # exec_strategy.num_iteration_per_drop_scope = 5
+ exec_strategy.num_iteration_per_drop_scope = int(args.fetch_steps)
build_strategy = fluid.BuildStrategy()
# Since the token number differs among devices, customize gradient scale to
# use token average cost among multi-devices. and the gradient scale is
@@ -741,6 +754,7 @@ if __name__ == "__main__":
LOG_FORMAT = "[%(asctime)s %(levelname)s %(filename)s:%(lineno)d] %(message)s"
logging.basicConfig(
stream=sys.stdout, level=logging.DEBUG, format=LOG_FORMAT)
+ logging.getLogger().setLevel(logging.INFO)
args = parse_args()
train(args)
diff --git a/fluid/PaddleNLP/sequence_tagging_for_ner/infer.py b/fluid/PaddleNLP/sequence_tagging_for_ner/infer.py
index acf98d0f15f7f493654822751fb2619de20e5505..20f00ccda5dcaafeccf5435ae32db0d5d2dcbd6a 100644
--- a/fluid/PaddleNLP/sequence_tagging_for_ner/infer.py
+++ b/fluid/PaddleNLP/sequence_tagging_for_ner/infer.py
@@ -38,12 +38,10 @@ def infer(model_path, batch_size, test_data_file, vocab_file, target_file,
for data in test_data():
word = to_lodtensor([x[0] for x in data], place)
mark = to_lodtensor([x[1] for x in data], place)
- target = to_lodtensor([x[2] for x in data], place)
crf_decode = exe.run(
inference_program,
feed={"word": word,
- "mark": mark,
- "target": target},
+ "mark": mark},
fetch_list=fetch_targets,
return_numpy=False)
lod_info = (crf_decode[0].lod())[0]
diff --git a/fluid/PaddleNLP/sequence_tagging_for_ner/train.py b/fluid/PaddleNLP/sequence_tagging_for_ner/train.py
index f18300e1f11d3021a24bc38767238bc2b86b7c98..b77c081ba38015e1829fcc6c633e7fbaa4376bb1 100644
--- a/fluid/PaddleNLP/sequence_tagging_for_ner/train.py
+++ b/fluid/PaddleNLP/sequence_tagging_for_ner/train.py
@@ -30,7 +30,9 @@ def test(exe, chunk_evaluator, inference_program, test_data, test_fetch_list,
num_infer = np.array(rets[0])
num_label = np.array(rets[1])
num_correct = np.array(rets[2])
- chunk_evaluator.update(num_infer[0], num_label[0], num_correct[0])
+ chunk_evaluator.update(num_infer[0].astype('int64'),
+ num_label[0].astype('int64'),
+ num_correct[0].astype('int64'))
return chunk_evaluator.eval()
@@ -61,9 +63,6 @@ def main(train_data_file,
avg_cost, feature_out, word, mark, target = ner_net(
word_dict_len, label_dict_len, parallel)
- sgd_optimizer = fluid.optimizer.SGD(learning_rate=1e-3)
- sgd_optimizer.minimize(avg_cost)
-
crf_decode = fluid.layers.crf_decoding(
input=feature_out, param_attr=fluid.ParamAttr(name='crfw'))
@@ -77,6 +76,8 @@ def main(train_data_file,
inference_program = fluid.default_main_program().clone(for_test=True)
test_fetch_list = [num_infer_chunks, num_label_chunks, num_correct_chunks]
+ sgd_optimizer = fluid.optimizer.SGD(learning_rate=1e-3)
+ sgd_optimizer.minimize(avg_cost)
if "CE_MODE_X" not in os.environ:
train_reader = paddle.batch(
@@ -135,7 +136,7 @@ def main(train_data_file,
" pass_f1_score:" + str(test_pass_f1_score))
save_dirname = os.path.join(model_save_dir, "params_pass_%d" % pass_id)
- fluid.io.save_inference_model(save_dirname, ['word', 'mark', 'target'],
+ fluid.io.save_inference_model(save_dirname, ['word', 'mark'],
crf_decode, exe)
if "CE_MODE_X" in os.environ:
diff --git a/fluid/PaddleNLP/text_classification/README.md b/fluid/PaddleNLP/text_classification/README.md
index 43c15934fa62af3db2261be37803ce21ba6bf946..669774bac04fe906cc5bffafa1f60de60323c806 100644
--- a/fluid/PaddleNLP/text_classification/README.md
+++ b/fluid/PaddleNLP/text_classification/README.md
@@ -14,7 +14,7 @@
## Introduction and model details
-The PaddlePaddle v2 [text classification](https://github.com/PaddlePaddle/models/blob/develop/text/README.md) example describes the text classification task in detail, so it is not repeated here.
+The PaddlePaddle v2 [text classification](https://github.com/PaddlePaddle/models/blob/develop/legacy/text_classification/README.md) example describes the text classification task in detail, so it is not repeated here.
For the models, we adopt four common text classification networks: bow, cnn, lstm and gru.
## 训练
diff --git a/fluid/PaddleNLP/text_classification/async_executor/README.md b/fluid/PaddleNLP/text_classification/async_executor/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..0e36a8be7653787852d4d04b7603cec1046f61be
--- /dev/null
+++ b/fluid/PaddleNLP/text_classification/async_executor/README.md
@@ -0,0 +1,130 @@
+# Text Classification
+
+Below is a brief overview of the directory structure of this example:
+
+```text
+.
+|-- README.md # this file
+|-- data_generator # IMDB dataset generation tools
+| |-- IMDB.py # extends data_generator.py with the IMDB-specific processing logic
+| |-- build_raw_data.py # IMDB preprocessing; its output is read by splitfile.py (format: word word ... | label)
+| |-- data_generator.py # data generation framework that pairs with AsyncExecutor
+| `-- splitfile.py # splits the output of build_raw_data.py; its output is read by IMDB.py
+|-- data_generator.sh # entry point of the IMDB dataset generation tools
+|-- data_reader.py # data reading utilities used by the inference script
+|-- infer.py # inference script
+`-- train.py # training script
+```
+
+## Introduction
+
+This directory contains scripts that train a text classification task with fluid.AsyncExecutor. The network definitions are reused from nets.py in the parent directory.
+
+## Training
+
+1. Run `sh data_generator.sh` to download the IMDB dataset and convert it into training data that AsyncExecutor can read.
+2. Run `python train.py bow` to start training.
+ ```bash
+ python train.py bow # "bow" selects the network structure; it can be replaced with cnn, lstm or gru
+ ```
+
+3. (Optional) To customize the network structure, add your own network to [nets.py](../nets.py) and set the corresponding arguments in [train.py](./train.py); the training entry has the signature below, and a sketch of a custom network follows it.
+ ```python
+ def train(train_reader, # training data reader
+ word_dict, # vocabulary
+ network, # network configuration
+ use_cuda, # whether to use GPU
+ parallel, # whether to run in parallel
+ save_dirname, # path for saving the model
+ lr=0.2, # learning rate
+ batch_size=128, # samples per batch
+ pass_num=30): # number of training passes
+ ```
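+
+ A custom network only needs to follow the calling convention used by [train.py](./train.py), which expects `avg_cost, acc, prediction = network(data, label, dict_dim)`. The sketch below is a minimal, hypothetical bow-style example of that convention (the name `my_bow_net` and the layer sizes are illustrative only, not part of this repo):
+ ```python
+ import paddle.fluid as fluid
+
+ def my_bow_net(data, label, dict_dim, emb_dim=128, hid_dim=128, class_dim=2):
+     """A minimal bow-style classifier following the nets.py convention."""
+     emb = fluid.layers.embedding(input=data, size=[dict_dim, emb_dim])
+     bow = fluid.layers.sequence_pool(input=emb, pool_type='sum')
+     fc = fluid.layers.fc(input=bow, size=hid_dim, act="tanh")
+     prediction = fluid.layers.fc(input=fc, size=class_dim, act="softmax")
+     # average cross-entropy cost over the batch, plus batch accuracy
+     cost = fluid.layers.cross_entropy(input=prediction, label=label)
+     avg_cost = fluid.layers.mean(x=cost)
+     acc = fluid.layers.accuracy(input=prediction, label=label)
+     return avg_cost, acc, prediction
+ ```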
+
+## Sample training output
+
+```text
+pass_id: 0 pass_time_cost 4.723438
+pass_id: 1 pass_time_cost 3.867186
+pass_id: 2 pass_time_cost 4.490111
+pass_id: 3 pass_time_cost 4.573296
+pass_id: 4 pass_time_cost 4.180547
+pass_id: 5 pass_time_cost 4.214476
+pass_id: 6 pass_time_cost 4.520387
+pass_id: 7 pass_time_cost 4.149485
+pass_id: 8 pass_time_cost 3.821354
+pass_id: 9 pass_time_cost 5.136178
+pass_id: 10 pass_time_cost 4.137318
+pass_id: 11 pass_time_cost 3.943429
+pass_id: 12 pass_time_cost 3.766478
+pass_id: 13 pass_time_cost 4.235983
+pass_id: 14 pass_time_cost 4.796462
+pass_id: 15 pass_time_cost 4.668116
+pass_id: 16 pass_time_cost 4.373798
+pass_id: 17 pass_time_cost 4.298131
+pass_id: 18 pass_time_cost 4.260021
+pass_id: 19 pass_time_cost 4.244411
+pass_id: 20 pass_time_cost 3.705138
+pass_id: 21 pass_time_cost 3.728070
+pass_id: 22 pass_time_cost 3.817919
+pass_id: 23 pass_time_cost 4.698598
+pass_id: 24 pass_time_cost 4.859262
+pass_id: 25 pass_time_cost 5.725732
+pass_id: 26 pass_time_cost 5.102599
+pass_id: 27 pass_time_cost 3.876582
+pass_id: 28 pass_time_cost 4.762538
+pass_id: 29 pass_time_cost 3.797759
+```
+Unlike fluid.Executor, AsyncExecutor does not print the accuracy at the end of each pass. To monitor training, set the debug parameter of the fluid.AsyncExecutor.run() method to True; the fetch variables passed in will then be printed at the end of each pass:
+
+```python
+async_executor.run(
+ main_program,
+ dataset,
+ filelist,
+ thread_num,
+ [acc],
+ debug=True)
+```
+
+## Inference
+
+1. Run `python infer.py bow_model` to start inference.
+ ```bash
+ python infer.py bow_model # "bow_model" is the directory of the model to load
+ ```
+
+## Sample inference output
+```text
+model_path: bow_model/epoch0.model, avg_acc: 0.882600
+model_path: bow_model/epoch1.model, avg_acc: 0.887920
+model_path: bow_model/epoch2.model, avg_acc: 0.886920
+model_path: bow_model/epoch3.model, avg_acc: 0.884720
+model_path: bow_model/epoch4.model, avg_acc: 0.879760
+model_path: bow_model/epoch5.model, avg_acc: 0.876920
+model_path: bow_model/epoch6.model, avg_acc: 0.874160
+model_path: bow_model/epoch7.model, avg_acc: 0.872000
+model_path: bow_model/epoch8.model, avg_acc: 0.870360
+model_path: bow_model/epoch9.model, avg_acc: 0.868480
+model_path: bow_model/epoch10.model, avg_acc: 0.867240
+model_path: bow_model/epoch11.model, avg_acc: 0.866200
+model_path: bow_model/epoch12.model, avg_acc: 0.865560
+model_path: bow_model/epoch13.model, avg_acc: 0.865160
+model_path: bow_model/epoch14.model, avg_acc: 0.864480
+model_path: bow_model/epoch15.model, avg_acc: 0.864240
+model_path: bow_model/epoch16.model, avg_acc: 0.863800
+model_path: bow_model/epoch17.model, avg_acc: 0.863520
+model_path: bow_model/epoch18.model, avg_acc: 0.862760
+model_path: bow_model/epoch19.model, avg_acc: 0.862680
+model_path: bow_model/epoch20.model, avg_acc: 0.862240
+model_path: bow_model/epoch21.model, avg_acc: 0.862280
+model_path: bow_model/epoch22.model, avg_acc: 0.862080
+model_path: bow_model/epoch23.model, avg_acc: 0.861560
+model_path: bow_model/epoch24.model, avg_acc: 0.861280
+model_path: bow_model/epoch25.model, avg_acc: 0.861160
+model_path: bow_model/epoch26.model, avg_acc: 0.861080
+model_path: bow_model/epoch27.model, avg_acc: 0.860920
+model_path: bow_model/epoch28.model, avg_acc: 0.860800
+model_path: bow_model/epoch29.model, avg_acc: 0.860760
+```
+Note: the accuracy keeps decreasing because of overfitting; this can be ignored.
diff --git a/fluid/PaddleNLP/text_classification/async_executor/data_generator.sh b/fluid/PaddleNLP/text_classification/async_executor/data_generator.sh
new file mode 100644
index 0000000000000000000000000000000000000000..bb8b197afa7197b5d12eb3fc76cca66958722411
--- /dev/null
+++ b/fluid/PaddleNLP/text_classification/async_executor/data_generator.sh
@@ -0,0 +1,43 @@
+#!/usr/bin/env bash
+
+# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+pushd .
+cd ./data_generator
+
+# wget "http://ai.stanford.edu/%7Eamaas/data/sentiment/aclImdb_v1.tar.gz"
+if [ ! -f aclImdb_v1.tar.gz ]; then
+ wget "http://10.64.74.104:8080/paddle/dataset/imdb/aclImdb_v1.tar.gz"
+fi
+tar zxvf aclImdb_v1.tar.gz
+
+mkdir train_data
+python build_raw_data.py train | python splitfile.py 12 train_data
+
+mkdir test_data
+python build_raw_data.py test | python splitfile.py 12 test_data
+
+python IMDB.py train_data
+python IMDB.py test_data
+
+mv ./output_dataset/train_data ../
+mv ./output_dataset/test_data ../
+cp aclImdb/imdb.vocab ../
+
+rm -rf ./output_dataset
+rm -rf train_data
+rm -rf test_data
+rm -rf aclImdb
+popd
diff --git a/fluid/PaddleNLP/text_classification/async_executor/data_generator/IMDB.py b/fluid/PaddleNLP/text_classification/async_executor/data_generator/IMDB.py
new file mode 100644
index 0000000000000000000000000000000000000000..579df4e0e722d245cabc366ffaeeab71dbf2aa0a
--- /dev/null
+++ b/fluid/PaddleNLP/text_classification/async_executor/data_generator/IMDB.py
@@ -0,0 +1,60 @@
+# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import re
+import os, sys
+sys.path.append(os.path.abspath(os.path.join('..')))
+from data_generator import MultiSlotDataGenerator
+
+
+class IMDbDataGenerator(MultiSlotDataGenerator):
+ def load_resource(self, dictfile):
+ self._vocab = {}
+ wid = 0
+ with open(dictfile) as f:
+ for line in f:
+ self._vocab[line.strip()] = wid
+ wid += 1
+ self._unk_id = len(self._vocab)
+ self._pattern = re.compile(r'(;|,|\.|\?|!|\s|\(|\))')
+
+ def process(self, line):
+ send = '|'.join(line.split('|')[:-1]).lower().replace("\n", " ").strip()
+ label = [int(line.split('|')[-1])]
+
+ words = [x for x in self._pattern.split(send) if x and x != " "]
+ feas = [
+ self._vocab[x] if x in self._vocab else self._unk_id for x in words
+ ]
+
+ return ("words", feas), ("label", label)
+
+
+imdb = IMDbDataGenerator()
+imdb.load_resource("aclImdb/imdb.vocab")
+
+# data from files
+file_names = os.listdir(sys.argv[1])
+filelist = []
+for i in range(0, len(file_names)):
+ filelist.append(os.path.join(sys.argv[1], file_names[i]))
+
+line_limit = 2500
+process_num = 24
+imdb.run_from_files(
+ filelist=filelist,
+ line_limit=line_limit,
+ process_num=process_num,
+ output_dir=('output_dataset/%s' % (sys.argv[1])))
diff --git a/fluid/PaddleNLP/text_classification/async_executor/data_generator/data_generator.py b/fluid/PaddleNLP/text_classification/async_executor/data_generator/data_generator.py
new file mode 100644
index 0000000000000000000000000000000000000000..70d1e1f9a020be13f43129cf26964c860ae2ce4f
--- /dev/null
+++ b/fluid/PaddleNLP/text_classification/async_executor/data_generator/data_generator.py
@@ -0,0 +1,508 @@
+# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import sys
+import multiprocessing
+__all__ = ['MultiSlotDataGenerator']
+
+
+class DataGenerator(object):
+ def __init__(self):
+ self._proto_info = None
+
+ def _set_filelist(self, filelist):
+ if not isinstance(filelist, list) and not isinstance(filelist, tuple):
+ raise ValueError("filelist%s must be in list or tuple type" %
+ type(filelist))
+ if not filelist:
+ raise ValueError("filelist can not be empty")
+ self._filelist = filelist
+
+ def _set_process_num(self, process_num):
+ if not isinstance(process_num, int):
+ raise ValueError("process_num%s must be in int type" %
+ type(process_num))
+ if process_num < 1:
+ raise ValueError("process_num can not less than 1")
+ self._process_num = process_num
+
+ def _set_line_limit(self, line_limit):
+ if not isinstance(line_limit, int):
+ raise ValueError("line_limit%s must be in int type" %
+ type(line_limit))
+ if line_limit < 1:
+ raise ValueError("line_limit can not less than 1")
+ self._line_limit = line_limit
+
+ def _set_output_dir(self, output_dir):
+ if not isinstance(output_dir, str):
+ raise ValueError("output_dir%s must be in str type" %
+ type(output_dir))
+ if not output_dir:
+ raise ValueError("output_dir can not be empty")
+ self._output_dir = output_dir
+
+ def _set_output_prefix(self, output_prefix):
+ if not isinstance(output_prefix, str):
+ raise ValueError("output_prefix%s must be in str type" %
+ type(output_prefix))
+ self._output_prefix = output_prefix
+
+ def _set_output_fill_digit(self, output_fill_digit):
+ if not isinstance(output_fill_digit, int):
+ raise ValueError("output_fill_digit%s must be in int type" %
+ type(output_fill_digit))
+ if output_fill_digit < 1:
+ raise ValueError("output_fill_digit can not less than 1")
+ self._output_fill_digit = output_fill_digit
+
+ def _set_proto_filename(self, proto_filename):
+ if not isinstance(proto_filename, str):
+ raise ValueError("proto_filename%s must be in str type" %
+ type(proto_filename))
+ if not proto_filename:
+ raise ValueError("proto_filename can not be empty")
+ self._proto_filename = proto_filename
+
+ def _print_info(self):
+ '''
+ Print the configuration information
+ (called only by the run_from_files function).
+ '''
+ sys.stderr.write("=" * 16 + " config " + "=" * 16 + "\n")
+ sys.stderr.write(" filelist size: %d\n" % len(self._filelist))
+ sys.stderr.write(" process num: %d\n" % self._process_num)
+ sys.stderr.write(" line limit: %d\n" % self._line_limit)
+ sys.stderr.write(" output dir: %s\n" % self._output_dir)
+ sys.stderr.write(" output prefix: %s\n" % self._output_prefix)
+ sys.stderr.write(" output fill digit: %d\n" % self._output_fill_digit)
+ sys.stderr.write(" proto filename: %s\n" % self._proto_filename)
+ sys.stderr.write("==== This may take a few minutes... ====\n")
+
+ def _get_output_filename(self, output_index, lock=None):
+ '''
+ This function is used to get the name of the output file and
+ update output_index.
+ Args:
+ output_index(manager.Value(i)): the index of output file.
+ lock(manager.Lock): The lock for process safety.
+ Return:
+ Return the name(string) of output file.
+ '''
+ if lock is not None: lock.acquire()
+ file_index = output_index.value
+ output_index.value += 1
+ if lock is not None: lock.release()
+ filename = os.path.join(self._output_dir, self._output_prefix) \
+ + str(file_index).zfill(self._output_fill_digit)
+ sys.stderr.write("[%d] write data to file: %s\n" %
+ (os.getpid(), filename))
+ return filename
+
+ def run_from_stdin(self,
+ is_local=True,
+ hadoop_host=None,
+ hadoop_ugi=None,
+ proto_path=None,
+ proto_filename="data_feed.proto"):
+ '''
+ This function reads data rows from stdin, parses each row with the
+ process function, and further parses the return value of the
+ process function with the _gen_str function. The parsed data will
+ be written to stdout and the corresponding protofile will be
+ generated. If is_local is set to False, the protofile will be
+ uploaded to hadoop.
+ Args:
+ is_local(bool): Whether to execute locally. If it is False, the
+ protofile will be uploaded to hadoop. The
+ default value is True.
+ hadoop_host(str): The host name of the hadoop. It should be
+ in this format: "hdfs://${HOST}:${PORT}".
+ hadoop_ugi(str): The ugi of the hadoop. It should be in this
+ format: "${USERNAME},${PASSWORD}".
+ proto_path(str): The hadoop path you want to upload the
+ protofile to.
+ proto_filename(str): The name of protofile. The default value
+ is "data_feed.proto". It is not
+ recommended to modify it.
+ '''
+ if is_local:
+ print \
+'''\033[1;34m=======================================================
+ Pay attention to that the version of Python in Hadoop
+ may be inconsistent with the local version. Please check the
+ Python version of Hadoop to ensure that it is >= 2.7.
+=======================================================\033[0m'''
+ else:
+ if hadoop_ugi is None or \
+ hadoop_host is None or \
+ proto_path is None:
+ raise ValueError(
+ "pls set hadoop_ugi, hadoop_host, and proto_path")
+ self._set_proto_filename(proto_filename)
+ for line in sys.stdin:
+ user_parsed_line = self.process(line)
+ sys.stdout.write(self._gen_str(user_parsed_line))
+ if self._proto_info is not None:
+ # some tasks may not have received any data
+ with open(self._proto_filename, "w") as f:
+ f.write(self._get_proto_desc(self._proto_info))
+ if is_local == False:
+ cmd = "$HADOOP_HOME/bin/hadoop fs" \
+ + " -Dhadoop.job.ugi=" + hadoop_ugi \
+ + " -Dfs.default.name=" + hadoop_host \
+ + " -put " + self._proto_filename + " " + proto_path
+ os.system(cmd)
+
+ def run_from_files(self,
+ filelist,
+ line_limit,
+ process_num=1,
+ output_dir="./output_dataset",
+ output_prefix="part-",
+ output_fill_digit=8,
+ proto_filename="data_feed.proto"):
+ '''
+ This function runs process_num processes to process the files in
+ the filelist. It creates the output data folder (output_dir) in the
+ current directory and writes the processed data into it, with
+ line_limit entries per file; filenames consist of the prefix
+ output_prefix and a suffix of output_fill_digit digits. The
+ proto_info is generated at the same time, and the proto file is
+ named proto_filename.
+ Args:
+ filelist(list or tuple): Files that need to be processed.
+ line_limit(int): Maximum number of data stored per file.
+ process_num(int): Number of processes running simultaneously.
+ output_dir(str): The name of the folder where the output
+ data file is stored.
+ output_prefix(str): The prefix of output data file.
+ output_fill_digit(int): The number of suffix numbers of the
+ output data file.
+ proto_filename(str): The name of protofile.
+ '''
+ self._set_filelist(filelist)
+ self._set_line_limit(line_limit)
+ self._set_process_num(min(process_num, len(filelist)))
+ self._set_output_dir(output_dir)
+ self._set_output_prefix(output_prefix)
+ self._set_output_fill_digit(output_fill_digit)
+ self._set_proto_filename(proto_filename)
+ self._print_info()
+
+ if not os.path.exists(self._output_dir):
+ os.makedirs(self._output_dir)
+ elif not os.path.isdir(self._output_dir):
+ raise ValueError("%s is not a directory" % self._output_dir)
+
+ processes = multiprocessing.Pool()
+ manager = multiprocessing.Manager()
+ output_index = manager.Value('i', 0)
+ file_queue = manager.Queue()
+ lock = manager.Lock()
+ remaining_queue = manager.Queue()
+ for file in self._filelist:
+ file_queue.put(file)
+ info_result = []
+ for i in range(self._process_num):
+ info_result.append(processes.apply_async(subprocess_wrapper, \
+ (self, file_queue, remaining_queue, output_index, lock, )))
+ processes.close()
+ processes.join()
+
+ infos = [
+ result.get() for result in info_result if result.get() is not None
+ ]
+ proto_info = self._combine_infos(infos)
+ with open(os.path.join(self._output_dir, self._proto_filename),
+ "w") as f:
+ f.write(self._get_proto_desc(proto_info))
+
+ while not remaining_queue.empty():
+ with open(self._get_output_filename(output_index), "w") as f:
+ for i in range(min(self._line_limit, remaining_queue.qsize())):
+ f.write(remaining_queue.get(False))
+
+ def _subprocess(self, file_queue, remaining_queue, output_index, lock):
+ '''
+ This function will be called by multiple processes. It is used to
+ continuously fetch files from file_queue, using process() function
+ (defined by the user) and the _gen_str() function (defined by concrete
+ classes) to process the data line by line, and write the processed
+ data to output files of self._line_limit lines each. Once the
+ file_queue has been consumed, any leftover data of fewer than
+ self._line_limit lines is stored in the remaining_queue.
+ Args:
+ file_queue(manager.Queue): The queue contains all the file
+ names to be processed.
+ remaining_queue(manager.Queue): The queue contains the data that
+ is less than the self._line_limit
+ line.
+ output_index(manager.Value(i)): The index(suffix) of the
+ output file.
+ lock(manager.Lock): The lock for process safety.
+ Returns:
+ Return a proto_info which can be translated into a proto string.
+ '''
+ buffer = []
+ while not file_queue.empty():
+ try:
+ filename = file_queue.get(False)
+ except: # file_queue empty
+ break
+ with open(filename, 'r') as f:
+ for line in f:
+ buffer.append(self._gen_str(self.process(line)))
+ if len(buffer) == self._line_limit:
+ with open(
+ self._get_output_filename(output_index, lock),
+ "w") as wf:
+ for x in buffer:
+ wf.write(x)
+ buffer = []
+ if buffer:
+ for x in buffer:
+ remaining_queue.put(x)
+ return self._proto_info
+
+ def _gen_str(self, line):
+ '''
+ Further process the output of the user-defined process() function
+ into data that can be directly read by the datafeed, and update the
+ proto_info information.
+ Args:
+ line(str): the output of the process() function rewritten by user.
+ Returns:
+ Return a string data that can be read directly by the datafeed.
+ '''
+ raise NotImplementedError(
+ "pls use MultiSlotDataGenerator or PairWiseDataGenerator")
+
+ def _combine_infos(self, infos):
+ '''
+ This function is used to merge proto_info information from different
+ processes. In general, the proto_info of each process is consistent.
+ Args:
+ infos(list): the list of proto_infos from different processes.
+ Returns:
+ Return a unified proto_info.
+ '''
+ raise NotImplementedError(
+ "pls use MultiSlotDataGenerator or PairWiseDataGenerator")
+
+ def _get_proto_desc(self, proto_info):
+ '''
+ This function outputs the string of the proto file(can be directly
+ written to the file) according to the proto_info information.
+ Args:
+ proto_info: The proto information used to generate the proto
+ string. The type of the variable will be determined
+ by the subclass. In the MultiSlotDataGenerator,
+ proto_info variable is a list of tuple.
+ Returns:
+ Returns a string of the proto file.
+ '''
+ raise NotImplementedError(
+ "pls use MultiSlotDataGenerator or PairWiseDataGenerator")
+
+ def process(self, line):
+ '''
+ This function needs to be overridden by the user to process the
+ original data row into a list or tuple.
+ Args:
+ line(str): the original data row
+ Returns:
+ Returns the data processed by the user.
+ The data format is list or tuple:
+ [(name, [feasign, ...]), ...]
+ or ((name, [feasign, ...]), ...)
+
+ For example:
+ [("words", [1926, 08, 17]), ("label", [1])]
+ or (("words", [1926, 08, 17]), ("label", [1]))
+ Note:
+ The type of feasigns must be in int or float. Once the float
+ element appears in the feasign, the type of that slot will be
+ processed into a float.
+ '''
+ raise NotImplementedError(
+ "pls rewrite this function to return a list or tuple: " +
+ "[(name, [feasign, ...]), ...] or ((name, [feasign, ...]), ...)")
+
+
+def subprocess_wrapper(instance, file_queue, remaining_queue, output_index,
+ lock):
+ '''
+ In order to use the class function as a process, you need to wrap it.
+ '''
+ return instance._subprocess(file_queue, remaining_queue, output_index, lock)
+
+
+class MultiSlotDataGenerator(DataGenerator):
+ def _combine_infos(self, infos):
+ '''
+ This function is used to merge proto_info information from different
+ processes. In general, the proto_info of each process is consistent.
+ The type of input infos is list, and the type of element of infos is
+ tuple. The format of element of infos will be (name, type).
+ Args:
+ infos(list): the list of proto_infos from different processes.
+ Returns:
+ Return a unified proto_info.
+ Note:
+ This function is only called by the run_from_files function, so
+ when using the run_from_stdin function (usually used for hadoop),
+ the output of the process function (rewritten by the user) must
+ not give the same field both float and int values.
+ '''
+ proto_info = infos[0]
+ for info in infos:
+ for index, slot in enumerate(info):
+ name, type = slot
+ if name != proto_info[index][0]:
+ raise ValueError(
+ "combine infos error, pls contact the maintainer of this code~"
+ )
+ if type == "float" and proto_info[index][1] == "uint64":
+ proto_info[index] = (name, type)
+ return proto_info
+
+ def _get_proto_desc(self, proto_info):
+ '''
+ Generate a string of proto file based on the proto_info information.
+
+ The proto_info will be a list of tuples:
+ >>> [(Name, Type), ...]
+
+ The string of proto file will be in this format:
+ >>> name: "MultiSlotDataFeed"
+ >>> batch_size: 32
+ >>> multi_slot_desc {
+ >>> slots {
+ >>> name: Name
+ >>> type: Type
+ >>> is_dense: false
+ >>> is_used: false
+ >>> }
+ >>> }
+ Args:
+ proto_info(list): The proto information used to generate the
+ proto string.
+ Returns:
+ Returns a string of the proto file.
+ '''
+ proto_str = "name: \"MultiSlotDataFeed\"\n" \
+ + "batch_size: 32\nmulti_slot_desc {\n"
+ for elem in proto_info:
+ proto_str += " slots {\n" \
+ + " name: \"%s\"\n" % elem[0]\
+ + " type: \"%s\"\n" % elem[1]\
+ + " is_dense: false\n" \
+ + " is_used: false\n" \
+ + " }\n"
+ proto_str += "}"
+ return proto_str
+
+ def _gen_str(self, line):
+ '''
+ Further process the output of the user-defined process() function
+ into data that can be directly read by the MultiSlotDataFeed, and
+ update the proto_info information.
+ The input line will be in this format:
+ >>> [(name, [feasign, ...]), ...]
+ >>> or ((name, [feasign, ...]), ...)
+ The output will be in this format:
+ >>> [ids_num id1 id2 ...] ...
+ The proto_info will be in this format:
+ >>> [(name, type), ...]
+
+ For example, if the input is like this:
+ >>> [("words", [1926, 08, 17]), ("label", [1])]
+ >>> or (("words", [1926, 08, 17]), ("label", [1]))
+ the output will be:
+ >>> 3 1234 2345 3456 1 1
+ the proto_info will be:
+ >>> [("words", "uint64"), ("label", "uint64")]
+ Args:
+ line(str): the output of the process() function rewritten by user.
+ Returns:
+ Return a string data that can be read directly by the MultiSlotDataFeed.
+ '''
+ if not isinstance(line, list) and not isinstance(line, tuple):
+ raise ValueError(
+ "the output of process() must be in list or tuple type")
+ output = ""
+
+ if self._proto_info is None:
+ self._proto_info = []
+ for item in line:
+ name, elements = item
+ if not isinstance(name, str):
+ raise ValueError("name%s must be in str type" % type(name))
+ if not isinstance(elements, list):
+ raise ValueError("elements%s must be in list type" %
+ type(elements))
+ if not elements:
+ raise ValueError(
+ "the elements of each field can not be empty, you need padding it in process()."
+ )
+ self._proto_info.append((name, "uint64"))
+ if output:
+ output += " "
+ output += str(len(elements))
+ for elem in elements:
+ if isinstance(elem, float):
+ self._proto_info[-1] = (name, "float")
+ elif not isinstance(elem, int) and not isinstance(elem,
+ long):
+ raise ValueError(
+ "the type of element%s must be in int or float" %
+ type(elem))
+ output += " " + str(elem)
+ else:
+ if len(line) != len(self._proto_info):
+ raise ValueError(
+ "the complete field set of two given line are inconsistent.")
+ for index, item in enumerate(line):
+ name, elements = item
+ if not isinstance(name, str):
+ raise ValueError("name%s must be in str type" % type(name))
+ if not isinstance(elements, list):
+ raise ValueError("elements%s must be in list type" %
+ type(elements))
+ if not elements:
+ raise ValueError(
+ "the elements of each field can not be empty, you need padding it in process()."
+ )
+ if name != self._proto_info[index][0]:
+ raise ValueError(
+ "the field name of two given line are not match: require<%s>, get<%d>."
+ % (self._proto_info[index][0], name))
+ if output:
+ output += " "
+ output += str(len(elements))
+ for elem in elements:
+ if self._proto_info[index][1] != "float":
+ if isinstance(elem, float):
+ self._proto_info[index] = (name, "float")
+ elif not isinstance(elem, int) and not isinstance(elem,
+ long):
+ raise ValueError(
+ "the type of element%s must be in int or float"
+ % type(elem))
+ output += " " + str(elem)
+ return output + "\n"
diff --git a/fluid/PaddleNLP/text_classification/async_executor/data_generator/splitfile.py b/fluid/PaddleNLP/text_classification/async_executor/data_generator/splitfile.py
new file mode 100644
index 0000000000000000000000000000000000000000..414e097c5a6352673c00e487230e38bf64e6299a
--- /dev/null
+++ b/fluid/PaddleNLP/text_classification/async_executor/data_generator/splitfile.py
@@ -0,0 +1,29 @@
+# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Split file into parts
+"""
+import sys
+import os
+block = int(sys.argv[1])
+datadir = sys.argv[2]
+file_list = []
+for i in range(block):
+ file_list.append(open(datadir + "/part-" + str(i), "w"))
+id_ = 0
+for line in sys.stdin:
+ file_list[id_ % block].write(line)
+ id_ += 1
+for f in file_list:
+ f.close()
diff --git a/fluid/PaddleNLP/text_classification/async_executor/data_reader.py b/fluid/PaddleNLP/text_classification/async_executor/data_reader.py
new file mode 100644
index 0000000000000000000000000000000000000000..0cdad76295ea7d6d2eb7f2982aa5cb6b830d4976
--- /dev/null
+++ b/fluid/PaddleNLP/text_classification/async_executor/data_reader.py
@@ -0,0 +1,50 @@
+# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import sys
+import os
+import paddle
+
+
+def parse_fields(fields):
+ words_width = int(fields[0])
+ words = fields[1:1 + words_width]
+ label = fields[-1]
+
+ return words, label
+
+
+def imdb_data_feed_reader(data_dir, batch_size, buf_size):
+ """
+ Data feed reader for IMDB dataset.
+ This dataset has been converted from its original format to a format
+ suitable for AsyncExecutor.
+ See data_feed.proto for the data format.
+ """
+
+ def reader():
+ for file in os.listdir(data_dir):
+ if file.endswith('.proto'):
+ continue
+
+ with open(os.path.join(data_dir, file), 'r') as f:
+ for line in f:
+ fields = line.split(' ')
+ words, label = parse_fields(fields)
+ yield words, label
+
+ test_reader = paddle.batch(
+ paddle.reader.shuffle(
+ reader, buf_size=buf_size), batch_size=batch_size)
+ return test_reader
diff --git a/fluid/PaddleNLP/text_classification/async_executor/infer.py b/fluid/PaddleNLP/text_classification/async_executor/infer.py
new file mode 100644
index 0000000000000000000000000000000000000000..5c9f53afbc992ab1804eb5995702fbcd14e7dcbf
--- /dev/null
+++ b/fluid/PaddleNLP/text_classification/async_executor/infer.py
@@ -0,0 +1,79 @@
+# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import sys
+import time
+import unittest
+import contextlib
+import numpy as np
+
+import paddle
+import paddle.fluid as fluid
+
+import data_reader
+
+
+def infer(test_reader, use_cuda, model_path=None):
+ """
+ inference function
+ """
+ if model_path is None:
+ print(str(model_path) + " cannot be found")
+ return
+
+ place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
+ exe = fluid.Executor(place)
+
+ inference_scope = fluid.core.Scope()
+ with fluid.scope_guard(inference_scope):
+ [inference_program, feed_target_names,
+ fetch_targets] = fluid.io.load_inference_model(model_path, exe)
+
+ total_acc = 0.0
+ total_count = 0
+ for data in test_reader():
+ acc = exe.run(inference_program,
+ feed=utils.data2tensor(data, place),
+ fetch_list=fetch_targets,
+ return_numpy=True)
+ total_acc += acc[0] * len(data)
+ total_count += len(data)
+
+ avg_acc = total_acc / total_count
+ print("model_path: %s, avg_acc: %f" % (model_path, avg_acc))
+
+
+if __name__ == "__main__":
+ if __package__ is None:
+ from os import sys, path
+ sys.path.append(
+ os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
+ import utils
+
+ batch_size = 128
+ model_path = sys.argv[1]
+ test_data_dirname = 'test_data'
+
+ if len(sys.argv) == 3:
+ test_data_dirname = sys.argv[2]
+
+ test_reader = data_reader.imdb_data_feed_reader(
+ test_data_dirname, batch_size, buf_size=500000)
+
+ models = os.listdir(model_path)
+ for i in range(0, len(models)):
+ epoch_path = "epoch" + str(i) + ".model"
+ epoch_path = os.path.join(model_path, epoch_path)
+ infer(test_reader, use_cuda=False, model_path=epoch_path)
diff --git a/fluid/PaddleNLP/text_classification/async_executor/train.py b/fluid/PaddleNLP/text_classification/async_executor/train.py
new file mode 100644
index 0000000000000000000000000000000000000000..034d65dd5bf94a717f791e04b8648d9606528d6c
--- /dev/null
+++ b/fluid/PaddleNLP/text_classification/async_executor/train.py
@@ -0,0 +1,112 @@
+# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import sys
+import time
+import multiprocessing
+
+import paddle
+import paddle.fluid as fluid
+
+
+def train(network, dict_dim, lr, save_dirname, training_data_dirname, pass_num,
+ thread_num, batch_size):
+ file_names = os.listdir(training_data_dirname)
+ filelist = []
+ for i in range(0, len(file_names)):
+ if file_names[i] == 'data_feed.proto':
+ continue
+ filelist.append(os.path.join(training_data_dirname, file_names[i]))
+
+ dataset = fluid.DataFeedDesc(
+ os.path.join(training_data_dirname, 'data_feed.proto'))
+ dataset.set_batch_size(
+ batch_size) # datafeed should be assigned a batch size
+ dataset.set_use_slots(['words', 'label'])
+
+ data = fluid.layers.data(
+ name="words", shape=[1], dtype="int64", lod_level=1)
+ label = fluid.layers.data(name="label", shape=[1], dtype="int64")
+
+ avg_cost, acc, prediction = network(data, label, dict_dim)
+ optimizer = fluid.optimizer.Adagrad(learning_rate=lr)
+ opt_ops, weight_and_grad = optimizer.minimize(avg_cost)
+
+ startup_program = fluid.default_startup_program()
+ main_program = fluid.default_main_program()
+
+ place = fluid.CPUPlace()
+ executor = fluid.Executor(place)
+ executor.run(startup_program)
+
+ async_executor = fluid.AsyncExecutor(place)
+ for i in range(pass_num):
+ pass_start = time.time()
+ async_executor.run(main_program,
+ dataset,
+ filelist,
+ thread_num, [acc],
+ debug=False)
+ print('pass_id: %u pass_time_cost %f' % (i, time.time() - pass_start))
+ fluid.io.save_inference_model('%s/epoch%d.model' % (save_dirname, i),
+ [data.name, label.name], [acc], executor)
+
+
+if __name__ == "__main__":
+ if __package__ is None:
+ from os import sys, path
+ sys.path.append(
+ os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
+
+ from nets import bow_net, cnn_net, lstm_net, gru_net
+ from utils import load_vocab
+
+ batch_size = 4
+ lr = 0.002
+ pass_num = 30
+ save_dirname = ""
+ thread_num = multiprocessing.cpu_count()
+
+ if sys.argv[1] == "bow":
+ network = bow_net
+ batch_size = 128
+ save_dirname = "bow_model"
+ elif sys.argv[1] == "cnn":
+ network = cnn_net
+ lr = 0.01
+ save_dirname = "cnn_model"
+ elif sys.argv[1] == "lstm":
+ network = lstm_net
+ lr = 0.05
+ save_dirname = "lstm_model"
+ elif sys.argv[1] == "gru":
+ network = gru_net
+ batch_size = 128
+ lr = 0.05
+ save_dirname = "gru_model"
+
+ training_data_dirname = 'train_data/'
+ if len(sys.argv) == 3:
+ training_data_dirname = sys.argv[2]
+
+ if len(sys.argv) == 4:
+ if thread_num >= int(sys.argv[3]):
+ thread_num = int(sys.argv[3])
+
+ vocab = load_vocab('imdb.vocab')
+ dict_dim = len(vocab)
+
+ train(network, dict_dim, lr, save_dirname, training_data_dirname, pass_num,
+ thread_num, batch_size)
diff --git a/fluid/PaddleNLP/text_classification/train.py b/fluid/PaddleNLP/text_classification/train.py
index 159266f3956b950afa200e9f53c9fdc6c36309aa..174636f06ec5fe07180347745f910166140e9eed 100644
--- a/fluid/PaddleNLP/text_classification/train.py
+++ b/fluid/PaddleNLP/text_classification/train.py
@@ -89,7 +89,7 @@ def train(train_reader,
def train_net():
word_dict, train_reader, test_reader = utils.prepare_data(
- "imdb", self_dict=False, batch_size=4, buf_size=50000)
+ "imdb", self_dict=False, batch_size=128, buf_size=50000)
if sys.argv[1] == "bow":
train(
diff --git a/fluid/PaddleNLP/text_matching_on_quora/.run_ce.sh b/fluid/PaddleNLP/text_matching_on_quora/.run_ce.sh
new file mode 100755
index 0000000000000000000000000000000000000000..f1bb7febd3f2c572544612baf24be14c711108e3
--- /dev/null
+++ b/fluid/PaddleNLP/text_matching_on_quora/.run_ce.sh
@@ -0,0 +1,14 @@
+#!/bin/bash
+
+export MKL_NUM_THREADS=1
+export OMP_NUM_THREADS=1
+
+cudaid=${text_matching_on_quora:=0} # use 0-th card as default
+export CUDA_VISIBLE_DEVICES=$cudaid
+
+FLAGS_benchmark=true python train_and_evaluate.py --model_name=cdssmNet --config=cdssm_base --enable_ce --epoch_num=5 | python _ce.py
+
+cudaid=${text_matching_on_quora_m:=0,1,2,3} # use 0,1,2,3 card as default
+export CUDA_VISIBLE_DEVICES=$cudaid
+
+FLAGS_benchmark=true python train_and_evaluate.py --model_name=cdssmNet --config=cdssm_base --enable_ce --epoch_num=5 | python _ce.py
diff --git a/fluid/PaddleNLP/text_matching_on_quora/__init__.py b/fluid/PaddleNLP/text_matching_on_quora/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/fluid/PaddleNLP/text_matching_on_quora/_ce.py b/fluid/PaddleNLP/text_matching_on_quora/_ce.py
new file mode 100644
index 0000000000000000000000000000000000000000..eadeb821da6f7049d1916a65a1ae4eb995c5cb6d
--- /dev/null
+++ b/fluid/PaddleNLP/text_matching_on_quora/_ce.py
@@ -0,0 +1,65 @@
+# this file is only used for continuous evaluation test!
+
+import os
+import sys
+sys.path.append(os.environ['ceroot'])
+from kpi import CostKpi
+from kpi import DurationKpi
+
+
+each_pass_duration_card1_kpi = DurationKpi('each_pass_duration_card1', 0.08, 0, actived=True)
+train_avg_cost_card1_kpi = CostKpi('train_avg_cost_card1', 0.08, 0)
+train_avg_acc_card1_kpi = CostKpi('train_avg_acc_card1', 0.02, 0)
+each_pass_duration_card4_kpi = DurationKpi('each_pass_duration_card4', 0.08, 0, actived=True)
+train_avg_cost_card4_kpi = CostKpi('train_avg_cost_card4', 0.08, 0)
+train_avg_acc_card4_kpi = CostKpi('train_avg_acc_card4', 0.02, 0)
+
+tracking_kpis = [
+ each_pass_duration_card1_kpi,
+ train_avg_cost_card1_kpi,
+ train_avg_acc_card1_kpi,
+ each_pass_duration_card4_kpi,
+ train_avg_cost_card4_kpi,
+ train_avg_acc_card4_kpi,
+ ]
+
+
+def parse_log(log):
+ '''
+ This method should be implemented by model developers.
+
+ The suggestion:
+
+ each line in the log should be key, value, for example:
+
+ "
+ train_cost\t1.0
+ test_cost\t1.0
+ train_cost\t1.0
+ train_cost\t1.0
+ train_acc\t1.2
+ "
+ '''
+ for line in log.split('\n'):
+ fs = line.strip().split('\t')
+ print(fs)
+ if len(fs) == 3 and fs[0] == 'kpis':
+ kpi_name = fs[1]
+ kpi_value = float(fs[2])
+ yield kpi_name, kpi_value
+
+
+def log_to_ce(log):
+ kpi_tracker = {}
+ for kpi in tracking_kpis:
+ kpi_tracker[kpi.name] = kpi
+
+ for (kpi_name, kpi_value) in parse_log(log):
+ print(kpi_name, kpi_value)
+ kpi_tracker[kpi_name].add_record(kpi_value)
+ kpi_tracker[kpi_name].persist()
+
+
+if __name__ == '__main__':
+ log = sys.stdin.read()
+ log_to_ce(log)
diff --git a/fluid/PaddleNLP/text_matching_on_quora/pretrained_word2vec.py b/fluid/PaddleNLP/text_matching_on_quora/pretrained_word2vec.py
index a3f3422e8bb2d4065978d43378e6c607b00141a4..cda934d3402d1fd05432e75e3b7bfd0a1bd4ad2c 100755
--- a/fluid/PaddleNLP/text_matching_on_quora/pretrained_word2vec.py
+++ b/fluid/PaddleNLP/text_matching_on_quora/pretrained_word2vec.py
@@ -21,7 +21,12 @@ import numpy as np
import time, datetime
import os, sys
-
+def maybe_open(filepath):
+ if sys.version_info <= (3, 0): # for python2
+ return open(filepath, 'r')
+ else:
+ return open(filepath, 'r', encoding="utf-8")
+
def Glove840B_300D(filepath, keys=None):
"""
input: the "glove.840B.300d.txt" file path
@@ -33,7 +38,7 @@ def Glove840B_300D(filepath, keys=None):
print("please wait for a minute.")
start = time.time()
word2vec = {}
- with open(filepath, "r") as f:
+ with maybe_open(filepath) as f:
for line in f:
if sys.version_info <= (3, 0): # for python2
line = line.decode('utf-8')
diff --git a/fluid/PaddleNLP/text_matching_on_quora/quora_question_pairs.py b/fluid/PaddleNLP/text_matching_on_quora/quora_question_pairs.py
index d27fa4fdeb598b84fa2069df01951158f78b1834..4a1694929dc9a5a1d78bce2f99be04de0f1ba8e5 100755
--- a/fluid/PaddleNLP/text_matching_on_quora/quora_question_pairs.py
+++ b/fluid/PaddleNLP/text_matching_on_quora/quora_question_pairs.py
@@ -68,8 +68,10 @@ def maybe_open(file_name):
" |- readme.txt\n"
" |- wordvec.txt\n")
raise RuntimeError(msg)
-
- return open(file_name, 'r')
+ if sys.version_info <= (3, 0): # for python2
+ return open(file_name, 'r')
+ else:
+ return open(file_name, 'r', encoding="utf-8")
def tokenized_question_pairs(file_name):
diff --git a/fluid/PaddleNLP/text_matching_on_quora/train_and_evaluate.py b/fluid/PaddleNLP/text_matching_on_quora/train_and_evaluate.py
index 0cca171933fac9dfc47baaf45d551b65d69c2f7a..0f88c6b6ef13aec25e08527b7efabe8638a3af25 100755
--- a/fluid/PaddleNLP/text_matching_on_quora/train_and_evaluate.py
+++ b/fluid/PaddleNLP/text_matching_on_quora/train_and_evaluate.py
@@ -33,6 +33,8 @@ parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument('--model_name', type=str, default='cdssmNet', help="Which model to train")
parser.add_argument('--config', type=str, default='cdssm_base', help="The global config setting")
+parser.add_argument('--enable_ce', action='store_true', help='If set, run the task with continuous evaluation logs.')
+parser.add_argument('--epoch_num', type=int, help='Number of epoch')
DATA_DIR = os.path.join(os.path.expanduser('~'), '.cache/paddle/dataset')
@@ -139,6 +141,13 @@ def train_and_evaluate(train_reader,
else:
feeder = fluid.DataFeeder(feed_list=[q1, q2, mask1, mask2, label], place=place)
+ # only for ce
+ args = parser.parse_args()
+ if args.enable_ce:
+ SEED = 102
+ fluid.default_startup_program().random_seed = SEED
+ fluid.default_main_program().random_seed = SEED
+
# logging param info
for param in fluid.default_main_program().global_block().all_parameters():
print("param name: %s; param shape: %s" % (param.name, param.shape))
@@ -167,8 +176,10 @@ def train_and_evaluate(train_reader,
metric_type=global_config.metric_type)
# start training
+ total_time = 0.0
print("[%s] Start Training" % time.asctime(time.localtime(time.time())))
for epoch_id in range(global_config.epoch_num):
+
data_size, data_count, total_acc, total_cost = 0, 0, 0.0, 0.0
batch_id = 0
epoch_begin_time = time.time()
@@ -177,8 +188,8 @@ def train_and_evaluate(train_reader,
feed=feeder.feed(data),
fetch_list=[cost, acc])
data_size = len(data)
- total_acc += data_size * avg_acc_np
- total_cost += data_size * avg_cost_np
+ total_acc += data_size * avg_acc_np[0]
+ total_cost += data_size * avg_cost_np[0]
data_count += data_size
if batch_id % 100 == 0:
print("[%s] epoch_id: %d, batch_id: %d, cost: %f, acc: %f" % (
@@ -188,16 +199,30 @@ def train_and_evaluate(train_reader,
avg_cost_np,
avg_acc_np))
batch_id += 1
-
avg_cost = total_cost / data_count
avg_acc = total_acc / data_count
-
+ epoch_end_time = time.time()
+ total_time += epoch_end_time - epoch_begin_time
+
print("")
print("[%s] epoch_id: %d, train_avg_cost: %f, train_avg_acc: %f, epoch_time_cost: %f" % (
time.asctime( time.localtime(time.time())),
epoch_id, avg_cost, avg_acc,
time.time() - epoch_begin_time))
+ # only for ce
+ if epoch_id == global_config.epoch_num - 1 and args.enable_ce:
+ #Note: The following logs are special for CE monitoring.
+ #Other situations do not need to care about these logs.
+ gpu_num = get_cards(args)
+ print("kpis\teach_pass_duration_card%s\t%s" % \
+ (gpu_num, total_time / (global_config.epoch_num)))
+ print("kpis\ttrain_avg_cost_card%s\t%s" %
+ (gpu_num, avg_cost))
+ print("kpis\ttrain_avg_acc_card%s\t%s" %
+ (gpu_num, avg_acc))
+
+
epoch_model = global_config.save_dirname + "/" + "epoch" + str(epoch_id)
fluid.io.save_inference_model(epoch_model, ["question1", "question2", "label"], acc, exe)
@@ -217,6 +242,9 @@ def main():
args = parser.parse_args()
global_config = configs.__dict__[args.config]()
+ if args.epoch_num is not None:
+ global_config.epoch_num = args.epoch_num
+
print("net_name: ", args.model_name)
net = models.__dict__[args.model_name](global_config)
@@ -267,5 +295,15 @@ def main():
use_cuda=global_config.use_cuda,
parallel=False)
+
+def get_cards(args):
+ if args.enable_ce:
+ cards = os.environ.get('CUDA_VISIBLE_DEVICES')
+ num = len(cards.split(","))
+ return num
+ else:
+ return args.num_devices
+
+
if __name__ == "__main__":
main()
diff --git a/fluid/PaddleRec/ctr/data/download.sh b/fluid/PaddleRec/ctr/data/download.sh
index 466a22f2c6cc885cea0a1468f3043cb59c611b59..f5c301ee48f6ef10bf80079a6820056571c4dbc7 100755
--- a/fluid/PaddleRec/ctr/data/download.sh
+++ b/fluid/PaddleRec/ctr/data/download.sh
@@ -1,6 +1,6 @@
#!/bin/bash
-wget --no-check-certificate https://s3-eu-west-1.amazonaws.com/criteo-labs/dac.tar.gz
+wget --no-check-certificate https://s3-eu-west-1.amazonaws.com/kaggle-display-advertising-challenge-dataset/dac.tar.gz
tar zxf dac.tar.gz
rm -f dac.tar.gz
diff --git a/fluid/PaddleRec/ctr/network_conf.py b/fluid/PaddleRec/ctr/network_conf.py
index fa7dc11b00941c75453ef6165a74f61ad772d1bf..54dd855928d0a25dff99c0b5165c1b1343732138 100644
--- a/fluid/PaddleRec/ctr/network_conf.py
+++ b/fluid/PaddleRec/ctr/network_conf.py
@@ -78,7 +78,7 @@ def ctr_deepfm_model(factor_size, sparse_feature_dim, dense_feature_dim, sparse_
param_attr=sparse_fm_param_attr, is_sparse=True)
return fluid.layers.sequence_pool(input=emb, pool_type='average')
- sparse_embed_seq = map(embedding_layer, sparse_input_ids)
+ sparse_embed_seq = list(map(embedding_layer, sparse_input_ids))
concated = fluid.layers.concat(sparse_embed_seq + [dense_input], axis=1)
fc1 = fluid.layers.fc(input=concated, size=400, act='relu',
param_attr=fluid.ParamAttr(initializer=fluid.initializer.Normal(
@@ -134,7 +134,7 @@ def ctr_dnn_model(embedding_size, sparse_feature_dim):
use_double_buffer=True)
words = fluid.layers.read_file(py_reader)
- sparse_embed_seq = map(embedding_layer, words[1:-1])
+ sparse_embed_seq = list(map(embedding_layer, words[1:-1]))
concated = fluid.layers.concat(sparse_embed_seq + words[0:1], axis=1)
fc1 = fluid.layers.fc(input=concated, size=400, act='relu',
diff --git a/fluid/PaddleRec/ctr/preprocess.py b/fluid/PaddleRec/ctr/preprocess.py
index bf5673ba9f2d441c6f0e09196282a444147c682d..e6fc456c3547947ae425ff42161ff075dbfae65f 100755
--- a/fluid/PaddleRec/ctr/preprocess.py
+++ b/fluid/PaddleRec/ctr/preprocess.py
@@ -51,7 +51,7 @@ class CategoryDictGenerator:
return res
def dicts_sizes(self):
- return map(len, self.dicts)
+ return list(map(len, self.dicts))
class ContinuousFeatureGenerator:
@@ -61,8 +61,8 @@ class ContinuousFeatureGenerator:
def __init__(self, num_feature):
self.num_feature = num_feature
- self.min = [sys.maxint] * num_feature
- self.max = [-sys.maxint] * num_feature
+ self.min = [sys.maxsize] * num_feature
+ self.max = [-sys.maxsize] * num_feature
def build(self, datafile, continous_features):
with open(datafile, 'r') as f:
diff --git a/fluid/PaddleRec/gru4rec/README.md b/fluid/PaddleRec/gru4rec/README.md
index e94aea661737c8cba696ae96ccdd54fd2f974f1a..0ea3f838eaf9e2f46b7d1551a36aa1f6b462ce44 100644
--- a/fluid/PaddleRec/gru4rec/README.md
+++ b/fluid/PaddleRec/gru4rec/README.md
@@ -32,7 +32,15 @@ An introduction to the GRU4REC model can be found in the paper [Session-based Recommendations with Recu
Session-based recommendation applies to a wide range of scenarios, such as sequences of users' product views, news clicks, and location check-ins.
-Three loss functions are supported: full-vocabulary cross-entropy, Bayesian Pairwise Ranking over sampled negatives, and Cross-entropy over sampled negatives.
+Three loss functions are supported: full-vocabulary cross-entropy, negative-sampling Bayesian Pairwise Ranking, and negative-sampling Cross-entropy.
+
+We largely reproduced the results of the paper; the recall@20 scores are:
+
+full-vocabulary cross entropy : 0.67
+
+negative-sampling bpr : 0.606
+
+negative-sampling cross entropy : 0.605
To run the sample program, the 'RSC15 data download and preprocessing' section can be skipped.
@@ -113,31 +121,42 @@ python text2paddle.py raw_train_data/ raw_test_data/ train_data test_data vocab.
```
## Training
-'--use_cuda 1' means training on GPU (default: CPU); '--parallel 1' means using multiple cards (default: single card)
The detailed options can be listed by running
```
python train.py -h
```
+Training with full-vocabulary cross entropy:
-GPU environment
-Run the following command to start training.
+GPU training (single machine, single card)
+``` bash
+CUDA_VISIBLE_DEVICES=0 python train.py --train_dir train_data --use_cuda 1 --batch_size 50 --model_dir model_output
```
-CUDA_VISIBLE_DEVICES=0 python train.py --train_dir train_data/ --use_cuda 1
+
+CPU training (single machine)
+``` bash
+python train.py --train_dir train_data --use_cuda 0 --batch_size 50 --model_dir model_output
```
-CPU environment
-Run the following command to start training.
+
+GPU training (single machine, multiple cards)
+``` bash
+CUDA_VISIBLE_DEVICES=0,1 python train.py --train_dir train_data --use_cuda 1 --parallel 1 --batch_size 50 --model_dir model_output --num_devices 2
```
-python train.py --train_dir train_data/
+
+CPU training (single machine, multiple devices)
+``` bash
+CPU_NUM=10 python train.py --train_dir train_data --use_cuda 0 --parallel 1 --batch_size 50 --model_dir model_output --num_devices 10
```
-Training with bayesian pairwise ranking loss (bpr loss)
+Training with negative-sampling bayesian pairwise ranking loss (bpr loss)
```
CUDA_VISIBLE_DEVICES=0 python train_sample_neg.py --loss bpr --use_cuda 1
```
-
-Note that when running a single-machine multi-device job (--parallel 1) on CPU, batch_size should be larger than the number of CPU cores.
+Training with negative-sampling cross entropy
+```
+CUDA_VISIBLE_DEVICES=0 python train_sample_neg.py --loss ce --use_cuda 1
+```
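+
+For reference, the bpr objective can be sketched as follows; this is a minimal illustration assuming hypothetical `pos_score`/`neg_score` tensors for the positive item and a sampled negative item (the actual loss used by train_sample_neg.py is defined in net.py):
+``` python
+import paddle.fluid as fluid
+
+def bpr_loss(pos_score, neg_score):
+    # maximize log(sigmoid(pos - neg)), i.e. rank the positive item higher
+    diff = fluid.layers.elementwise_sub(pos_score, neg_score)
+    logsig = fluid.layers.log(fluid.layers.sigmoid(diff))
+    return fluid.layers.scale(fluid.layers.mean(logsig), scale=-1.0)
+```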
## Customizing the network structure
diff --git a/fluid/PaddleRec/gru4rec/infer_sample_neg.py b/fluid/PaddleRec/gru4rec/infer_sample_neg.py
index 26c7d8eeb44d8da35e5c99bc4018090fa1698db8..0915fe18d571ba459930960d7a39735dc075c930 100644
--- a/fluid/PaddleRec/gru4rec/infer_sample_neg.py
+++ b/fluid/PaddleRec/gru4rec/infer_sample_neg.py
@@ -9,7 +9,6 @@ import six
import paddle.fluid as fluid
import paddle
import net
-
import utils
diff --git a/fluid/PaddleRec/gru4rec/net.py b/fluid/PaddleRec/gru4rec/net.py
index 8369229258d48726eb0dcc31ec9d9eb400e08856..ebb512377eae865b90f3d0360931a744b1a0ad07 100644
--- a/fluid/PaddleRec/gru4rec/net.py
+++ b/fluid/PaddleRec/gru4rec/net.py
@@ -178,7 +178,7 @@ def train_cross_entropy_network(vocab_size, neg_size, hid_size, drop_out=0.2):
return src, pos_label, label, cost_sum
-def infer_bpr_network(vocab_size, batch_size, hid_size, dropout=0.2):
+def infer_network(vocab_size, batch_size, hid_size, dropout=0.2):
src = fluid.layers.data(name="src", shape=[1], dtype="int64", lod_level=1)
emb_src = fluid.layers.embedding(
input=src, size=[vocab_size, hid_size], param_attr="emb")
diff --git a/fluid/PaddleRec/ssr/README.md b/fluid/PaddleRec/ssr/README.md
index 034be994d9000591c59ca08feda54d4a39d147af..a9334a70b39f62dc4fa1fc144a0316280bfdc1ef 100644
--- a/fluid/PaddleRec/ssr/README.md
+++ b/fluid/PaddleRec/ssr/README.md
@@ -3,31 +3,47 @@
## Introduction
In news recommendation scenarios, different from traditional systems that recommend entertainment items such as movies or music, there are several new problems to solve.
- User profile features are very sparse: a user may log in to a news recommendation app anonymously, and users tend to read fresh news items.
-- News are generated or disappeared very fast compare with movies or musics. Usually, there will be thousands of news generated in a news recommendation app. The Consumption of news is also fast since users care about newly happened things.
+- News items are generated and disappear much faster than movies or music; a news recommendation app typically produces thousands of items, and they are consumed quickly because users care about what has just happened.
- User interests may change frequently in the news recommendation setting. The content of a news item strongly affects users' reading behavior, even when its category does not belong to users' long-term interests. In news recommendation, reading behavior is determined by both the short-term and the long-term interests of users.
[GRU4Rec](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleRec/gru4rec) models a user's short-term and long-term interest by applying a gated recurrent unit to the user's reading history. The generalization ability of the recurrent neural network captures the similarity of users' reading sequences, which alleviates the user profile sparsity problem. However, GRU4Rec operates on a closed domain of items, predicting through classification which item a user will be interested in, so it cannot predict items that do not exist in the training dataset.
Sequence Semantic Retrieval (SSR) shares a similar idea with Multi-Rate Deep Learning for Temporal Recommendation, SIGIR 2016. The SSR model has two components: the matching model part and the retrieval part.
-- The idea of SSR is to model a user's personalized interest of an item through matching model structure, and the representation of a news item can be computed online even the news item does not exist in training dataset.
+- The idea of SSR is to model a user's personalized interest in an item through a matching model structure, so that the representation of a news item can be computed online even when the item does not exist in the training dataset.
- With the representations of news items, we are able to build a vector indexing service online for news prediction, which is the retrieval part of SSR.
## Dataset
Dataset preprocessing follows the method of [GRU4Rec Project](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleRec/gru4rec). Note that you should reuse scripts from GRU4Rec project for data preprocessing.
## Training
-Before training, you should set PYTHONPATH environment
+
+The command line options for training can be listed by `python train.py -h`
+
+GPU single-node single-card training:
+``` bash
+CUDA_VISIBLE_DEVICES=0 python train.py --train_dir train_data --use_cuda 1 --batch_size 50 --model_dir model_output
```
-export PYTHONPATH=./models/fluid:$PYTHONPATH
+
+CPU single-node training:
+``` bash
+python train.py --train_dir train_data --use_cuda 0 --batch_size 50 --model_dir model_output
```
-The command line options for training can be listed by `python train.py -h`
+GPU single-node multi-card training:
``` bash
-python train.py --train_file rsc15_train_tr_paddle.txt
+CUDA_VISIBLE_DEVICES=0,1 python train.py --train_dir train_data --use_cuda 1 --parallel 1 --batch_size 50 --model_dir model_output --num_devices 2
```
-## Build Index
-TBA
+CPU single-node multi-card training:
+``` bash
+CPU_NUM=10 python train.py --train_dir train_data --use_cuda 0 --parallel 1 --batch_size 50 --model_dir model_output --num_devices 10
+```
+
+For multi-node training, refer to the configuration under fluid/PaddleRec/gru4rec.
-## Retrieval
-TBA
+## Inference
+
+GPU inference:
+``` bash
+CUDA_VISIBLE_DEVICES=0 python infer.py --test_dir test_data --use_cuda 1 --batch_size 50 --model_dir model_output
+```
diff --git a/fluid/PaddleRec/ssr/infer.py b/fluid/PaddleRec/ssr/infer.py
new file mode 100644
index 0000000000000000000000000000000000000000..d5c9ee1b5dc95eb403932e0ff7534bfadc7568d7
--- /dev/null
+++ b/fluid/PaddleRec/ssr/infer.py
@@ -0,0 +1,134 @@
+import argparse
+import time
+import numpy as np
+import paddle.fluid as fluid
+import utils
+import nets as net
+
+
+def parse_args():
+ parser = argparse.ArgumentParser("ssr benchmark.")
+    parser.add_argument(
+        '--test_dir', type=str, default='test_data', help='test data directory')
+    parser.add_argument(
+        '--vocab_path', type=str, default='vocab.txt', help='vocab path')
+    parser.add_argument(
+        '--start_index', type=int, default=1, help='start epoch index')
+    parser.add_argument(
+        '--last_index', type=int, default=10, help='last epoch index')
+    parser.add_argument(
+        '--model_dir', type=str, default='model_output', help='model dir')
+    parser.add_argument(
+        '--use_cuda', type=int, default=0, help='whether to use cuda')
+    parser.add_argument(
+        '--batch_size', type=int, default=50, help='batch size')
+    parser.add_argument(
+        '--hid_size', type=int, default=128, help='hidden size')
+    parser.add_argument(
+        '--emb_size', type=int, default=128, help='embedding size')
+ args = parser.parse_args()
+ return args
+
+
+def model(vocab_size, emb_size, hidden_size):
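+    # Evaluation network: encode the user's click sequence with a GRU,
+    # score it against every item in the vocab by cosine similarity, and
+    # measure top-20 accuracy (recall@20) against the true next item.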
+ user_data = fluid.layers.data(
+ name="user", shape=[1], dtype="int64", lod_level=1)
+ all_item_data = fluid.layers.data(
+ name="all_item", shape=[vocab_size, 1], dtype="int64")
+
+ user_emb = fluid.layers.embedding(
+ input=user_data, size=[vocab_size, emb_size], param_attr="emb.item")
+ all_item_emb = fluid.layers.embedding(
+ input=all_item_data, size=[vocab_size, emb_size], param_attr="emb.item")
+ all_item_emb_re = fluid.layers.reshape(x=all_item_emb, shape=[-1, emb_size])
+
+ user_encoder = net.GrnnEncoder(hidden_size=hidden_size)
+ user_enc = user_encoder.forward(user_emb)
+ user_hid = fluid.layers.fc(input=user_enc,
+ size=hidden_size,
+ param_attr='user.w',
+ bias_attr="user.b")
+ user_exp = fluid.layers.expand(x=user_hid, expand_times=[1, vocab_size])
+ user_re = fluid.layers.reshape(x=user_exp, shape=[-1, hidden_size])
+
+ all_item_hid = fluid.layers.fc(input=all_item_emb_re,
+ size=hidden_size,
+ param_attr='item.w',
+ bias_attr="item.b")
+ cos_item = fluid.layers.cos_sim(X=all_item_hid, Y=user_re)
+ all_pre_ = fluid.layers.reshape(x=cos_item, shape=[-1, vocab_size])
+ pos_label = fluid.layers.data(name="pos_label", shape=[1], dtype="int64")
+ acc = fluid.layers.accuracy(input=all_pre_, label=pos_label, k=20)
+ return acc
+
+
+def infer(args, vocab_size, test_reader):
+ """ inference function """
+ place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
+ exe = fluid.Executor(place)
+ emb_size = args.emb_size
+ hid_size = args.hid_size
+ batch_size = args.batch_size
+ model_path = args.model_dir
+ with fluid.scope_guard(fluid.core.Scope()):
+ main_program = fluid.Program()
+ start_up_program = fluid.Program()
+ with fluid.program_guard(main_program, start_up_program):
+ acc = model(vocab_size, emb_size, hid_size)
+        for epoch in range(args.start_index, args.last_index + 1):
+            copy_program = main_program.clone()
+            model_path = args.model_dir + "/epoch_" + str(epoch)
+ fluid.io.load_params(
+ executor=exe, dirname=model_path, main_program=copy_program)
+ accum_num_recall = 0.0
+ accum_num_sum = 0.0
+ t0 = time.time()
+ step_id = 0
+ for data in test_reader():
+ step_id += 1
+ user_data, pos_label = utils.infer_data(data, place)
+ all_item_numpy = np.tile(
+ np.arange(vocab_size), len(pos_label)).reshape(
+ len(pos_label), vocab_size, 1)
+ para = exe.run(copy_program,
+ feed={
+ "user": user_data,
+ "all_item": all_item_numpy,
+ "pos_label": pos_label
+ },
+ fetch_list=[acc.name],
+ return_numpy=False)
+
+ acc_ = para[0]._get_float_element(0)
+ data_length = len(
+ np.concatenate(
+ pos_label, axis=0).astype("int64"))
+ accum_num_sum += (data_length)
+ accum_num_recall += (data_length * acc_)
+ if step_id % 1 == 0:
+ print("step:%d " % (step_id),
+ accum_num_recall / accum_num_sum)
+ t1 = time.time()
+ print("model:%s recall@20:%.3f time_cost(s):%.2f" %
+ (model_path, accum_num_recall / accum_num_sum, t1 - t0))
+
+
+if __name__ == "__main__":
+ args = parse_args()
+ start_index = args.start_index
+ last_index = args.last_index
+ test_dir = args.test_dir
+ model_dir = args.model_dir
+ batch_size = args.batch_size
+ vocab_path = args.vocab_path
+ use_cuda = True if args.use_cuda else False
+ print("start index: ", start_index, " last_index:", last_index)
+ test_reader, vocab_size = utils.construct_test_data(
+ test_dir, vocab_path, batch_size=args.batch_size)
+ infer(args, vocab_size, test_reader=test_reader)
diff --git a/fluid/PaddleRec/ssr/nets.py b/fluid/PaddleRec/ssr/nets.py
index 278cb8fdde2d63e1e5675c1dbdcfb11152116e73..4df23573c91fcf16a4ef95d1bab1ac01e437d148 100644
--- a/fluid/PaddleRec/ssr/nets.py
+++ b/fluid/PaddleRec/ssr/nets.py
@@ -17,35 +17,60 @@ import paddle.fluid.layers.nn as nn
import paddle.fluid.layers.tensor as tensor
import paddle.fluid.layers.control_flow as cf
import paddle.fluid.layers.io as io
-from PaddleRec.multiview_simnet.nets import BowEncoder
-from PaddleRec.multiview_simnet.nets import GrnnEncoder
+
+
+class BowEncoder(object):
+ """ bow-encoder """
+
+ def __init__(self):
+ self.param_name = ""
+
+ def forward(self, emb):
+ return nn.sequence_pool(input=emb, pool_type='sum')
+
+
+class GrnnEncoder(object):
+ """ grnn-encoder """
+
+ def __init__(self, param_name="grnn", hidden_size=128):
+ self.param_name = param_name
+ self.hidden_size = hidden_size
+
+ def forward(self, emb):
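+        # dynamic_gru expects its input already projected to 3 * hidden_size
+        # (update gate, reset gate and candidate activations)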
+ fc0 = nn.fc(input=emb,
+ size=self.hidden_size * 3,
+ param_attr=self.param_name + "_fc.w",
+ bias_attr=False)
+
+ gru_h = nn.dynamic_gru(
+ input=fc0,
+ size=self.hidden_size,
+ is_reverse=False,
+ param_attr=self.param_name + ".param",
+ bias_attr=self.param_name + ".bias")
+ return nn.sequence_pool(input=gru_h, pool_type='max')
class PairwiseHingeLoss(object):
def __init__(self, margin=0.8):
self.margin = margin
+
def forward(self, pos, neg):
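+        # pairwise hinge loss: max(0, margin - pos + neg), pushing the
+        # positive score above the negative one by at least the margin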
loss_part1 = nn.elementwise_sub(
tensor.fill_constant_batch_size_like(
- input=pos,
- shape=[-1, 1],
- value=self.margin,
- dtype='float32'),
+ input=pos, shape=[-1, 1], value=self.margin, dtype='float32'),
pos)
loss_part2 = nn.elementwise_add(loss_part1, neg)
loss_part3 = nn.elementwise_max(
tensor.fill_constant_batch_size_like(
- input=loss_part2,
- shape=[-1, 1],
- value=0.0,
- dtype='float32'),
+ input=loss_part2, shape=[-1, 1], value=0.0, dtype='float32'),
loss_part2)
return loss_part3
class SequenceSemanticRetrieval(object):
""" sequence semantic retrieval model """
-
+
def __init__(self, embedding_size, embedding_dim, hidden_size):
self.embedding_size = embedding_size
self.embedding_dim = embedding_dim
@@ -54,48 +79,44 @@ class SequenceSemanticRetrieval(object):
self.user_encoder = GrnnEncoder(hidden_size=hidden_size)
self.item_encoder = BowEncoder()
self.pairwise_hinge_loss = PairwiseHingeLoss()
-
+
def get_correct(self, x, y):
less = tensor.cast(cf.less_than(x, y), dtype='float32')
correct = nn.reduce_sum(less)
return correct
def train(self):
- user_data = io.data(
- name="user", shape=[1], dtype="int64", lod_level=1
- )
+ user_data = io.data(name="user", shape=[1], dtype="int64", lod_level=1)
pos_item_data = io.data(
- name="p_item", shape=[1], dtype="int64", lod_level=1
- )
+ name="p_item", shape=[1], dtype="int64", lod_level=1)
neg_item_data = io.data(
- name="n_item", shape=[1], dtype="int64", lod_level=1
- )
+ name="n_item", shape=[1], dtype="int64", lod_level=1)
user_emb = nn.embedding(
- input=user_data, size=self.emb_shape, param_attr="emb.item"
- )
+ input=user_data, size=self.emb_shape, param_attr="emb.item")
pos_item_emb = nn.embedding(
- input=pos_item_data, size=self.emb_shape, param_attr="emb.item"
- )
+ input=pos_item_data, size=self.emb_shape, param_attr="emb.item")
neg_item_emb = nn.embedding(
- input=neg_item_data, size=self.emb_shape, param_attr="emb.item"
- )
+ input=neg_item_data, size=self.emb_shape, param_attr="emb.item")
user_enc = self.user_encoder.forward(user_emb)
pos_item_enc = self.item_encoder.forward(pos_item_emb)
neg_item_enc = self.item_encoder.forward(neg_item_emb)
- user_hid = nn.fc(
- input=user_enc, size=self.hidden_size, param_attr='user.w', bias_attr="user.b"
- )
- pos_item_hid = nn.fc(
- input=pos_item_enc, size=self.hidden_size, param_attr='item.w', bias_attr="item.b"
- )
- neg_item_hid = nn.fc(
- input=neg_item_enc, size=self.hidden_size, param_attr='item.w', bias_attr="item.b"
- )
+ user_hid = nn.fc(input=user_enc,
+ size=self.hidden_size,
+ param_attr='user.w',
+ bias_attr="user.b")
+ pos_item_hid = nn.fc(input=pos_item_enc,
+ size=self.hidden_size,
+ param_attr='item.w',
+ bias_attr="item.b")
+ neg_item_hid = nn.fc(input=neg_item_enc,
+ size=self.hidden_size,
+ param_attr='item.w',
+ bias_attr="item.b")
cos_pos = nn.cos_sim(user_hid, pos_item_hid)
cos_neg = nn.cos_sim(user_hid, neg_item_hid)
hinge_loss = self.pairwise_hinge_loss.forward(cos_pos, cos_neg)
avg_cost = nn.mean(hinge_loss)
correct = self.get_correct(cos_neg, cos_pos)
- return [user_data, pos_item_data, neg_item_data], \
- pos_item_hid, neg_item_hid, avg_cost, correct
+ return [user_data, pos_item_data,
+ neg_item_data], cos_pos, avg_cost, correct
diff --git a/fluid/PaddleRec/ssr/reader.py b/fluid/PaddleRec/ssr/reader.py
index 97e0ae8ec1cd4089b5b291ac7a4552b73ab231ee..15989fd8cec366b2c3b71672f134035c42bf79da 100644
--- a/fluid/PaddleRec/ssr/reader.py
+++ b/fluid/PaddleRec/ssr/reader.py
@@ -14,19 +14,22 @@
import random
+
class Dataset:
def __init__(self):
pass
+
class Vocab:
def __init__(self):
pass
+
class YoochooseVocab(Vocab):
def __init__(self):
self.vocab = {}
self.word_array = []
-
+
def load(self, filelist):
idx = 0
for f in filelist:
@@ -47,21 +50,16 @@ class YoochooseVocab(Vocab):
def _get_word_array(self):
return self.word_array
+
class YoochooseDataset(Dataset):
- def __init__(self, y_vocab):
- self.vocab_size = len(y_vocab.get_vocab())
- self.word_array = y_vocab._get_word_array()
- self.vocab = y_vocab.get_vocab()
+ def __init__(self, vocab_size):
+ self.vocab_size = vocab_size
def sample_neg(self):
return random.randint(0, self.vocab_size - 1)
def sample_neg_from_seq(self, seq):
return seq[random.randint(0, len(seq) - 1)]
-
- # TODO(guru4elephant): wait memory, should be improved
- def sample_from_word_freq(self):
- return self.word_array[random.randint(0, len(self.word_array) - 1)]
def _reader_creator(self, filelist, is_train):
def reader():
@@ -72,23 +70,20 @@ class YoochooseDataset(Dataset):
ids = line.strip().split()
if len(ids) <= 1:
continue
- conv_ids = [self.vocab[i] if i in self.vocab else 0 for i in ids]
- # random select an index as boundary
- # make ids before boundary as sequence
- # make id next to boundary right as target
- boundary = random.randint(1, len(ids) - 1)
+            conv_ids = [int(i) for i in ids]
+ boundary = len(ids) - 1
src = conv_ids[:boundary]
pos_tgt = [conv_ids[boundary]]
if is_train:
- neg_tgt = [self.sample_from_word_freq()]
+ neg_tgt = [self.sample_neg()]
yield [src, pos_tgt, neg_tgt]
else:
yield [src, pos_tgt]
+
return reader
def train(self, file_list):
return self._reader_creator(file_list, True)
-
+
def test(self, file_list):
return self._reader_creator(file_list, False)
-
diff --git a/fluid/PaddleRec/ssr/test_data/small_test.txt b/fluid/PaddleRec/ssr/test_data/small_test.txt
new file mode 100644
index 0000000000000000000000000000000000000000..b4bf7189643a041b769fc88c56f6b1ec5b5229db
--- /dev/null
+++ b/fluid/PaddleRec/ssr/test_data/small_test.txt
@@ -0,0 +1,100 @@
+0 16
+475 473 155
+491 21
+96 185 96
+29 14 13
+5 481 11 21 470
+70 5 70 11
+167 42 167 217
+72 15 73 161 172
+82 82
+97 297 97
+193 182 186 183 184 177 214
+152 152
+163 298 7
+39 73 71
+490 23 23 496 488 74 23 74 486 23 23 74
+17 17
+170 170 483 444 443 234
+25 472
+5 5 11 70 69
+149 149 455
+356 68 477 468 17 479 66
+159 172 6 71 6 6 158 13 494 169
+155 44 438 144 500
+156 9 9
+146 146
+173 10 10 461
+7 6 6
+269 48 268
+50 100
+323 174 18
+69 69 22 98
+38 171
+22 29 489 10
+0 0
+11 5
+29 13 14 232 231 451 289 452 229
+260 11 156
+166 160 166 39
+223 134 134 420
+66 401 68 132 17 84 287 5
+39 304
+65 84 132
+400 211
+145 144
+16 28 254 48 50 100 42 154 262 133 17
+0 0
+28 28
+11 476 464
+61 61 86 86
+38 38
+463 478
+437 265
+22 39 485 171 98
+434 51 344
+16 16
+67 67 67 448
+22 12 161
+15 377 147 147 374
+119 317 0
+38 484
+403 499
+432 442
+28 0 16 50 465 42
+163 487 7 162
+99 99 325 423 83 83
+154 133
+5 37 492 235 160 279
+10 10 457 493 10 460
+441 4 4 4 4 4 4 4
+153 153
+159 164 164
+328 37
+65 65 404 347 431 459
+80 80 44 44
+61 446
+162 495 7 453
+157 21 204 68 37 66 469 145
+37 151 230 206 240 205 264 87 409 87 288 270 280 329 157 296 454 474
+430 445 433
+449 14
+9 9 9 9
+440 238 226
+148 148
+266 267 181
+48 498
+263 255 256
+458 158 7
+72 168 12 165 71 73 173 49
+0 0
+7 7 6
+14 29 13 6 15 14 15 13
+480 439 21
+450 21 151
+12 12 49 14 13 165 12 169 72 15 15
+91 91
+22 12 49 168
+497 101 30 411 30 482 30 53 30 101 176 415 53 447
+462 150 150
+471 456 131 435 131 467 436 412 227 218 190 466 429 213 326
diff --git a/fluid/PaddleRec/ssr/train.py b/fluid/PaddleRec/ssr/train.py
index 33fe23e55795e47dea3e7f767016a8be4492a4d0..8ca5e8ee3b9d79af52a9e75c7540f1e104750b96 100644
--- a/fluid/PaddleRec/ssr/train.py
+++ b/fluid/PaddleRec/ssr/train.py
@@ -13,87 +13,108 @@
# limitations under the License.
import os
import sys
+import time
import argparse
import logging
import paddle.fluid as fluid
import paddle
-import reader as reader
+import utils
+import numpy as np
from nets import SequenceSemanticRetrieval
logging.basicConfig(format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger("fluid")
logger.setLevel(logging.INFO)
+
def parse_args():
parser = argparse.ArgumentParser("sequence semantic retrieval")
- parser.add_argument("--train_file", type=str, help="Training file")
- parser.add_argument("--valid_file", type=str, help="Validation file")
parser.add_argument(
- "--epochs", type=int, default=10, help="Number of epochs for training")
+ "--train_dir", type=str, default='train_data', help="Training file")
+ parser.add_argument(
+ "--base_lr", type=float, default=0.01, help="learning rate")
+ parser.add_argument(
+ '--vocab_path', type=str, default='vocab.txt', help='vocab file')
+ parser.add_argument(
+ "--epochs", type=int, default=10, help="Number of epochs")
+ parser.add_argument(
+        '--parallel', type=int, default=0, help='whether to run in parallel')
+ parser.add_argument(
+        '--use_cuda', type=int, default=0, help='whether to use gpu')
+ parser.add_argument(
+        '--print_batch', type=int, default=10, help='log every n batches')
parser.add_argument(
- "--model_output_dir",
- type=str,
- default='model_output',
- help="Model output folder")
+ '--model_dir', type=str, default='model_output', help='model dir')
parser.add_argument(
- "--sequence_encode_dim",
- type=int,
- default=128,
- help="Dimension of sequence encoder output")
+ "--hidden_size", type=int, default=128, help="hidden size")
parser.add_argument(
- "--matching_dim",
- type=int,
- default=128,
- help="Dimension of hidden layer")
+ "--batch_size", type=int, default=50, help="number of batch")
parser.add_argument(
- "--batch_size", type=int, default=128, help="Batch size for training")
+ "--embedding_dim", type=int, default=128, help="embedding dim")
parser.add_argument(
- "--embedding_dim",
- type=int,
- default=128,
- help="Default Dimension of Embedding")
+ '--num_devices', type=int, default=1, help='Number of GPU devices')
return parser.parse_args()
-def start_train(args):
- y_vocab = reader.YoochooseVocab()
- y_vocab.load([args.train_file])
- logger.info("Load yoochoose vocabulary size: {}".format(len(y_vocab.get_vocab())))
- y_data = reader.YoochooseDataset(y_vocab)
- train_reader = paddle.batch(
- paddle.reader.shuffle(
- y_data.train([args.train_file]), buf_size=args.batch_size * 100),
- batch_size=args.batch_size)
- place = fluid.CPUPlace()
- ssr = SequenceSemanticRetrieval(
- len(y_vocab.get_vocab()), args.embedding_dim, args.matching_dim
- )
- input_data, user_rep, item_rep, avg_cost, acc = ssr.train()
- optimizer = fluid.optimizer.Adam(learning_rate=1e-4)
+def get_cards(args):
+ return args.num_devices
+
+
+def train(args):
+ use_cuda = True if args.use_cuda else False
+ parallel = True if args.parallel else False
+ print("use_cuda:", use_cuda, "parallel:", parallel)
+ train_reader, vocab_size = utils.construct_train_data(
+ args.train_dir, args.vocab_path, args.batch_size * get_cards(args))
+ place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
+ ssr = SequenceSemanticRetrieval(vocab_size, args.embedding_dim,
+ args.hidden_size)
+ # Train program
+ train_input_data, cos_pos, avg_cost, acc = ssr.train()
+
+    # Optimization to minimize the loss
+ optimizer = fluid.optimizer.Adagrad(learning_rate=args.base_lr)
optimizer.minimize(avg_cost)
- startup_program = fluid.default_startup_program()
- loop_program = fluid.default_main_program()
- data_list = [var.name for var in input_data]
+
+ data_list = [var.name for var in train_input_data]
feeder = fluid.DataFeeder(feed_list=data_list, place=place)
exe = fluid.Executor(place)
- exe.run(startup_program)
+ exe.run(fluid.default_startup_program())
+ if parallel:
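+        # ParallelExecutor splits each batch across the devices, which is why
+        # the reader above batches batch_size * num_devices samples at a time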
+ train_exe = fluid.ParallelExecutor(
+ use_cuda=use_cuda, loss_name=avg_cost.name)
+ else:
+ train_exe = exe
+ total_time = 0.0
for pass_id in range(args.epochs):
+ epoch_idx = pass_id + 1
+ print("epoch_%d start" % epoch_idx)
+ t0 = time.time()
+ i = 0
for batch_id, data in enumerate(train_reader()):
- loss_val, correct_val = exe.run(loop_program,
- feed=feeder.feed(data),
- fetch_list=[avg_cost, acc])
- logger.info("Train --> pass: {} batch_id: {} avg_cost: {}, acc: {}".
- format(pass_id, batch_id, loss_val,
- float(correct_val) / args.batch_size))
- fluid.io.save_inference_model(args.model_output_dir,
- [var.name for val in input_data],
- [user_rep, item_rep, avg_cost, acc], exe)
+ i += 1
+ loss_val, correct_val = train_exe.run(
+ feed=feeder.feed(data), fetch_list=[avg_cost.name, acc.name])
+ if i % args.print_batch == 0:
+ logger.info(
+ "Train --> pass: {} batch_id: {} avg_cost: {}, acc: {}".
+ format(pass_id, batch_id,
+ np.mean(loss_val),
+ float(np.mean(correct_val)) / args.batch_size))
+ t1 = time.time()
+ total_time += t1 - t0
+ print("epoch:%d num_steps:%d time_cost(s):%f" %
+ (epoch_idx, i, total_time / epoch_idx))
+ save_dir = "%s/epoch_%d" % (args.model_dir, epoch_idx)
+ fluid.io.save_params(executor=exe, dirname=save_dir)
+ print("model saved in %s" % save_dir)
+
def main():
args = parse_args()
- start_train(args)
+ train(args)
+
if __name__ == "__main__":
main()
-
diff --git a/fluid/PaddleRec/ssr/train_data/small_train.txt b/fluid/PaddleRec/ssr/train_data/small_train.txt
new file mode 100644
index 0000000000000000000000000000000000000000..6252a52c5ce3fe5bcc4f28c274e67461f47e1586
--- /dev/null
+++ b/fluid/PaddleRec/ssr/train_data/small_train.txt
@@ -0,0 +1,100 @@
+197 196 198 236
+93 93 384 362 363 43
+336 364 407
+421 322
+314 388
+128 58
+138 138
+46 46 46
+34 34 57 57 57 342 228 321 346 357 59 376
+110 110
+135 94 135
+27 250 27
+129 118
+18 18 18
+81 81 89 89
+27 27
+20 20 20 20 20 212
+33 33 33 33
+62 62 62 63 63 55 248 124 381 428 383 382 43 43 261 63
+90 90 78 78
+399 397 202 141 104 104 245 192 191 271
+239 332 283 88
+187 313
+136 136 324
+41 41
+352 128
+413 414
+410 45 45 45 1 1 1 1 1 1 1 1 31 31 31 31
+92 334 92
+95 285
+215 249
+390 41
+116 116
+300 252
+2 2 2 2 2
+8 8 8 8 8 8
+53 241 259
+118 129 126 94 137 208 216 299
+209 368 139 418 419
+311 180
+303 302 203 284
+369 32 32 32 32 337
+207 47 47 47
+106 107
+143 143
+179 178
+109 109
+405 79 79 371 246
+251 417 427
+333 88 387 358 123 348 394 360 36 365
+3 3 3 3 3
+189 188
+398 425
+107 406
+281 201 141
+2 2 2
+359 54
+395 385 293
+60 60 60 121 121 233 58 58
+24 199 175 24 24 24 351 386 106
+115 294
+122 122 127 127
+35 35
+282 393
+277 140 140 343 225 123 36 36 36 221 114 114 59 59 117 117 247 367 219 258 222 301 375 350 353 111 111
+275 272 273 274 331 330 305 108 76 76 108
+26 26 26 408 26
+290 18 210 291
+372 139 424 113
+341 340 335
+120 370
+224 200
+426 416
+137 319
+402 55
+54 54
+327 119
+125 125
+391 396 354 355 389
+142 142
+295 320
+113 366
+253 85 85
+56 56 310 309 308 307 278 25 25 19 19 3 312 19 19 19 3 25
+220 338
+34 130
+130 120 380 315
+339 422
+379 378
+95 56 392 115
+55 124
+126 34
+349 373 361
+195 194
+75 75
+64 64 64
+35 35
+40 40 40 242 77 244 77 243
+257 316
+103 306 102 51 52 103 105 52 52 292 318 112 286 345 237 276 112 51 102 105
diff --git a/fluid/PaddleRec/ssr/utils.py b/fluid/PaddleRec/ssr/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..4fe9ef470ed0a2a5da7bef6a975f45e5a04ab18e
--- /dev/null
+++ b/fluid/PaddleRec/ssr/utils.py
@@ -0,0 +1,49 @@
+import numpy as np
+import reader as reader
+import os
+import logging
+import paddle.fluid as fluid
+import paddle
+
+
+def get_vocab_size(vocab_path):
+ with open(vocab_path, "r") as rf:
+ line = rf.readline()
+ return int(line.strip())
+
+
+def construct_train_data(file_dir, vocab_path, batch_size):
+ vocab_size = get_vocab_size(vocab_path)
+ files = [file_dir + '/' + f for f in os.listdir(file_dir)]
+ y_data = reader.YoochooseDataset(vocab_size)
+ train_reader = paddle.batch(
+ paddle.reader.shuffle(
+ y_data.train(files), buf_size=batch_size * 100),
+ batch_size=batch_size)
+ return train_reader, vocab_size
+
+
+def construct_test_data(file_dir, vocab_path, batch_size):
+ vocab_size = get_vocab_size(vocab_path)
+ files = [file_dir + '/' + f for f in os.listdir(file_dir)]
+ y_data = reader.YoochooseDataset(vocab_size)
+ test_reader = paddle.batch(y_data.test(files), batch_size=batch_size)
+ return test_reader, vocab_size
+
+
+def infer_data(raw_data, place):
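+    # Pack the variable-length user sequences into one flat LoDTensor; the
+    # lod keeps track of where each sequence starts and ends.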
+ data = [dat[0] for dat in raw_data]
+ seq_lens = [len(seq) for seq in data]
+ cur_len = 0
+ lod = [cur_len]
+ for l in seq_lens:
+ cur_len += l
+ lod.append(cur_len)
+ flattened_data = np.concatenate(data, axis=0).astype("int64")
+ flattened_data = flattened_data.reshape([len(flattened_data), 1])
+ res = fluid.LoDTensor()
+ res.set(flattened_data, place)
+ res.set_lod([lod])
+ p_label = [dat[1] for dat in raw_data]
+ pos_label = np.array(p_label).astype("int64").reshape(len(p_label), 1)
+ return res, pos_label
diff --git a/fluid/PaddleRec/ssr/vocab.txt b/fluid/PaddleRec/ssr/vocab.txt
new file mode 100644
index 0000000000000000000000000000000000000000..c15fb720f8f8a9163cfec319b226864a3246a7e7
--- /dev/null
+++ b/fluid/PaddleRec/ssr/vocab.txt
@@ -0,0 +1 @@
+501
diff --git a/fluid/PaddleRec/tagspace/train.py b/fluid/PaddleRec/tagspace/train.py
index 69750084336760f4ee792cda9c217f1fbe0a8ce6..914c824c134a8c60c790dc5473431215924e0dff 100644
--- a/fluid/PaddleRec/tagspace/train.py
+++ b/fluid/PaddleRec/tagspace/train.py
@@ -21,15 +21,9 @@ def parse_args():
parser.add_argument(
'--train_dir', type=str, default='train_data', help='train file')
parser.add_argument(
- '--vocab_text_path',
- type=str,
- default='vocab_text.txt',
- help='vocab_text file')
+        '--vocab_text_path', type=str, default='vocab_text.txt', help='text vocab file')
parser.add_argument(
- '--vocab_tag_path',
- type=str,
- default='vocab_tag.txt',
- help='vocab_text file')
+        '--vocab_tag_path', type=str, default='vocab_tag.txt', help='tag vocab file')
parser.add_argument(
'--model_dir', type=str, default='model_', help='model dir')
parser.add_argument(
diff --git a/fluid/PaddleRec/word2vec/README.cn.md b/fluid/PaddleRec/word2vec/README.cn.md
index 9c876ccf42ab928f43ab843dd92e67e927c05e2e..13e79c413103529844227b6f9836cadc17e1e1aa 100644
--- a/fluid/PaddleRec/word2vec/README.cn.md
+++ b/fluid/PaddleRec/word2vec/README.cn.md
@@ -25,6 +25,14 @@ cd data && ./download.sh && cd ..
```bash
python preprocess.py --data_path ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled --dict_path data/1-billion_dict
```
+If you would like to use a custom dict in the following format:
+```bash
+
+a
+b
+c
+```
+please set --other_dict_path to the path of the vocab file you will use, and pass --with_other_dict to use it.
## Train
The command line options for training can be listed by `python train.py -h`.
@@ -32,11 +40,21 @@ python preprocess.py --data_path ./data/1-billion-word-language-modeling-benchma
### Local Train:
```bash
+export CPU_NUM=1
python train.py \
--train_data_path ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled \
--dict_path data/1-billion_dict \
+ --with_hs --with_nce --is_local \
2>&1 | tee train.log
```
+If you would like to use a custom dict in the following format:
+```bash
+
+a
+b
+c
+```
+please set --other_dict_path to the path of the vocab file you will use, and pass --with_other_dict to use it.
### Distributed Train
@@ -59,6 +77,11 @@ sh cluster_train.sh
You can also add your own tests in the `build_test_case` method by mimicking the given examples.
+To run test cases from test files, please download the test files into the 'test' directory.
+Each test case has the following structure:
+ `word1 word2 word3 word4`
+which is evaluated as `word1 - word2 + word3 = word4`.
+
Inference during training:
```bash
diff --git a/fluid/PaddleRec/word2vec/README.md b/fluid/PaddleRec/word2vec/README.md
index c99b9c2aa2bb8137f4e44115786e88fe966b3483..17a61a4286400f845540ef40bfa5875a094ad0e0 100644
--- a/fluid/PaddleRec/word2vec/README.md
+++ b/fluid/PaddleRec/word2vec/README.md
@@ -14,6 +14,11 @@ Download dataset:
```bash
cd data && ./download.sh && cd ..
```
+If you would like to use the supported third-party vocab, please run:
+
+```bash
+wget http://download.tensorflow.org/models/LM_LSTM_CNN/vocab-2016-09-10.txt
+```
## Model
This model implements the skip-gram model of word2vec.
@@ -26,18 +31,31 @@ Preprocess the training data to generate a word dict.
```bash
python preprocess.py --data_path ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled --dict_path data/1-billion_dict
```
+If you would like to use your own vocab, follow the format below:
+```bash
+
+a
+b
+c
+```
+Then, please set --other_dict_path to the path of the vocab file you will use, and turn on the --with_other_dict flag to use it.
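+For example, assuming the custom vocab is saved at ./data/your_vocab.txt (an illustrative path):
+```bash
+python preprocess.py --data_path ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled --dict_path data/1-billion_dict --with_other_dict --other_dict_path ./data/your_vocab.txt
+```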
## Train
The command line options for training can be listed by `python train.py -h`.
### Local Train:
+We set CPU_NUM=1 as the default number of CPU cores to use:
```bash
+export CPU_NUM=1 && \
python train.py \
--train_data_path ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled \
--dict_path data/1-billion_dict \
+ --with_hs --with_nce --is_local \
2>&1 | tee train.log
```
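+Here `--with_hs` enables the hierarchical-softmax cost and `--with_nce` enables the NCE cost; when both flags are set the two costs are summed (see `network_conf.py`).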
-
+If you would like to use the supported third-party vocab, please set --other_dict_path to the path of the vocab file you will use, and turn on the --with_other_dict flag to use it.
### Distributed Train
Run distributed training with 2 pservers and 2 trainers on a single machine.
@@ -62,6 +80,11 @@ For: boy - girl + aunt = uncle
You can also add your own tests by mimicking the examples given in the `build_test_case` method.
+To run test cases from test files, please download the test files into the 'test' directory.
+Each test case has the following structure:
+ `word1 word2 word3 word4`
+which is evaluated as `word1 - word2 + word3 = word4`.
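+For example, the line `boy girl aunt uncle` is checked as `boy - girl + aunt = uncle`.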
+
Inference during training:
```bash
diff --git a/fluid/PaddleRec/word2vec/data/download.sh b/fluid/PaddleRec/word2vec/data/download.sh
index 4ba05c630bfa357445c8d7b8a4e1eacd153a77b9..22cde6d926091c179e16a3ab146952e2065893c3 100644
--- a/fluid/PaddleRec/word2vec/data/download.sh
+++ b/fluid/PaddleRec/word2vec/data/download.sh
@@ -2,4 +2,3 @@
wget http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz
tar -zxvf 1-billion-word-language-modeling-benchmark-r13output.tar.gz
-
diff --git a/fluid/PaddleRec/word2vec/infer.py b/fluid/PaddleRec/word2vec/infer.py
index 28111588549a0f3e33524d55c4e2cc0beb230319..c0dd82ef7f0c3a0c6012e80ea2e85a8b80d44a75 100644
--- a/fluid/PaddleRec/word2vec/infer.py
+++ b/fluid/PaddleRec/word2vec/infer.py
@@ -1,12 +1,10 @@
-import paddle
import time
import os
import paddle.fluid as fluid
import numpy as np
-from Queue import PriorityQueue
import logging
import argparse
-from sklearn.metrics.pairwise import cosine_similarity
+import preprocess
word_to_id = dict()
id_to_word = dict()
@@ -47,6 +45,22 @@ def parse_args():
required=False,
default=True,
help='if using infer_during_train, (default: True)')
+ parser.add_argument(
+ '--test_acc',
+ action='store_true',
+ required=False,
+ default=False,
+        help='whether to read test cases from test files (default: False)')
+ parser.add_argument(
+ '--test_files_dir',
+ type=str,
+ default='test',
+ help="The path for test_files) (default: test)")
+ parser.add_argument(
+ '--test_batch_size',
+ type=int,
+ default=1000,
+ help="test used batch size (default: 1000)")
return parser.parse_args()
@@ -58,75 +72,168 @@ def BuildWord_IdMap(dict_path):
id_to_word[int(line.split(' ')[1])] = line.split(' ')[0]
-def inference_prog():
+def inference_prog():  # build a minimal program so the saved embedding can be loaded
fluid.layers.create_parameter(
shape=[1, 1], dtype='float32', name="embeding")
-def build_test_case(emb):
+def build_test_case_from_file(args, emb):
+ logger.info("test files dir: {}".format(args.test_files_dir))
+ current_list = os.listdir(args.test_files_dir)
+ logger.info("test files list: {}".format(current_list))
+ test_cases = list()
+ test_labels = list()
+ test_case_descs = list()
+ exclude_lists = list()
+ for file_dir in current_list:
+ with open(args.test_files_dir + "/" + file_dir, 'r') as f:
+ for line in f:
+ if ':' in line:
+                    logger.info("skipping line: {}".format(line))
+ else:
+ line = preprocess.strip_lines(line, word_to_id)
+ test_case = emb[word_to_id[line.split()[0]]] - emb[
+ word_to_id[line.split()[1]]] + emb[word_to_id[
+ line.split()[2]]]
+ test_case_desc = line.split()[0] + " - " + line.split()[
+ 1] + " + " + line.split()[2] + " = " + line.split()[3]
+ test_cases.append(test_case)
+ test_case_descs.append(test_case_desc)
+ test_labels.append(word_to_id[line.split()[3]])
+ exclude_lists.append([
+ word_to_id[line.split()[0]],
+ word_to_id[line.split()[1]], word_to_id[line.split()[2]]
+ ])
+ test_cases = norm(np.array(test_cases))
+ return test_cases, test_case_descs, test_labels, exclude_lists
+
+
+def build_small_test_case(emb):
emb1 = emb[word_to_id['boy']] - emb[word_to_id['girl']] + emb[word_to_id[
'aunt']]
desc1 = "boy - girl + aunt = uncle"
+ label1 = word_to_id["uncle"]
emb2 = emb[word_to_id['brother']] - emb[word_to_id['sister']] + emb[
word_to_id['sisters']]
desc2 = "brother - sister + sisters = brothers"
+ label2 = word_to_id["brothers"]
emb3 = emb[word_to_id['king']] - emb[word_to_id['queen']] + emb[word_to_id[
'woman']]
desc3 = "king - queen + woman = man"
+ label3 = word_to_id["man"]
emb4 = emb[word_to_id['reluctant']] - emb[word_to_id['reluctantly']] + emb[
word_to_id['slowly']]
desc4 = "reluctant - reluctantly + slowly = slow"
+ label4 = word_to_id["slow"]
emb5 = emb[word_to_id['old']] - emb[word_to_id['older']] + emb[word_to_id[
'deeper']]
desc5 = "old - older + deeper = deep"
- return [[emb1, desc1], [emb2, desc2], [emb3, desc3], [emb4, desc4],
- [emb5, desc5]]
+ label5 = word_to_id["deep"]
+
+ emb6 = emb[word_to_id['boy']]
+ desc6 = "boy"
+ label6 = word_to_id["boy"]
+ emb7 = emb[word_to_id['king']]
+ desc7 = "king"
+ label7 = word_to_id["king"]
+ emb8 = emb[word_to_id['sun']]
+ desc8 = "sun"
+ label8 = word_to_id["sun"]
+ emb9 = emb[word_to_id['key']]
+ desc9 = "key"
+ label9 = word_to_id["key"]
+ test_cases = [emb1, emb2, emb3, emb4, emb5, emb6, emb7, emb8, emb9]
+ test_case_desc = [
+ desc1, desc2, desc3, desc4, desc5, desc6, desc7, desc8, desc9
+ ]
+ test_labels = [
+ label1, label2, label3, label4, label5, label6, label7, label8, label9
+ ]
+ return norm(np.array(test_cases)), test_case_desc, test_labels
+
+
+def build_test_case(args, emb):
+ if args.test_acc:
+ return build_test_case_from_file(args, emb)
+ else:
+ return build_small_test_case(emb)
+
+
+def norm(x):
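+    # L2-normalize each row so that a plain dot product equals cosine
+    # similarity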
+ y = np.linalg.norm(x, axis=1, keepdims=True)
+ return x / y
def inference_test(scope, model_dir, args):
BuildWord_IdMap(args.dict_path)
logger.info("model_dir is: {}".format(model_dir + "/"))
emb = np.array(scope.find_var("embeding").get_tensor())
- test_cases = build_test_case(emb)
+ x = norm(emb)
logger.info("inference result: ====================")
- for case in test_cases:
- pq = topK(args.rank_num, emb, case[0])
- logger.info("Test result for {}".format(case[1]))
- pq_tmps = list()
- for i in range(args.rank_num):
- pq_tmps.append(pq.get())
- for i in range(len(pq_tmps)):
- logger.info("{} nearest is {}, rate is {}".format(i, id_to_word[
- pq_tmps[len(pq_tmps) - 1 - i].id], pq_tmps[len(pq_tmps) - 1 - i]
- .priority))
- del pq_tmps[:]
-
-
-class PQ_Entry(object):
- def __init__(self, cos_similarity, id):
- self.priority = cos_similarity
- self.id = id
-
- def __cmp__(self, other):
- return cmp(self.priority, other.priority)
-
-
-def topK(k, emb, test_emb):
- pq = PriorityQueue(k + 1)
- if len(emb) <= k:
- for i in range(len(emb)):
- x = cosine_similarity([emb[i]], [test_emb])
- pq.put(PQ_Entry(x, i))
- return pq
-
- for i in range(len(emb)):
- x = cosine_similarity([emb[i]], [test_emb])
- pq_e = PQ_Entry(x, i)
- if pq.full():
- pq.get()
- pq.put(pq_e)
- pq.get()
- return pq
+ test_cases = None
+ test_case_desc = list()
+ test_labels = list()
+ exclude_lists = list()
+ if args.test_acc:
+ test_cases, test_case_desc, test_labels, exclude_lists = build_test_case(
+ args, emb)
+ else:
+ test_cases, test_case_desc, test_labels = build_test_case(args, emb)
+ exclude_lists = [[-1]]
+    actual_rank = 1 if args.test_acc else args.rank_num
+ correct_num = 0
+ cosine_similarity_matrix = np.dot(test_cases, x.T)
+    results = topKs(actual_rank, cosine_similarity_matrix, exclude_lists,
+                    args.test_acc)
+ for i in range(len(test_labels)):
+ logger.info("Test result for {}".format(test_case_desc[i]))
+ result = results[i]
+        for j in range(actual_rank):
+ if result[j][1] == test_labels[
+ i]: # if the nearest word is what we want
+ correct_num += 1
+ logger.info("{} nearest is {}, rate is {}".format(j, id_to_word[
+ result[j][1]], result[j][0]))
+ logger.info("Test acc is: {}, there are {} / {}".format(correct_num / len(
+ test_labels), correct_num, len(test_labels)))
+
+
+def topK(k, cosine_similarity_list, exclude_list, is_acc=False):
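+    # return the k highest-scoring [score, word_id] pairs; in accuracy mode
+    # the analogy's query words (exclude_list) are skipped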
+    if k == 1 and is_acc:  # fast path for the accuracy calculation
+        best_sim = -float('inf')
+        best_id = -1
+        for i in range(len(cosine_similarity_list)):
+            if cosine_similarity_list[i] >= best_sim and (
+                    i not in exclude_list):
+                best_sim = cosine_similarity_list[i]
+                best_id = i
+        return [[best_sim, best_id]]
+ else:
+ result = list()
+ result_index = np.argpartition(cosine_similarity_list, -k)[-k:]
+ for index in result_index:
+ result.append([cosine_similarity_list[index], index])
+ result.sort(reverse=True)
+ return result
+
+
+def topKs(k, cosine_similarity_matrix, exclude_lists, is_acc=False):
+    result_queues = list()
+    for i in range(cosine_similarity_matrix.shape[0]):
+        if is_acc:
+            tmp_pq = topK(k, cosine_similarity_matrix[i], exclude_lists[i],
+                          is_acc)
+        else:
+            tmp_pq = topK(k, cosine_similarity_matrix[i], exclude_lists[0],
+                          is_acc)
+        result_queues.append(tmp_pq)
+    return result_queues
def infer_during_train(args):
@@ -138,8 +245,6 @@ def infer_during_train(args):
while True:
time.sleep(60)
current_list = os.listdir(args.model_output_dir)
- # logger.info("current_list is : {}".format(current_list))
- # logger.info("model_file_list is : {}".format(model_file_list))
if set(model_file_list) == set(current_list):
if solved_new:
solved_new = False
@@ -174,6 +279,8 @@ def infer_once(args):
fluid.io.load_persistables(
executor=exe, dirname=args.model_output_dir + "/")
inference_test(Scope, args.model_output_dir, args)
+ else:
+ logger.info("Wrong Directory or save model failed!")
if __name__ == '__main__':
@@ -181,5 +288,7 @@ if __name__ == '__main__':
# while setting infer_once please specify the dir to models file with --model_output_dir
if args.infer_once:
infer_once(args)
- if args.infer_during_train:
+ elif args.infer_during_train:
infer_during_train(args)
+ else:
+ pass
diff --git a/fluid/PaddleRec/word2vec/network_conf.py b/fluid/PaddleRec/word2vec/network_conf.py
index 5b8e95136a177496a5569d7377d6a2b7f5d30714..16178c339bbcc42c1e5f1e78089a0d0942810444 100644
--- a/fluid/PaddleRec/word2vec/network_conf.py
+++ b/fluid/PaddleRec/word2vec/network_conf.py
@@ -95,8 +95,7 @@ def skip_gram_word2vec(dict_size,
capacity=64, feed_list=datas, name='py_reader', use_double_buffer=True)
words = fluid.layers.read_file(py_reader)
-
- emb = fluid.layers.embedding(
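+    # this lookup and context_emb below share param_attr name 'embeding',
+    # so target and context words use one shared embedding table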
+ target_emb = fluid.layers.embedding(
input=words[0],
is_sparse=is_sparse,
size=[dict_size, embedding_size],
@@ -104,16 +103,23 @@ def skip_gram_word2vec(dict_size,
name='embeding',
initializer=fluid.initializer.Normal(scale=1 /
math.sqrt(dict_size))))
-
+ context_emb = fluid.layers.embedding(
+ input=words[1],
+ is_sparse=is_sparse,
+ size=[dict_size, embedding_size],
+ param_attr=fluid.ParamAttr(
+ name='embeding',
+ initializer=fluid.initializer.Normal(scale=1 /
+ math.sqrt(dict_size))))
cost, cost_nce, cost_hs = None, None, None
if with_nce:
- cost_nce = nce_layer(emb, words[1], embedding_size, dict_size, 5,
+ cost_nce = nce_layer(target_emb, words[1], embedding_size, dict_size, 5,
"uniform", word_frequencys, None)
cost = cost_nce
if with_hsigmoid:
- cost_hs = hsigmoid_layer(emb, words[1], words[2], words[3], dict_size,
- is_sparse)
+ cost_hs = hsigmoid_layer(context_emb, words[0], words[2], words[3],
+ dict_size, is_sparse)
cost = cost_hs
if with_nce and with_hsigmoid:
cost = fluid.layers.elementwise_add(cost_nce, cost_hs)
diff --git a/fluid/PaddleRec/word2vec/preprocess.py b/fluid/PaddleRec/word2vec/preprocess.py
index cb8dd100f3179415ec4ce9c08e3e5a2c8ce462b2..f13d335449913305df79d47fa968d578563b25cd 100644
--- a/fluid/PaddleRec/word2vec/preprocess.py
+++ b/fluid/PaddleRec/word2vec/preprocess.py
@@ -1,7 +1,12 @@
# -*- coding: utf-8 -*
import re
+import six
import argparse
+import io
+
+prog = re.compile("[^a-z ]", flags=0)
+word_count = dict()
def parse_args():
@@ -22,18 +27,75 @@ def parse_args():
type=int,
default=5,
help="If the word count is less then freq, it will be removed from dict")
+
parser.add_argument(
- '--is_local',
+ '--with_other_dict',
action='store_true',
required=False,
default=False,
- help='Local train or not, (default: False)')
+        help='Use a third-party provided dict (default: False)')
+
+ parser.add_argument(
+ '--other_dict_path',
+ type=str,
+ default='',
+        help='The path of the third-party provided dict (default: empty)')
return parser.parse_args()
def text_strip(text):
- return re.sub("[^a-z ]", "", text)
+ return prog.sub("", text)
+
+
+# users can define their own strip rules by modifying this method
+def strip_lines(line, vocab=word_count):
+ return _replace_oov(vocab, native_to_unicode(line))
+
+
+# Shameless copy from Tensorflow https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/text_encoder.py
+def _replace_oov(original_vocab, line):
+ """Replace out-of-vocab words with "".
+ This maintains compatibility with published results.
+ Args:
+ original_vocab: a set of strings (The standard vocabulary for the dataset)
+ line: a unicode string - a space-delimited sequence of words.
+ Returns:
+ a unicode string - a space-delimited sequence of words.
+ """
+ return u" ".join([
+ word if word in original_vocab else u"" for word in line.split()
+ ])
+
+
+# Shameless copy from Tensorflow https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/text_encoder.py
+# Unicode utility functions that work with Python 2 and 3
+def native_to_unicode(s):
+ if _is_unicode(s):
+ return s
+ try:
+ return _to_unicode(s)
+ except UnicodeDecodeError:
+ res = _to_unicode(s, ignore_errors=True)
+ return res
+
+
+def _is_unicode(s):
+ if six.PY2:
+ if isinstance(s, unicode):
+ return True
+ else:
+ if isinstance(s, str):
+ return True
+ return False
+
+
+def _to_unicode(s, ignore_errors=False):
+ if _is_unicode(s):
+ return s
+ error_mode = "ignore" if ignore_errors else "strict"
+ return s.decode("utf-8", errors=error_mode)
def build_Huffman(word_count, max_code_length):
@@ -120,7 +182,7 @@ def build_Huffman(word_count, max_code_length):
return word_point, word_code, word_code_len
-def preprocess(data_path, dict_path, freq, is_local):
+def preprocess(args):
"""
preprocess the data, generate the dictionary and save it into dict_path.
:param data_path: the input data path.
@@ -129,14 +191,26 @@ def preprocess(data_path, dict_path, freq, is_local):
:return:
"""
# word to count
- word_count = dict()
-
- if is_local:
- for i in range(1, 100):
- with open(data_path + "/news.en-000{:0>2d}-of-00100".format(
- i)) as f:
- for line in f:
- line = line.lower()
+
+ if args.with_other_dict:
+ with io.open(args.other_dict_path, 'r', encoding='utf-8') as f:
+ for line in f:
+ word_count[native_to_unicode(line.strip())] = 1
+
+ for i in range(1, 100):
+ with io.open(
+ args.data_path + "/news.en-000{:0>2d}-of-00100".format(i),
+ encoding='utf-8') as f:
+ for line in f:
+ if args.with_other_dict:
+ line = strip_lines(line)
+ words = line.split()
+ for item in words:
+ if item in word_count:
+ word_count[item] = word_count[item] + 1
+ else:
+ word_count[native_to_unicode('')] += 1
+ else:
line = text_strip(line)
words = line.split()
for item in words:
@@ -146,26 +220,25 @@ def preprocess(data_path, dict_path, freq, is_local):
word_count[item] = 1
item_to_remove = []
for item in word_count:
- if word_count[item] <= freq:
+ if word_count[item] <= args.freq:
item_to_remove.append(item)
for item in item_to_remove:
del word_count[item]
path_table, path_code, word_code_len = build_Huffman(word_count, 40)
- with open(dict_path, 'w+') as f:
+ with io.open(args.dict_path, 'w+', encoding='utf-8') as f:
for k, v in word_count.items():
- f.write(str(k) + " " + str(v) + '\n')
+ f.write(k + " " + str(v) + '\n')
- with open(dict_path + "_ptable", 'w+') as f2:
+ with io.open(args.dict_path + "_ptable", 'w+', encoding='utf-8') as f2:
for pk, pv in path_table.items():
- f2.write(str(pk) + ":" + ' '.join((str(x) for x in pv)) + '\n')
+ f2.write(pk + '\t' + ' '.join((str(x) for x in pv)) + '\n')
- with open(dict_path + "_pcode", 'w+') as f3:
- for pck, pcv in path_table.items():
- f3.write(str(pck) + ":" + ' '.join((str(x) for x in pcv)) + '\n')
+ with io.open(args.dict_path + "_pcode", 'w+', encoding='utf-8') as f3:
+ for pck, pcv in path_code.items():
+ f3.write(pck + '\t' + ' '.join((str(x) for x in pcv)) + '\n')
if __name__ == "__main__":
- args = parse_args()
- preprocess(args.data_path, args.dict_path, args.freq, args.is_local)
+ preprocess(parse_args())
diff --git a/fluid/PaddleRec/word2vec/reader.py b/fluid/PaddleRec/word2vec/reader.py
index 1e6f15ea35be395b8e270853cf9868d09c654dcb..df479a4b71bbb4c2b2297c4d04afe275ba1c9a81 100644
--- a/fluid/PaddleRec/word2vec/reader.py
+++ b/fluid/PaddleRec/word2vec/reader.py
@@ -2,14 +2,32 @@
import numpy as np
import preprocess
-
import logging
+import io
logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger("fluid")
logger.setLevel(logging.INFO)
+class NumpyRandomInt(object):
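+    # Buffered random-int source: draws buf_size integers in [a, b] at once
+    # and serves them one at a time, avoiding a numpy call per sample.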
+ def __init__(self, a, b, buf_size=1000):
+ self.idx = 0
+ self.buffer = np.random.random_integers(a, b, buf_size)
+ self.a = a
+ self.b = b
+
+ def __call__(self):
+ if self.idx == len(self.buffer):
+ self.buffer = np.random.random_integers(self.a, self.b,
+ len(self.buffer))
+ self.idx = 0
+
+ result = self.buffer[self.idx]
+ self.idx += 1
+ return result
+
+
class Word2VecReader(object):
def __init__(self,
dict_path,
@@ -24,6 +42,7 @@ class Word2VecReader(object):
self.num_non_leaf = 0
self.word_to_id_ = dict()
self.id_to_word = dict()
+ self.word_count = dict()
self.word_to_path = dict()
self.word_to_code = dict()
self.trainer_id = trainer_id
@@ -33,41 +52,43 @@ class Word2VecReader(object):
word_counts = []
word_id = 0
- with open(dict_path, 'r') as f:
+ with io.open(dict_path, 'r', encoding='utf-8') as f:
for line in f:
word, count = line.split()[0], int(line.split()[1])
+ self.word_count[word] = count
self.word_to_id_[word] = word_id
self.id_to_word[word_id] = word #build id to word dict
word_id += 1
word_counts.append(count)
word_all_count += count
- with open(dict_path + "_word_to_id_", 'w+') as f6:
+ with io.open(dict_path + "_word_to_id_", 'w+', encoding='utf-8') as f6:
for k, v in self.word_to_id_.items():
- f6.write(str(k) + " " + str(v) + '\n')
+ f6.write(k + " " + str(v) + '\n')
self.dict_size = len(self.word_to_id_)
self.word_frequencys = [
float(count) / word_all_count for count in word_counts
]
- print("dict_size = " + str(
- self.dict_size)) + " word_all_count = " + str(word_all_count)
+ print("dict_size = " + str(self.dict_size) + " word_all_count = " + str(
+ word_all_count))
- with open(dict_path + "_ptable", 'r') as f2:
+ with io.open(dict_path + "_ptable", 'r', encoding='utf-8') as f2:
for line in f2:
- self.word_to_path[line.split(":")[0]] = np.fromstring(
- line.split(':')[1], dtype=int, sep=' ')
+ self.word_to_path[line.split('\t')[0]] = np.fromstring(
+ line.split('\t')[1], dtype=int, sep=' ')
self.num_non_leaf = np.fromstring(
- line.split(':')[1], dtype=int, sep=' ')[0]
+ line.split('\t')[1], dtype=int, sep=' ')[0]
print("word_ptable dict_size = " + str(len(self.word_to_path)))
- with open(dict_path + "_pcode", 'r') as f3:
+ with io.open(dict_path + "_pcode", 'r', encoding='utf-8') as f3:
for line in f3:
- self.word_to_code[line.split(":")[0]] = np.fromstring(
- line.split(':')[1], dtype=int, sep=' ')
+ self.word_to_code[line.split('\t')[0]] = np.fromstring(
+ line.split('\t')[1], dtype=int, sep=' ')
print("word_pcode dict_size = " + str(len(self.word_to_code)))
+ self.random_generator = NumpyRandomInt(1, self.window_size_ + 1)
- def get_context_words(self, words, idx, window_size):
+ def get_context_words(self, words, idx):
"""
Get the context word list of target word.
@@ -75,31 +96,38 @@ class Word2VecReader(object):
idx: input word index
-        window_size: window size
"""
- target_window = np.random.randint(1, window_size + 1)
- # need to keep in mind that maybe there are no enough words before the target word.
- start_point = idx - target_window if (idx - target_window) > 0 else 0
+ target_window = self.random_generator()
+        start_point = idx - target_window
+ if start_point < 0:
+ start_point = 0
end_point = idx + target_window
- # context words of the target word
- targets = set(words[start_point:idx] + words[idx + 1:end_point + 1])
- return list(targets)
+ targets = words[start_point:idx] + words[idx + 1:end_point + 1]
+
+ return set(targets)
- def train(self, with_hs):
+ def train(self, with_hs, with_other_dict):
def _reader():
for file in self.filelist:
- with open(self.data_path_ + "/" + file, 'r') as f:
+ with io.open(
+ self.data_path_ + "/" + file, 'r',
+ encoding='utf-8') as f:
logger.info("running data in {}".format(self.data_path_ +
"/" + file))
count = 1
for line in f:
if self.trainer_id == count % self.trainer_num:
- line = preprocess.text_strip(line)
+ if with_other_dict:
+ line = preprocess.strip_lines(line,
+ self.word_count)
+ else:
+ line = preprocess.text_strip(line)
word_ids = [
self.word_to_id_[word] for word in line.split()
if word in self.word_to_id_
]
for idx, target_id in enumerate(word_ids):
context_word_ids = self.get_context_words(
- word_ids, idx, self.window_size_)
+ word_ids, idx)
for context_id in context_word_ids:
yield [target_id], [context_id]
else:
@@ -108,27 +136,33 @@ class Word2VecReader(object):
def _reader_hs():
for file in self.filelist:
- with open(self.data_path_ + "/" + file, 'r') as f:
+ with io.open(
+ self.data_path_ + "/" + file, 'r',
+ encoding='utf-8') as f:
logger.info("running data in {}".format(self.data_path_ +
"/" + file))
count = 1
for line in f:
if self.trainer_id == count % self.trainer_num:
- line = preprocess.text_strip(line)
+ if with_other_dict:
+ line = preprocess.strip_lines(line,
+ self.word_count)
+ else:
+ line = preprocess.text_strip(line)
word_ids = [
self.word_to_id_[word] for word in line.split()
if word in self.word_to_id_
]
for idx, target_id in enumerate(word_ids):
context_word_ids = self.get_context_words(
- word_ids, idx, self.window_size_)
+ word_ids, idx)
for context_id in context_word_ids:
yield [target_id], [context_id], [
- self.word_to_code[self.id_to_word[
- context_id]]
- ], [
self.word_to_path[self.id_to_word[
- context_id]]
+ target_id]]
+ ], [
+ self.word_to_code[self.id_to_word[
+ target_id]]
]
else:
pass
@@ -141,13 +175,20 @@ class Word2VecReader(object):
if __name__ == "__main__":
- window_size = 10
+ window_size = 5
+
+ reader = Word2VecReader(
+ "./data/1-billion_dict",
+ "./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/",
+ ["news.en-00001-of-00100"], 0, 1)
- reader = Word2VecReader("data/enwik9_dict", "data/enwik9", window_size)
i = 0
- for x, y in reader.train()():
+    for x, y, z, f in reader.train(True, False)():
print("x: " + str(x))
print("y: " + str(y))
+ print("path: " + str(z))
+ print("code: " + str(f))
print("\n")
if i == 10:
exit(0)
diff --git a/fluid/PaddleRec/word2vec/train.py b/fluid/PaddleRec/word2vec/train.py
index 85fa0efdeca375c40a0e2d1eafc6a14ed2521049..2c8512e4954c60a9f0f0c45533c08170d4e6e09b 100644
--- a/fluid/PaddleRec/word2vec/train.py
+++ b/fluid/PaddleRec/word2vec/train.py
@@ -1,5 +1,4 @@
from __future__ import print_function
-
import argparse
import logging
import os
@@ -13,7 +12,7 @@ os.environ["CUDA_VISIBLE_DEVICES"] = ""
import paddle
import paddle.fluid as fluid
from paddle.fluid.executor import global_scope
-
+import six
import reader
from network_conf import skip_gram_word2vec
from infer import inference_test
@@ -30,7 +29,7 @@ def parse_args():
'--train_data_path',
type=str,
default='./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled',
- help="The path of training dataset")
+ help="The path of taining dataset")
parser.add_argument(
'--dict_path',
type=str,
@@ -44,7 +43,7 @@ def parse_args():
parser.add_argument(
'--batch_size',
type=int,
- default=100,
+ default=1000,
help="The size of mini-batch (default:100)")
parser.add_argument(
'--num_passes',
@@ -117,6 +116,13 @@ def parse_args():
default=False,
help='Do inference every 100 batches , (default: False)')
+ parser.add_argument(
+ '--with_other_dict',
+ action='store_true',
+ required=False,
+ default=False,
+ help='if use other dict , (default: False)')
+
parser.add_argument(
'--rank_num',
type=int,
@@ -126,14 +132,44 @@ def parse_args():
return parser.parse_args()
+def convert_python_to_tensor(batch_size, sample_reader, is_hs):
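+    # Groups plain python samples into whole-batch fluid Tensors for
+    # py_reader: hierarchical-softmax mode carries four fields (target,
+    # context, path, code) while nce mode carries two (target, context).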
+ def __reader__():
+ result = None
+ if is_hs:
+ result = [[], [], [], []]
+ else:
+ result = [[], []]
+ for sample in sample_reader():
+ for i, fea in enumerate(sample):
+ result[i].append(fea)
+ if len(result[0]) == batch_size:
+ tensor_result = []
+ for tensor in result:
+ t = fluid.Tensor()
+ dat = np.array(tensor, dtype='int64')
+ if len(dat.shape) > 2:
+ dat = dat.reshape((dat.shape[0], dat.shape[2]))
+ elif len(dat.shape) == 1:
+ dat = dat.reshape((-1, 1))
+ t.set(dat, fluid.CPUPlace())
+
+ tensor_result.append(t)
+ yield tensor_result
+ if is_hs:
+ result = [[], [], [], []]
+ else:
+ result = [[], []]
+
+ return __reader__
+
+
def train_loop(args, train_program, reader, py_reader, loss, trainer_id):
- train_reader = paddle.batch(
- paddle.reader.shuffle(
- reader.train((args.with_hs or (not args.with_nce))),
- buf_size=args.batch_size * 100),
- batch_size=args.batch_size)
- py_reader.decorate_paddle_reader(train_reader)
+ py_reader.decorate_tensor_provider(
+ convert_python_to_tensor(args.batch_size,
+ reader.train((args.with_hs or (
+ not args.with_nce)), args.with_other_dict),
+ (args.with_hs or (not args.with_nce))))
place = fluid.CPUPlace()
@@ -141,6 +177,7 @@ def train_loop(args, train_program, reader, py_reader, loss, trainer_id):
exe.run(fluid.default_startup_program())
exec_strategy = fluid.ExecutionStrategy()
+ exec_strategy.use_experimental_executor = True
print("CPU_NUM:" + str(os.getenv("CPU_NUM")))
exec_strategy.num_threads = int(os.getenv("CPU_NUM"))
@@ -162,36 +199,27 @@ def train_loop(args, train_program, reader, py_reader, loss, trainer_id):
profiler_step_end = 30
for pass_id in range(args.num_passes):
- epoch_start = time.time()
py_reader.start()
+        time.sleep(10)  # give py_reader a moment to fill its queue before timing
+ epoch_start = time.time()
batch_id = 0
- start = time.clock()
+ start = time.time()
try:
while True:
- if profiler_step == profiler_step_start:
- fluid.profiler.start_profiler(profile_state)
-
loss_val = train_exe.run(fetch_list=[loss.name])
loss_val = np.mean(loss_val)
- if profiler_step == profiler_step_end:
- fluid.profiler.stop_profiler('total', 'trainer_profile.log')
- profiler_step += 1
- else:
- profiler_step += 1
-
if batch_id % 50 == 0:
logger.info(
"TRAIN --> pass: {} batch: {} loss: {} reader queue:{}".
format(pass_id, batch_id,
- loss_val.mean() / args.batch_size,
- py_reader.queue.size()))
+ loss_val.mean(), py_reader.queue.size()))
if args.with_speed:
if batch_id % 1000 == 0 and batch_id != 0:
- elapsed = (time.clock() - start)
- start = time.clock()
+ elapsed = (time.time() - start)
+ start = time.time()
samples = 1001 * args.batch_size * int(
os.getenv("CPU_NUM"))
logger.info("Time used: {}, Samples/Sec: {}".format(
@@ -240,7 +268,7 @@ def train(args):
args.dict_path, args.train_data_path, filelist, 0, 1)
else:
trainer_id = int(os.environ["PADDLE_TRAINER_ID"])
- trainers = int(os.environ["PADDLE_TRAINERS"])
+ trainer_num = int(os.environ["PADDLE_TRAINERS"])
word2vec_reader = reader.Word2VecReader(args.dict_path,
args.train_data_path, filelist,
trainer_id, trainer_num)
@@ -257,9 +285,9 @@ def train(args):
optimizer = None
if args.with_Adam:
- optimizer = fluid.optimizer.Adam(learning_rate=1e-3)
+ optimizer = fluid.optimizer.Adam(learning_rate=1e-4, lazy_mode=True)
else:
- optimizer = fluid.optimizer.SGD(learning_rate=1e-3)
+ optimizer = fluid.optimizer.SGD(learning_rate=1e-4)
optimizer.minimize(loss)
diff --git a/legacy/README.md b/legacy/README.md
index f7741c9b7b1c39e569e74606d054847b27a206d8..f0719c1a26c04341e8de327143dc826248bb3607 100644
--- a/legacy/README.md
+++ b/legacy/README.md
@@ -1,3 +1,6 @@
+
+# The models in this directory are no longer maintained and are not recommended. Please use the models under the fluid directory instead.
+
# Introduction to models
[![Documentation Status](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](https://github.com/PaddlePaddle/models)