Merge pull request #12 from PaddlePaddle/develop

update

e6ed31ef · zhengya01 · GitHub · 366d6612 · 747f947e · e6ed31ef
132 changed files
--- a/README.md
+++ b/README.md
@@ -8,8 +8,66 @@ PaddlePaddle provides a rich set of computational units to enable users to adopt
 - [fluid models](fluid): use PaddlePaddle's Fluid APIs. We especially recommend users to use Fluid models.
- [legacy models](legacy): use PaddlePaddle's v2 APIs.
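+PaddlePaddle provides a rich set of computational units that let users take a modular approach to solving various learning problems. In this repo, we demonstrate how to use PaddlePaddle to solve common machine learning tasks, providing several different neural network models that are easy to learn and use.
+- [fluid models](fluid): use the APIs of the PaddlePaddle Fluid version. We especially recommend that you use the Fluid models.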
+## PaddleCV
+Model|Description|Advantages|Reference paper
+--|:--:|:--:|:--:
+[AlexNet](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification/models)|Classic image classification model|The first successful use of ReLU, Dropout, and LRN in a CNN, with GPU-accelerated computation|[ImageNet Classification with Deep Convolutional Neural Networks](https://www.researchgate.net/publication/267960550_ImageNet_Classification_with_Deep_Convolutional_Neural_Networks)
+[VGG](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification/models)|Classic image classification model|Builds on AlexNet with small 3*3 convolution kernels and greater network depth; generalizes well|[Very Deep ConvNets for Large-Scale Image Recognition](https://arxiv.org/pdf/1409.1556.pdf)
+[GoogleNet](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification/models)|Classic image classification model|Increases network depth and width without increasing the computational load, giving better performance|[Going deeper with convolutions](https://ieeexplore.ieee.org/document/7298594)
+[ResNet](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification/models)|Residual network|Introduces a new residual structure that solves the problem of accuracy dropping as the network gets deeper|[Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385)
+[Inception-v4](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification/models)|Classic image classification model|A deeper and wider Inception structure|[Inception-ResNet and the Impact of Residual Connections on Learning](http://arxiv.org/abs/1602.07261)
+[MobileNet](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification/models)|Lightweight network model|An efficient model designed for mobile and embedded devices|[MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861)
+[DPN](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification/models)|Image classification model|Combines the network structures of DenseNet and ResNeXt, improving image classification results|[Dual Path Networks](https://arxiv.org/abs/1707.01629)
+[SE-ResNeXt](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/image_classification/models)|Image classification model|Adds SE blocks to ResNeXt, improving model accuracy|[Squeeze-and-excitation networks](https://arxiv.org/abs/1709.01507)
+[SSD](https://github.com/PaddlePaddle/models/blob/develop/fluid/PaddleCV/object_detection/README_cn.md)|Single-stage object detector|Detects objects of each scale on feature maps of the corresponding scale, and can be easily plugged into any standard convolutional network|[SSD: Single Shot MultiBox Detector](https://arxiv.org/abs/1512.02325)
+[Face Detector: PyramidBox](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/face_detection/README_cn.md)|SSD-based single-stage face detector|Uses contextual information to detect hard faces; highly expressive and robust network|[PyramidBox: A Context-assisted Single Shot Face Detector](https://arxiv.org/pdf/1803.07737.pdf)
+[Faster RCNN](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/faster_rcnn/README_cn.md)|Typical two-stage object detector|Innovatively uses a convolutional network to generate region proposals itself and shares that network with the object detection network, reducing the number of proposals and improving their quality|[Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks](https://arxiv.org/abs/1506.01497)
+[ICNet](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/icnet)|Real-time image semantic segmentation model|Considers both speed and accuracy, balancing the accuracy of high-resolution images against the efficiency of low-complexity networks|[ICNet for Real-Time Semantic Segmentation on High-Resolution Images](https://arxiv.org/abs/1704.08545)
+[DCGAN](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/gan/c_gan)|Image generation model|Deep convolutional generative adversarial network; combines GAN with convolutional networks to address unstable GAN training|[Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks](https://arxiv.org/pdf/1511.06434.pdf)
+[ConditionalGAN](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/gan/c_gan)|Image generation model|Conditional generative adversarial network, a GAN with conditional constraints; extra information conditions the model and can guide the data generation process|[Conditional Generative Adversarial Nets](https://arxiv.org/abs/1411.1784)
+[CycleGAN](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/gan/cycle_gan)|Image translation model|Automatically converts images of one class into images of another class; can be used for style transfer|[Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks](https://arxiv.org/abs/1703.10593)
+[CRNN-CTC model](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/ocr_recognition)|Scene text recognition model|Uses a CTC model to recognize a single line of English characters in an image|[Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks](https://www.researchgate.net/publication/221346365_Connectionist_temporal_classification_Labelling_unsegmented_sequence_data_with_recurrent_neural_'networks)
+[Attention model](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/ocr_recognition)|Scene text recognition model|Uses attention to recognize a single line of English characters in an image|[Recurrent Models of Visual Attention](https://arxiv.org/abs/1406.6247)
+[Metric Learning](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/metric_learning)|Metric learning model|Can be used to analyze association and comparison relationships between objects; applicable to assisting classification and clustering problems, and widely used in image retrieval, face recognition, and similar fields|-
+[TSN](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/video_classification)|Video classification model|Models long-range temporal structure, combining a sparse temporal sampling strategy with video-level supervision so that learning from the whole video is effective and efficient|[Temporal Segment Networks: Towards Good Practices for Deep Action Recognition](https://arxiv.org/abs/1608.00859)
+[caffe2fluid](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleCV/caffe2fluid)|Tool for converting Caffe models into Paddle Fluid configuration and model files|-|-
+## PaddleNLP
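+Model|Description|Advantages|Reference paper
+--|:--:|:--:|:--:
+[Transformer](https://github.com/PaddlePaddle/models/blob/develop/fluid/PaddleNLP/neural_machine_translation/transformer/README_cn.md)|Machine translation model|Based on self-attention; low computational complexity, high parallelism, learns long-range dependencies easily, and translates better|[Attention Is All You Need](https://arxiv.org/abs/1706.03762)
+[LAC](https://github.com/baidu/lac/blob/master/README.md)|Joint lexical analysis model|Performs Chinese word segmentation, part-of-speech tagging, and proper noun recognition as a whole|[Chinese Lexical Analysis with Deep Bi-GRU-CRF Network](https://arxiv.org/abs/1807.01882)
+[Senta](https://github.com/baidu/Senta/blob/master/README.md)|Collection of sentiment analysis models|The sentiment analysis models from the Baidu AI open platform|-
+[DAM](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleNLP/deep_attention_matching_net)|Semantic matching model|Work by Baidu's Natural Language Processing Department published at ACL 2018, used for response selection in multi-turn dialogue for retrieval-based chatbots|[Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network](http://aclweb.org/anthology/P18-1103)
+[SimNet](https://github.com/baidu/AnyQ/blob/master/tools/simnet/train/paddle/README.md)|Semantic matching framework|Models built with SimNet can be conveniently added to the AnyQ system to strengthen its semantic matching capability|-
+[DuReader](https://github.com/PaddlePaddle/models/blob/develop/fluid/PaddleNLP/machine_reading_comprehension/README.md)|Reading comprehension model|Machine reading comprehension model on Baidu's MRC dataset|-
+[Bi-GRU-CRF](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleNLP/sequence_tagging_for_ner/README.md)|Named entity recognition|Named entity recognition model combining CRF and bidirectional GRU|-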
+## PaddleRec
+Model|Description|Advantages|Reference paper
+--|:--:|:--:|:--:
+[TagSpace](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleRec/tagspace)|Embedding representation learning model for text and tags|Applied to industrial-scale tag recommendation; concrete scenarios include feed news tag recommendation|[#TagSpace: Semantic embeddings from hashtags](https://www.bibsonomy.org/bibtex/0ed4314916f8e7c90d066db45c293462)
+[GRU4Rec](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleRec/gru4rec)|Personalized recommendation model|The first to apply an RNN (GRU) to session-based recommendation, with clear improvements over traditional KNN and matrix factorization|[Session-based Recommendations with Recurrent Neural Networks](https://arxiv.org/abs/1511.06939)
+[SSR](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleRec/ssr)|Sequence semantic retrieval recommendation model|Following the idea of the reference paper, predicts user behavior at multiple time granularities|[Multi-Rate Deep Learning for Temporal Recommendation](https://dl.acm.org/citation.cfm?id=2914726)
+[DeepCTR](https://github.com/PaddlePaddle/models/blob/develop/fluid/PaddleRec/ctr/README.cn.md)|Click-through rate prediction model|Implements only the DNN part of the model described in the DeepFM paper; DeepFM will be given in other examples|[DeepFM: A Factorization-Machine based Neural Network for CTR Prediction](https://arxiv.org/abs/1703.04247)
+[Multiview-Simnet](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleRec/multiview_simnet)|Personalized recommendation model|Based on multiple views; merges multiple feature views of users and items into one unified model|[A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems](http://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/frp1159-songA.pdf)
+## Other Models
+Model|Description|Advantages|Reference paper
+--|:--:|:--:|:--:
+[DeepASR](https://github.com/PaddlePaddle/models/blob/develop/fluid/DeepASR/README_cn.md)|Speech recognition system|Uses the Fluid framework to configure and train the acoustic model for speech recognition, and integrates the Kaldi decoder|-
+[DQN](https://github.com/PaddlePaddle/models/blob/develop/fluid/DeepQNetwork/README_cn.md)|Deep Q-network|Value-based reinforcement learning algorithm; the first model to successfully combine deep learning and reinforcement learning|[Human-level control through deep reinforcement learning](https://www.nature.com/articles/nature14236)
+[DoubleDQN](https://github.com/PaddlePaddle/models/blob/develop/fluid/DeepQNetwork/README_cn.md)|DQN variant|Applies the Double Q idea to DQN to address the overestimation problem|[Deep Reinforcement Learning with Double Q-Learning](https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/viewPaper/12389)
+[DuelingDQN](https://github.com/PaddlePaddle/models/blob/develop/fluid/DeepQNetwork/README_cn.md)|DQN variant|Improves on the DQN model and raises its performance|[Dueling Network Architectures for Deep Reinforcement Learning](http://proceedings.mlr.press/v48/wangf16.html)
 ## License
 This tutorial is contributed by [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) and licensed under the [Apache-2.0 license](LICENSE).
+## License
+This guide is contributed by [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) and is licensed under the [Apache-2.0 license](LICENSE).
--- a/fluid/PaddleCV/HiNAS_models/nn_paddle.py
+++ b/fluid/PaddleCV/HiNAS_models/nn_paddle.py
@@ -21,6 +21,7 @@ import math
 import numpy as np
 import paddle
 import paddle.fluid as fluid
+from paddle.fluid.contrib.trainer import *
 from paddle.fluid.layers.learning_rate_scheduler import _decay_step_counter
 import reader
@@ -104,7 +105,7 @@ class Model(object):
        accs = []
        def event_handler(event):
-            if isinstance(event, fluid.EndStepEvent):
+            if isinstance(event, EndStepEvent):
                costs.append(event.metrics[0])
                accs.append(event.metrics[1])
                if event.step % 20 == 0:
@@ -113,7 +114,7 @@ class Model(object):
                    del costs[:]
                    del accs[:]
-            if isinstance(event, fluid.EndEpochEvent):
+            if isinstance(event, EndEpochEvent):
                if event.epoch % 3 == 0 or event.epoch == FLAGS.num_epochs - 1:
                    avg_cost, accuracy = trainer.test(
                        reader=test_reader, feed_order=['pixel', 'label'])
@@ -126,7 +127,7 @@ class Model(object):
        event_handler.best_acc = 0.0
        place = fluid.CUDAPlace(0)
-        trainer = fluid.Trainer(
+        trainer = Trainer(
            train_func=self.train_network,
            optimizer_func=self.optimizer_program,
            place=place)

--- a/fluid/PaddleCV/caffe2fluid/kaffe/paddle/network.py
+++ b/fluid/PaddleCV/caffe2fluid/kaffe/paddle/network.py
@@ -440,7 +440,8 @@ class Network(object):
        if need_transpose:
            order = range(dims)
-            order.remove(axis).append(axis)
+            order.remove(axis)
+            order.append(axis)
            input = fluid.layers.transpose(
                input,
                perm=order,
@@ -525,11 +526,21 @@ class Network(object):
        scale_shape = input.shape[axis:axis + num_axes]
        param_attr = fluid.ParamAttr(name=prefix + 'scale')
        scale_param = fluid.layers.create_parameter(
-            shape=scale_shape, dtype=input.dtype, name=name, attr=param_attr)
+            shape=scale_shape,
+            dtype=input.dtype,
+            name=name,
+            attr=param_attr,
+            is_bias=True,
+            default_initializer=fluid.initializer.Constant(value=1.0))
        offset_attr = fluid.ParamAttr(name=prefix + 'offset')
        offset_param = fluid.layers.create_parameter(
-            shape=scale_shape, dtype=input.dtype, name=name, attr=offset_attr)
+            shape=scale_shape,
+            dtype=input.dtype,
+            name=name,
+            attr=offset_attr,
+            is_bias=True,
+            default_initializer=fluid.initializer.Constant(value=0.0))
        output = fluid.layers.elementwise_mul(
            input,
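
The `order` change in this hunk splits a chained call because `list.remove` mutates the list in place and returns `None`, so `order.remove(axis).append(axis)` raises `AttributeError`. A minimal pure-Python sketch of the intended permutation follows. The helper name `move_axis_to_end` is hypothetical, and `list(range(...))` is used on the assumption that the code may also run under Python 3, where a `range` object has no `remove` method.

```python
def move_axis_to_end(dims, axis):
    """Return a transpose permutation that moves `axis` to the last position."""
    order = list(range(dims))  # a real list, so remove/append are available
    order.remove(axis)         # mutates in place and returns None
    order.append(axis)
    return order


if __name__ == "__main__":
    print(move_axis_to_end(4, 1))  # [0, 2, 3, 1]
```

The resulting list is the kind of value passed as `perm` to the transpose call in the hunk.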

--- a/fluid/PaddleCV/deeplabv3+/.run_ce.sh
+++ b/fluid/PaddleCV/deeplabv3+/.run_ce.sh
+#!/bin/bash
+export MKL_NUM_THREADS=1
+export OMP_NUM_THREADS=1
+DATASET_PATH=${HOME}/.cache/paddle/dataset/cityscape/
+cudaid=${deeplabv3plus:=0} # use 0-th card as default
+export CUDA_VISIBLE_DEVICES=$cudaid
+FLAGS_benchmark=true  python train.py \
+--batch_size=2 \
+--train_crop_size=769 \
+--total_step=50 \
+--save_weights_path=output1 \
+--dataset_path=$DATASET_PATH \
+--enable_ce | python _ce.py
+cudaid=${deeplabv3plus_m:=0,1,2,3} # use 0,1,2,3 card as default
+export CUDA_VISIBLE_DEVICES=$cudaid
+FLAGS_benchmark=true  python train.py \
+--batch_size=2 \
+--train_crop_size=769 \
+--total_step=50 \
+--save_weights_path=output4 \
+--dataset_path=$DATASET_PATH \
+--enable_ce | python _ce.py
--- a/fluid/PaddleCV/deeplabv3+/README.md
+++ b/fluid/PaddleCV/deeplabv3+/README.md
@@ -76,7 +76,7 @@ python ./train.py \
    --train_crop_size=769 \
    --total_step=90000 \
    --init_weights_path=deeplabv3plus_xception65_initialize.params \
-    --save_weights_path=output \
+    --save_weights_path=output/ \
    --dataset_path=$DATASET_PATH
 ```

--- a/fluid/PaddleRec/ssr/__init__.py
+++ b/fluid/PaddleRec/ssr/__init__.py
--- a/fluid/PaddleCV/deeplabv3+/_ce.py
+++ b/fluid/PaddleCV/deeplabv3+/_ce.py
+# this file is only used for continuous evaluation test!
+import os
+import sys
+sys.path.append(os.environ['ceroot'])
+from kpi import CostKpi
+from kpi import DurationKpi
+each_pass_duration_card1_kpi = DurationKpi('each_pass_duration_card1', 0.1, 0, actived=True)
+train_loss_card1_kpi = CostKpi('train_loss_card1', 0.05, 0)
+each_pass_duration_card4_kpi = DurationKpi('each_pass_duration_card4', 0.1, 0, actived=True)
+train_loss_card4_kpi = CostKpi('train_loss_card4', 0.05, 0)
+tracking_kpis = [
+        each_pass_duration_card1_kpi,
+        train_loss_card1_kpi,
+        each_pass_duration_card4_kpi,
+        train_loss_card4_kpi,
+        ]
+def parse_log(log):
+    '''
+    This method should be implemented by model developers.
+    The suggestion:
+    each line in the log should be key, value, for example:
+    "
+    train_cost\t1.0
+    test_cost\t1.0
+    train_cost\t1.0
+    train_cost\t1.0
+    train_acc\t1.2
+    "
+    '''
+    for line in log.split('\n'):
+        fs = line.strip().split('\t')
+        print(fs)
+        if len(fs) == 3 and fs[0] == 'kpis':
+            kpi_name = fs[1]
+            kpi_value = float(fs[2])
+            yield kpi_name, kpi_value
+def log_to_ce(log):
+    kpi_tracker = {}
+    for kpi in tracking_kpis:
+        kpi_tracker[kpi.name] = kpi
+    for (kpi_name, kpi_value) in parse_log(log):
+        print(kpi_name, kpi_value)
+        kpi_tracker[kpi_name].add_record(kpi_value)
+        kpi_tracker[kpi_name].persist()
+if __name__ == '__main__':
+    log = sys.stdin.read()
+    log_to_ce(log)
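
Note that `parse_log` as added here only yields lines with three tab-separated fields whose first field is `kpis`, which differs from the two-field `key\tvalue` layout described in its docstring. A minimal sketch of that three-field format, independent of the `kpi` module, might look like this. The function name `iter_kpi_records` and the sample log text are illustrative assumptions.

```python
def iter_kpi_records(log):
    """Yield (name, value) for each 'kpis<TAB>name<TAB>value' line in `log`."""
    for line in log.split('\n'):
        fields = line.strip().split('\t')
        # Ordinary training output has no 'kpis' prefix and is skipped.
        if len(fields) == 3 and fields[0] == 'kpis':
            yield fields[1], float(fields[2])


if __name__ == "__main__":
    sample_log = (
        "step 1, loss: 2.500000, step_time_cost: 0.310\n"
        "kpis\teach_pass_duration_card1\t0.31\n"
        "kpis\ttrain_loss_card1\t2.5\n")
    for name, value in iter_kpi_records(sample_log):
        print(name, value)
```

Each yielded name must match one of the names registered in `tracking_kpis`; otherwise the dictionary lookup in `log_to_ce` raises `KeyError`.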
--- a/fluid/PaddleCV/deeplabv3+/eval.py
+++ b/fluid/PaddleCV/deeplabv3+/eval.py
@@ -26,6 +26,7 @@ def add_arguments():
    add_argument('dataset_path', str, None, "Cityscape dataset path.")
    add_argument('verbose', bool, False, "Print mIoU for each step if verbose.")
    add_argument('use_gpu', bool, True, "Whether use GPU or CPU.")
+    add_argument('num_classes', int, 19, "Number of classes.")
 def mean_iou(pred, label):
@@ -69,7 +70,7 @@ tp = fluid.Program()
 batch_size = 1
 reader.default_config['crop_size'] = -1
 reader.default_config['shuffle'] = False
-num_classes = 19
+num_classes = args.num_classes
 with fluid.program_guard(tp, sp):
    img = fluid.layers.data(name='img', shape=[3, 0, 0], dtype='float32')
@@ -84,7 +85,7 @@ tp = tp.clone(True)
 fluid.memory_optimize(
    tp,
    print_log=False,
-    skip_opt_set=[pred.name, miou, out_wrong, out_correct],
+    skip_opt_set=set([pred.name, miou, out_wrong, out_correct]),
    level=1)
 place = fluid.CPUPlace()
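
The new `num_classes` flag goes through the script's `add_argument` helper. The sketch below assumes that helper matches the one-line version visible in `train.py` (it forwards `type` straight to `argparse`); here the parser is passed explicitly so the snippet is self-contained. One consequence of forwarding `type=bool` is that any non-empty string, including `"False"`, parses as `True`.

```python
import argparse


def add_argument(parser, name, type, default, help):
    # Same shape as the helper in train.py, with the parser made explicit.
    parser.add_argument('--' + name, default=default, type=type, help=help)


def build_parser():
    parser = argparse.ArgumentParser()
    add_argument(parser, 'num_classes', int, 19, "Number of classes.")
    add_argument(parser, 'use_gpu', bool, True, "Whether use GPU or CPU.")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(['--num_classes=21', '--use_gpu=False'])
    print(args.num_classes, args.use_gpu)
```

If a real on/off switch is wanted, a `store_true` action (as used for `--enable_ce` in `train.py`) avoids the `bool` conversion issue.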

--- a/fluid/PaddleCV/deeplabv3+/models.py
+++ b/fluid/PaddleCV/deeplabv3+/models.py
@@ -20,6 +20,11 @@ op_results = {}
 default_epsilon = 1e-3
 default_norm_type = 'bn'
 default_group_number = 32
+depthwise_use_cudnn = False
+bn_regularizer = fluid.regularizer.L2DecayRegularizer(regularization_coeff=0.0)
+depthwise_regularizer = fluid.regularizer.L2DecayRegularizer(
+    regularization_coeff=0.0)
 @contextlib.contextmanager
@@ -52,20 +57,39 @@ def append_op_result(result, name):
 def conv(*args, **kargs):
-    kargs['param_attr'] = name_scope + 'weights'
+    if "xception" in name_scope:
+        init_std = 0.09
+    elif "logit" in name_scope:
+        init_std = 0.01
+    elif name_scope.endswith('depthwise/'):
+        init_std = 0.33
+    else:
+        init_std = 0.06
+    if name_scope.endswith('depthwise/'):
+        regularizer = depthwise_regularizer
+    else:
+        regularizer = None
+    kargs['param_attr'] = fluid.ParamAttr(
+        name=name_scope + 'weights',
+        regularizer=regularizer,
+        initializer=fluid.initializer.TruncatedNormal(
+            loc=0.0, scale=init_std))
    if 'bias_attr' in kargs and kargs['bias_attr']:
-        kargs['bias_attr'] = name_scope + 'biases'
+        kargs['bias_attr'] = fluid.ParamAttr(
+            name=name_scope + 'biases',
+            regularizer=regularizer,
+            initializer=fluid.initializer.ConstantInitializer(value=0.0))
    else:
        kargs['bias_attr'] = False
+    kargs['name'] = name_scope + 'conv'
    return append_op_result(fluid.layers.conv2d(*args, **kargs), 'conv')
 def group_norm(input, G, eps=1e-5, param_attr=None, bias_attr=None):
-    helper = fluid.layer_helper.LayerHelper('group_norm', **locals())
    N, C, H, W = input.shape
    if C % G != 0:
-        print("group can not divide channle:", C, G)
+        # print "group cannot divide channel:", C, G
        for d in range(10):
            for t in [d, -d]:
                if G + t <= 0: continue
@@ -73,29 +97,16 @@ def group_norm(input, G, eps=1e-5, param_attr=None, bias_attr=None):
                    G = G + t
                    break
            if C % G == 0:
-                print("use group size:", G)
+                # print "use group size:", G
                break
    assert C % G == 0
-    param_shape = (G, )
-    x = input
-    x = fluid.layers.reshape(x, [N, G, C // G * H * W])
-    mean = fluid.layers.reduce_mean(x, dim=2, keep_dim=True)
-    x = x - mean
-    var = fluid.layers.reduce_mean(fluid.layers.square(x), dim=2, keep_dim=True)
-    x = x / fluid.layers.sqrt(var + eps)
-    scale = helper.create_parameter(
-        attr=helper.param_attr,
-        shape=param_shape,
-        dtype='float32',
-        default_initializer=fluid.initializer.Constant(1.0))
-    bias = helper.create_parameter(
-        attr=helper.bias_attr, shape=param_shape, dtype='float32', is_bias=True)
-    x = fluid.layers.elementwise_add(
-        fluid.layers.elementwise_mul(
-            x, scale, axis=1), bias, axis=1)
-    return fluid.layers.reshape(x, input.shape)
+    x = fluid.layers.group_norm(
+        input,
+        groups=G,
+        param_attr=param_attr,
+        bias_attr=bias_attr,
+        name=name_scope + 'group_norm')
+    return x
 def bn(*args, **kargs):
@@ -106,8 +117,10 @@ def bn(*args, **kargs):
                    *args,
                    epsilon=default_epsilon,
                    momentum=bn_momentum,
-                    param_attr=name_scope + 'gamma',
-                    bias_attr=name_scope + 'beta',
+                    param_attr=fluid.ParamAttr(
+                        name=name_scope + 'gamma', regularizer=bn_regularizer),
+                    bias_attr=fluid.ParamAttr(
+                        name=name_scope + 'beta', regularizer=bn_regularizer),
                    moving_mean_name=name_scope + 'moving_mean',
                    moving_variance_name=name_scope + 'moving_variance',
                    **kargs),
@@ -119,8 +132,10 @@ def bn(*args, **kargs):
                    args[0],
                    default_group_number,
                    eps=default_epsilon,
-                    param_attr=name_scope + 'gamma',
-                    bias_attr=name_scope + 'beta'),
+                    param_attr=fluid.ParamAttr(
+                        name=name_scope + 'gamma', regularizer=bn_regularizer),
+                    bias_attr=fluid.ParamAttr(
+                        name=name_scope + 'beta', regularizer=bn_regularizer)),
                'gn')
    else:
        raise "Unsupport norm type:" + default_norm_type
@@ -143,7 +158,8 @@ def seq_conv(input, channel, stride, filter, dilation=1, act=None):
            stride,
            groups=input.shape[1],
            padding=(filter // 2) * dilation,
-            dilation=dilation)
+            dilation=dilation,
+            use_cudnn=depthwise_use_cudnn)
        input = bn(input)
        if act: input = act(input)
    with scope('pointwise'):
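
The `group_norm` wrapper keeps its search for a usable group count before delegating to `fluid.layers.group_norm`: when `C` is not divisible by `G`, it tries `G + d` and then `G - d` for increasing `d`. A pure-Python sketch of that search is below. The divisibility test hidden between the two hunks is assumed to be `C % (G + t) == 0`, the function name is hypothetical, and the sketch raises `ValueError` where the original relies on `assert`.

```python
def nearest_valid_group_count(channels, groups, max_offset=10):
    """Return the group count nearest to `groups` that divides `channels`."""
    for d in range(max_offset):
        # Try the larger candidate first, mirroring `for t in [d, -d]`.
        for candidate in (groups + d, groups - d):
            if candidate > 0 and channels % candidate == 0:
                return candidate
    raise ValueError("no group count within %d of %d divides %d" %
                     (max_offset - 1, groups, channels))


if __name__ == "__main__":
    print(nearest_valid_group_count(48, 32))  # 24
```

With the default of 32 groups, a 48-channel input therefore falls back to 24 groups.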

--- a/fluid/PaddleCV/deeplabv3+/train.py
+++ b/fluid/PaddleCV/deeplabv3+/train.py
@@ -13,6 +13,7 @@ import reader
 import models
 import time
 def add_argument(name, type, default, help):
    parser.add_argument('--' + name, default=default, type=type, help=help)
@@ -32,15 +33,35 @@ def add_arguments():
    add_argument('dataset_path', str, None, "Cityscape dataset path.")
    add_argument('parallel', bool, False, "using ParallelExecutor.")
    add_argument('use_gpu', bool, True, "Whether use GPU or CPU.")
+    add_argument('num_classes', int, 19, "Number of classes.")
+    parser.add_argument(
+        '--enable_ce',
+        action='store_true',
+        help='If set, run the task with continuous evaluation logs.')
 def load_model():
+    myvars = [
+        x for x in tp.list_vars()
+        if isinstance(x, fluid.framework.Parameter) and x.name.find('logit') ==
+        -1
+    ]
    if args.init_weights_path.endswith('/'):
-        fluid.io.load_params(
-            exe, dirname=args.init_weights_path, main_program=tp)
+        if args.num_classes == 19:
+            fluid.io.load_params(
+                exe, dirname=args.init_weights_path, main_program=tp)
+        else:
+            fluid.io.load_vars(exe, dirname=args.init_weights_path, vars=myvars)
    else:
-        fluid.io.load_params(
-            exe, dirname="", filename=args.init_weights_path, main_program=tp)
+        if args.num_classes == 19:
+            fluid.io.load_params(
+                exe,
+                dirname="",
+                filename=args.init_weights_path,
+                main_program=tp)
+        else:
+            fluid.io.load_vars(
+                exe, dirname="", filename=args.init_weights_path, vars=myvars)
 def save_model():
@@ -70,6 +91,15 @@ def loss(logit, label):
    return loss, label_nignore
+def get_cards(args):
+    if args.enable_ce:
+        cards = os.environ.get('CUDA_VISIBLE_DEVICES')
+        num = len(cards.split(","))
+        return num
+    else:
+        return args.num_devices
 CityscapeDataset = reader.CityscapeDataset
 parser = argparse.ArgumentParser()
@@ -80,16 +110,24 @@ args = parser.parse_args()
 models.clean()
 models.bn_momentum = 0.9997
 models.dropout_keep_prop = 0.9
+models.label_number = args.num_classes
 deeplabv3p = models.deeplabv3p
 sp = fluid.Program()
 tp = fluid.Program()
+# only for ce
+if args.enable_ce:
+    SEED = 102
+    sp.random_seed = SEED
+    tp.random_seed = SEED
 crop_size = args.train_crop_size
 batch_size = args.batch_size
 image_shape = [crop_size, crop_size]
 reader.default_config['crop_size'] = crop_size
 reader.default_config['shuffle'] = True
-num_classes = 19
+num_classes = args.num_classes
 weight_decay = 0.00004
 base_lr = args.base_lr
@@ -120,7 +158,7 @@ with fluid.program_guard(tp, sp):
    retv = opt.minimize(loss_mean, startup_program=sp, no_grad_set=no_grad_set)
 fluid.memory_optimize(
-    tp, print_log=False, skip_opt_set=[pred.name, loss_mean.name], level=1)
+    tp, print_log=False, skip_opt_set=set([pred.name, loss_mean.name]), level=1)
 place = fluid.CPUPlace()
 if args.use_gpu:
@@ -140,7 +178,13 @@ if args.parallel:
 batches = dataset.get_batch_generator(batch_size, total_step)
+total_time = 0.0
+epoch_idx = 0
+train_loss = 0
 for i, imgs, labels, names in batches:
+    epoch_idx += 1
+    begin_time = time.time()
    prev_start_time = time.time()
    if args.parallel:
        retv = exe_p.run(fetch_list=[pred.name, loss_mean.name],
@@ -152,11 +196,21 @@ for i, imgs, labels, names in batches:
                             'label': labels},
                       fetch_list=[pred, loss_mean])
    end_time = time.time()
+    total_time += end_time - begin_time
    if i % 100 == 0:
        print("Model is saved to", args.save_weights_path)
        save_model()
-    print("step {:d}, loss: {:.6f}, step_time_cost: {:.3f}" .format(i,
-                    np.mean(retv[1]), end_time - prev_start_time))
+    print("step {:d}, loss: {:.6f}, step_time_cost: {:.3f}".format(
+        i, np.mean(retv[1]), end_time - prev_start_time))
+    # only for ce
+    train_loss = np.mean(retv[1])
+if args.enable_ce:
+    gpu_num = get_cards(args)
+    print("kpis\teach_pass_duration_card%s\t%s" %
+          (gpu_num, total_time / epoch_idx))
+    print("kpis\ttrain_loss_card%s\t%s" % (gpu_num, train_loss))
 print("Training done. Model is saved to", args.save_weights_path)
 save_model()
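
The continuous-evaluation path added above derives the card count from `CUDA_VISIBLE_DEVICES` and prints two tab-separated `kpis` lines whose names must match those registered in `_ce.py`. A minimal sketch of that logic without Paddle is shown below. Both function names are hypothetical, and the fallback to a default when the variable is unset is an added assumption: `get_cards` in the diff calls `.split` on the result of `os.environ.get` directly, which would fail on `None` if the variable were not exported.

```python
def count_visible_cards(environ, default=1):
    """Count device ids listed in CUDA_VISIBLE_DEVICES, e.g. '0,1,2,3' -> 4."""
    cards = environ.get('CUDA_VISIBLE_DEVICES')
    if not cards:
        return default  # assumption: fall back instead of raising
    return len([c for c in cards.split(',') if c.strip()])


def format_kpi_lines(gpu_num, total_time, steps, train_loss):
    """Build the two 'kpis' lines in the format that _ce.py parses."""
    return [
        "kpis\teach_pass_duration_card%s\t%s" % (gpu_num, total_time / steps),
        "kpis\ttrain_loss_card%s\t%s" % (gpu_num, train_loss),
    ]


if __name__ == "__main__":
    num = count_visible_cards({'CUDA_VISIBLE_DEVICES': '0,1,2,3'})
    for line in format_kpi_lines(num, 10.0, 50, 1.25):
        print(line)
```

Because the KPI names embed the card count, only runs with 1 or 4 visible devices produce names that `_ce.py` tracks.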
--- a/fluid/PaddleCV/face_detection/.run_ce.sh
+++ b/fluid/PaddleCV/face_detection/.run_ce.sh
+#!/bin/bash
+export MKL_NUM_THREADS=1
+export OMP_NUM_THREADS=1
+cudaid=${face_detection:=0} # use 0-th card as default
+export CUDA_VISIBLE_DEVICES=$cudaid
+FLAGS_benchmark=true  python train.py --batch_size=2 --epoc_num=1 --batch_num=200 --parallel=False --enable_ce | python _ce.py
+cudaid=${face_detection_m:=0,1,2,3} # use 0,1,2,3 card as default
+export CUDA_VISIBLE_DEVICES=$cudaid
+FLAGS_benchmark=true  python train.py --batch_size=8 --epoc_num=1 --batch_num=200 --parallel=False --enable_ce | python _ce.py
--- a/fluid/PaddleCV/face_detection/README_cn.md
+++ b/fluid/PaddleCV/face_detection/README_cn.md
@@ -99,7 +99,7 @@ python -u train.py --batch_size=16 --pretrained_model=vgg_ilsvrc_16_fc_reduced
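 Data augmentation used in model training:
-**Data augmentation**: Data reading is defined in `reader.py`, and all images are resized to 640x640. During training, images are also augmented with random perturbation, flipping, cropping, and so on, similar to the data augmentation in the [SSD object detection algorithm](https://github.com/PaddlePaddle/models/blob/develop/fluid/object_detection/README_cn.md#%E8%AE%AD%E7%BB%83-pascal-voc-%E6%95%B0%E6%8D%AE%E9%9B%86). In addition, the Data-anchor-sampling mentioned above is added:
+**Data augmentation**: Data reading is defined in `reader.py`, and all images are resized to 640x640. During training, images are also augmented with random perturbation, flipping, cropping, and so on, similar to the data augmentation in the [SSD object detection algorithm](https://github.com/PaddlePaddle/models/blob/develop/fluid/PaddleCV/object_detection/README.md). In addition, the Data-anchor-sampling mentioned above is added:
  **Scale transformation (Data-anchor-sampling)**: Randomly rescales the image to a scale within a certain range, greatly increasing the scale variation of faces. Concretely, from the height and width of a randomly selected face, compute $v=\\sqrt{width * height}$ and determine which interval of the scale list $[16, 32, 64, 128, 256, 512]$ the value $v$ falls in. Suppose $v=45$; then $32<v<64$ is selected, and one value from $[16, 32, 64]$ is chosen with uniform probability. If $64$ is chosen, the rescaled size for that face is selected from the range $[64 / 2, min(v * 2, 64 * 2)]$.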
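
The Data-anchor-sampling rule described above can be sketched in a few lines of plain Python. This is an illustration of the README's description only, not the code in `reader.py`: the function name, the handling of faces smaller than 16 or larger than 512, and the uniform draw of the final size within the stated range are all assumptions.

```python
import math
import random

ANCHOR_SCALES = [16, 32, 64, 128, 256, 512]


def sample_face_scale(width, height, rng=random):
    """Pick a target face size following the Data-anchor-sampling description."""
    v = math.sqrt(width * height)
    # Index of the first anchor >= v, i.e. the upper end of v's interval;
    # very large faces are clamped to the last anchor (assumption).
    upper = next((i for i, s in enumerate(ANCHOR_SCALES) if s >= v),
                 len(ANCHOR_SCALES) - 1)
    # Choose uniformly among all anchors up to and including that one.
    chosen = rng.choice(ANCHOR_SCALES[:upper + 1])
    # Final size lies in [chosen / 2, min(v * 2, chosen * 2)].
    return rng.uniform(chosen / 2.0, min(v * 2, chosen * 2))


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_face_scale(45, 45, rng))
```

For a 45x45 face the candidates are 16, 32, and 64, so the sampled size always falls between 8 and 90.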

--- a/fluid/PaddleCV/face_detection/__init__.py
+++ b/fluid/PaddleCV/face_detection/__init__.py
--- a/fluid/PaddleCV/face_detection/_ce.py
+++ b/fluid/PaddleCV/face_detection/_ce.py
+# this file is only used for continuous evaluation test!
+import os
+import sys
+sys.path.append(os.environ['ceroot'])
+from kpi import CostKpi
+from kpi import DurationKpi
+each_pass_duration_card1_kpi = DurationKpi('each_pass_duration_card1', 0.08, 0, actived=True)
+train_face_loss_card1_kpi = CostKpi('train_face_loss_card1', 0.08, 0)
+train_head_loss_card1_kpi = CostKpi('train_head_loss_card1', 0.08, 0)
+each_pass_duration_card4_kpi = DurationKpi('each_pass_duration_card4', 0.08, 0, actived=True)
+train_face_loss_card4_kpi = CostKpi('train_face_loss_card4', 0.08, 0)
+train_head_loss_card4_kpi = CostKpi('train_head_loss_card4', 0.08, 0)
+tracking_kpis = [
+        each_pass_duration_card1_kpi,
+        train_face_loss_card1_kpi,
+        train_head_loss_card1_kpi,
+        each_pass_duration_card4_kpi,
+        train_face_loss_card4_kpi,
+        train_head_loss_card4_kpi,
+        ]
+def parse_log(log):
+    '''
+    This method should be implemented by model developers.
+    The suggestion:
+    each line in the log should be key, value, for example:
+    "
+    train_cost\t1.0
+    test_cost\t1.0
+    train_cost\t1.0
+    train_cost\t1.0
+    train_acc\t1.2
+    "
+    '''
+    for line in log.split('\n'):
+        fs = line.strip().split('\t')
+        print(fs)
+        if len(fs) == 3 and fs[0] == 'kpis':
+            kpi_name = fs[1]
+            kpi_value = float(fs[2])
+            yield kpi_name, kpi_value
+def log_to_ce(log):
+    kpi_tracker = {}
+    for kpi in tracking_kpis:
+        kpi_tracker[kpi.name] = kpi
+    for (kpi_name, kpi_value) in parse_log(log):
+        print(kpi_name, kpi_value)
+        kpi_tracker[kpi_name].add_record(kpi_value)
+        kpi_tracker[kpi_name].persist()
+if __name__ == '__main__':
+    log = sys.stdin.read()
+    log_to_ce(log)
--- a/fluid/PaddleCV/face_detection/data_util.py
+++ b/fluid/PaddleCV/face_detection/data_util.py
-"""
-This code is based on https://github.com/fchollet/keras/blob/master/keras/utils/data_utils.py
-"""
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-import time
-import numpy as np
-import threading
-import multiprocessing
-import traceback
-try:
-    import queue
-except ImportError:
-    import Queue as queue
-class GeneratorEnqueuer(object):
-    """
-    Builds a queue out of a data generator.
-    Args:
-        generator: a generator function which endlessly yields data
-        use_multiprocessing (bool): use multiprocessing if True,
-            otherwise use threading.
-        wait_time (float): time to sleep in-between calls to `put()`.
-        random_seed (int): Initial seed for workers,
-            will be incremented by one for each workers.
-    """
-    def __init__(self,
-                 generator,
-                 use_multiprocessing=False,
-                 wait_time=0.05,
-                 random_seed=None):
-        self.wait_time = wait_time
-        self._generator = generator
-        self._use_multiprocessing = use_multiprocessing
-        self._threads = []
-        self._stop_event = None
-        self.queue = None
-        self._manager = None
-        self.seed = random_seed
-    def start(self, workers=1, max_queue_size=10):
-        """
-        Start worker threads which add data from the generator into the queue.
-        Args:
-            workers (int): number of worker threads
-            max_queue_size (int): queue size
-                (when full, threads could block on `put()`)
-        """
-        def data_generator_task():
-            """
-            Data generator task.
-            """
-            def task():
-                if (self.queue is not None and
-                        self.queue.qsize() < max_queue_size):
-                    generator_output = next(self._generator)
-                    self.queue.put((generator_output))
-                else:
-                    time.sleep(self.wait_time)
-            if not self._use_multiprocessing:
-                while not self._stop_event.is_set():
-                    with self.genlock:
-                        try:
-                            task()
-                        except Exception:
-                            traceback.print_exc()
-                            self._stop_event.set()
-                            break
-            else:
-                while not self._stop_event.is_set():
-                    try:
-                        task()
-                    except Exception:
-                        traceback.print_exc()
-                        self._stop_event.set()
-                        break
-        try:
-            if self._use_multiprocessing:
-                self._manager = multiprocessing.Manager()
-                self.queue = self._manager.Queue(maxsize=max_queue_size)
-                self._stop_event = multiprocessing.Event()
-            else:
-                self.genlock = threading.Lock()
-                self.queue = queue.Queue()
-                self._stop_event = threading.Event()
-            for _ in range(workers):
-                if self._use_multiprocessing:
-                    # Reset random seed else all children processes
-                    # share the same seed
-                    np.random.seed(self.seed)
-                    thread = multiprocessing.Process(target=data_generator_task)
-                    thread.daemon = True
-                    if self.seed is not None:
-                        self.seed += 1
-                else:
-                    thread = threading.Thread(target=data_generator_task)
-                self._threads.append(thread)
-                thread.start()
-        except:
-            self.stop()
-            raise
-    def is_running(self):
-        """
-        Returns:
-            bool: Whether the worker threads are running.
-        """
-        return self._stop_event is not None and not self._stop_event.is_set()
-    def stop(self, timeout=None):
-        """
-        Stops running threads and wait for them to exit, if necessary.
-        Should be called by the same thread which called `start()`.
-        Args:
-            timeout(int|None): maximum time to wait on `thread.join()`.
-        """
-        if self.is_running():
-            self._stop_event.set()
-        for thread in self._threads:
-            if self._use_multiprocessing:
-                if thread.is_alive():
-                    thread.terminate()
-            else:
-                thread.join(timeout)
-        if self._manager:
-            self._manager.shutdown()
-        self._threads = []
-        self._stop_event = None
-        self.queue = None
-    def get(self):
-        """
-        Creates a generator to extract data from the queue.
-        Skip the data if it is `None`.
-        # Yields
-            tuple of data in the queue.
-        """
-        while self.is_running():
-            if not self.queue.empty():
-                inputs = self.queue.get()
-                if inputs is not None:
-                    yield inputs
-            else:
-                time.sleep(self.wait_time)
--- a/fluid/PaddleCV/face_detection/reader.py
+++ b/fluid/PaddleCV/face_detection/reader.py
@@ -16,8 +16,6 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
-import image_util
-from paddle.utils.image_util import *
 from PIL import Image
 from PIL import ImageDraw
 import numpy as np
@@ -28,7 +26,10 @@ import copy
 import random
 import cv2
 import six
-from data_util import GeneratorEnqueuer
+import math
+from itertools import islice
+import paddle
+import image_util
 class Settings(object):
@@ -199,7 +200,7 @@ def load_file_list(input_txt):
            else:
                file_dict[num_class].append(line_txt)
-    return file_dict
+    return list(file_dict.values())
 def expand_bboxes(bboxes,
@@ -227,13 +228,12 @@ def expand_bboxes(bboxes,
 def train_generator(settings, file_list, batch_size, shuffle=True):
-    file_dict = load_file_list(file_list)
+    def reader():
-    while True:
        if shuffle:
-            np.random.shuffle(file_dict)
+            np.random.shuffle(file_list)
        batch_out = []
-        for index_image in file_dict.keys():
+        for item in file_list:
-            image_name = file_dict[index_image][0]
+            image_name = item[0]
            image_path = os.path.join(settings.data_dir, image_name)
            im = Image.open(image_path)
            if im.mode == 'L':
@@ -242,10 +242,10 @@ def train_generator(settings, file_list, batch_size, shuffle=True):
            # layout: label | xmin | ymin | xmax | ymax
            bbox_labels = []
-            for index_box in range(len(file_dict[index_image])):
+            for index_box in range(len(item)):
                if index_box >= 2:
                    bbox_sample = []
-                    temp_info_box = file_dict[index_image][index_box].split(' ')
+                    temp_info_box = item[index_box].split(' ')
                    xmin = float(temp_info_box[0])
                    ymin = float(temp_info_box[1])
                    w = float(temp_info_box[2])
@@ -277,43 +277,25 @@ def train_generator(settings, file_list, batch_size, shuffle=True):
                yield batch_out
                batch_out = []
+    return reader
-def train(settings,
-          file_list,
-          batch_size,
-          shuffle=True,
-          use_multiprocessing=True,
-          num_workers=8,
-          max_queue=24):
-    def reader():
-        try:
-            enqueuer = GeneratorEnqueuer(
-                train_generator(settings, file_list, batch_size, shuffle),
-                use_multiprocessing=use_multiprocessing)
-            enqueuer.start(max_queue_size=max_queue, workers=num_workers)
-            generator_output = None
-            while True:
-                while enqueuer.is_running():
-                    if not enqueuer.queue.empty():
-                        generator_output = enqueuer.queue.get()
-                        break
-                    else:
-                        time.sleep(0.01)
-                yield generator_output
-                generator_output = None
-        finally:
-            if enqueuer is not None:
-                enqueuer.stop()
-    return reader
+def train(settings, file_list, batch_size, shuffle=True, num_workers=8):
+    file_lists = load_file_list(file_list)
+    n = int(math.ceil(len(file_lists) / float(num_workers)))
+    split_lists = [file_lists[i:i + n] for i in range(0, len(file_lists), n)]
+    readers = []
+    for item in split_lists:
+        readers.append(train_generator(settings, item, batch_size, shuffle))
+    return paddle.reader.multiprocess_reader(readers, False)
 def test(settings, file_list):
-    file_dict = load_file_list(file_list)
+    file_lists = load_file_list(file_list)
    def reader():
-        for index_image in file_dict.keys():
+        for image in file_lists:
-            image_name = file_dict[index_image][0]
+            image_name = image[0]
            image_path = os.path.join(settings.data_dir, image_name)
            im = Image.open(image_path)
            if im.mode == 'L':
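
Review note on the new `train()` in `reader.py`: it splits the file list into per-worker chunks before handing one reader per chunk to `paddle.reader.multiprocess_reader`. The block below is a minimal standalone sketch of that chunking idea, not the PR's code. `split_into_chunks` is a hypothetical helper name introduced here for illustration, and the sketch assumes the goal is at most `num_workers` non-empty chunks.

```python
import math


def split_into_chunks(items, num_workers):
    """Split `items` into at most `num_workers` contiguous, non-empty chunks.

    Hypothetical helper for illustration; not part of the PR.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be >= 1")
    if not items:
        return []
    # Use true division before ceil so that a short list still gets a
    # chunk size of at least 1.
    chunk_size = int(math.ceil(len(items) / float(num_workers)))
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


chunks = split_into_chunks(list(range(10)), 4)
short = split_into_chunks(["a", "b", "c"], 8)
```

With floor division in place of true division, the first call would produce five chunks of two items (more chunks than workers), and a list shorter than `num_workers` would give a step of 0, which `range` rejects with a `ValueError`.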

--- a/fluid/PaddleCV/face_detection/train.py
+++ b/fluid/PaddleCV/face_detection/train.py
@@ -32,6 +32,9 @@ add_arg('mean_BGR',         str,   '104., 117., 123.', "Mean value for B,G,R cha
 add_arg('with_mem_opt',     bool,  True,            "Whether to use memory optimization or not.")
 add_arg('pretrained_model', str,   './vgg_ilsvrc_16_fc_reduced/', "The init model path.")
 add_arg('data_dir',         str,   'data',          "The base dir of dataset")
+parser.add_argument('--enable_ce', action='store_true', help='If set, run the task with continuous evaluation logs.')
+parser.add_argument('--batch_num', type=int, help="batch num for ce")
+parser.add_argument('--num_devices', type=int, default=1, help='Number of GPU devices')
 #yapf: enable
 train_parameters = {
@@ -119,6 +122,16 @@ def train(args, config, train_params, train_file_list):
    startup_prog = fluid.Program()
    train_prog = fluid.Program()
+    #only for ce
+    if args.enable_ce:
+        SEED = 102
+        startup_prog.random_seed = SEED
+        train_prog.random_seed = SEED
+        num_workers = 1
+        pretrained_model = ""
+        if args.batch_num is not None:
+            iters_per_epoc = args.batch_num
    train_py_reader, fetches, loss = build_program(
        train_params = train_params,
        main_prog = train_prog,
@@ -150,9 +163,7 @@ def train(args, config, train_params, train_file_list):
                                train_file_list,
                                batch_size_per_device,
                                shuffle = is_shuffle,
-                                use_multiprocessing=True,
+                                num_workers = num_workers)
-                                num_workers = num_workers,
-                                max_queue=24)
    train_py_reader.decorate_paddle_reader(train_reader)
    if args.parallel:
@@ -169,42 +180,69 @@ def train(args, config, train_params, train_file_list):
        print('save models to %s' % (model_path))
        fluid.io.save_persistables(exe, model_path, main_program=program)
-    train_py_reader.start()
+    total_time = 0.0
-    try:
+    epoch_idx = 0
-        for pass_id in range(start_epoc, epoc_num):
+    face_loss = 0
-            start_time = time.time()
+    head_loss = 0
-            prev_start_time = start_time
+    for pass_id in range(start_epoc, epoc_num):
-            end_time = 0
+        epoch_idx += 1
-            batch_id = 0
+        start_time = time.time()
-            for batch_id in range(iters_per_epoc):
+        prev_start_time = start_time
+        end_time = 0
+        batch_id = 0
+        train_py_reader.start()
+        while True:
+            try:
                prev_start_time = start_time
                start_time = time.time()
                if args.parallel:
                    fetch_vars = train_exe.run(fetch_list=
                        [v.name for v in fetches])
                else:
-                    fetch_vars = exe.run(train_prog,
+                    fetch_vars = exe.run(train_prog, fetch_list=fetches)
-                                         fetch_list=fetches)
                end_time = time.time()
                fetch_vars = [np.mean(np.array(v)) for v in fetch_vars]
+                face_loss = fetch_vars[0]
+                head_loss = fetch_vars[1]
                if batch_id % 10 == 0:
                    if not args.use_pyramidbox:
                        print("Pass {:d}, batch {:d}, loss {:.6f}, time {:.5f}".format(
-                            pass_id, batch_id, fetch_vars[0],
+                            pass_id, batch_id, face_loss,
                            start_time - prev_start_time))
                    else:
                        print("Pass {:d}, batch {:d}, face loss {:.6f}, " \
                              "head loss {:.6f}, " \
                              "time {:.5f}".format(pass_id,
-                               batch_id, fetch_vars[0], fetch_vars[1],
+                               batch_id, face_loss, head_loss,
                               start_time - prev_start_time))
-            if pass_id % 1 == 0 or pass_id == epoc_num - 1:
+                batch_id += 1
-                save_model(str(pass_id), train_prog)
+            except (fluid.core.EOFException, StopIteration):
-    except fluid.core.EOFException:
+                train_py_reader.reset()
-        train_py_reader.reset()
+                break
-    except StopIteration:
+        epoch_end_time = time.time()
-        train_py_reader.reset()
+        total_time += epoch_end_time - start_time
-    train_py_reader.reset()
+        save_model(str(pass_id), train_prog)
+    # only for ce
+    if args.enable_ce:
+        gpu_num = get_cards(args)
+        print("kpis\teach_pass_duration_card%s\t%s" %
+                (gpu_num, total_time / epoch_idx))
+        print("kpis\ttrain_face_loss_card%s\t%s" %
+                (gpu_num, face_loss))
+        print("kpis\ttrain_head_loss_card%s\t%s" %
+                (gpu_num, head_loss))
+def get_cards(args):
+    if args.enable_ce:
+        cards = os.environ.get('CUDA_VISIBLE_DEVICES')
+        num = len(cards.split(","))
+        return num
+    else:
+        return args.num_devices
 if __name__ == '__main__':
    args = parser.parse_args()
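
Review note on `get_cards()`: it derives the card count from `CUDA_VISIBLE_DEVICES`. The block below is a hedged sketch of the same lookup that also tolerates an unset variable and stray spaces. `count_visible_devices` and its `default` and `environ` parameters are hypothetical names introduced for illustration; they do not exist in the PR.

```python
import os


def count_visible_devices(default=1, environ=None):
    """Count the device ids listed in CUDA_VISIBLE_DEVICES.

    Hypothetical helper for illustration. Returns `default` when the
    variable is not set at all.
    """
    environ = os.environ if environ is None else environ
    cards = environ.get('CUDA_VISIBLE_DEVICES')
    if cards is None:
        # Variable not set: fall back instead of calling .split on None.
        return default
    # Ignore empty entries and surrounding whitespace, e.g. "0, 1, 2, 3".
    ids = [c.strip() for c in cards.split(',') if c.strip()]
    return len(ids)


four = count_visible_devices(environ={'CUDA_VISIBLE_DEVICES': '0,1,2,3'})
fallback = count_visible_devices(default=1, environ={})
```

Passing `environ` explicitly keeps the sketch testable without touching the real process environment.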

--- a/fluid/PaddleCV/faster_rcnn/.run_ce.sh
+++ b/fluid/PaddleCV/faster_rcnn/.run_ce.sh
+#!/bin/bash
+export MKL_NUM_THREADS=1
+export OMP_NUM_THREADS=1
+cudaid=${face_detection:=0} # use 0-th card as default
+export CUDA_VISIBLE_DEVICES=$cudaid
+FLAGS_benchmark=true  python train.py --model_save_dir=output/ --data_dir=dataset/coco/ --max_iter=10 --enable_ce --pretrained_model=./imagenet_resnet50_fusebn | python _ce.py
+cudaid=${face_detection_m:=0,1,2,3} # use 0,1,2,3 card as default
+export CUDA_VISIBLE_DEVICES=$cudaid
+FLAGS_benchmark=true  python train.py --model_save_dir=output/ --data_dir=dataset/coco/ --max_iter=10 --enable_ce --pretrained_model=./imagenet_resnet50_fusebn | python _ce.py
--- a/fluid/PaddleCV/faster_rcnn/README.md
+++ b/fluid/PaddleCV/faster_rcnn/README.md
@@ -38,18 +38,6 @@ Train the model on [MS-COCO dataset](http://cocodataset.org/#download), download
 ## Training
-After data preparation, one can start the training step by:
-    python train.py \
-       --model_save_dir=output/ \
-       --pretrained_model=${path_to_pretrain_model} \
-       --data_dir=${path_to_data}
- Set ```export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7``` to specify 8 GPUs for training.
- For more help on arguments:
-    python train.py --help
 **download the pre-trained model:** This sample provides Resnet-50 pre-trained model which is converted from Caffe. The model fuses the parameters in batch normalization layer. One can download pre-trained model as:
    sh ./pretrained/download.sh
@@ -72,6 +60,18 @@ To train the model, [cocoapi](https://github.com/cocodataset/cocoapi) is needed.
    # not to install the COCO API into global site-packages
    python2 setup.py install --user
+After data preparation, one can start the training step by:
+    python train.py \
+       --model_save_dir=output/ \
+       --pretrained_model=${path_to_pretrain_model} \
+       --data_dir=${path_to_data}
+- Set ```export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7``` to specify 8 GPUs for training.
+- For more help on arguments:
+    python train.py --help
 **data reader introduction:**
 * Data reader is defined in `reader.py`.
@@ -128,7 +128,7 @@ Inference is used to get prediction score or image features based on trained mod
    python infer.py \
       --dataset=coco2017 \
        --pretrained_model=${path_to_pretrain_model}  \
-        --image_path=data/COCO17/val2017/  \
+        --image_path=dataset/coco/val2017/  \
        --image_name=000000000139.jpg \
        --draw_threshold=0.6

--- a/fluid/PaddleCV/faster_rcnn/README_cn.md
+++ b/fluid/PaddleCV/faster_rcnn/README_cn.md
@@ -37,18 +37,6 @@ Faster RCNN 目标检测模型
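 ## Model training
-After data preparation, training can be started as follows:
-    python train.py \
-       --model_save_dir=output/ \
-       --pretrained_model=${path_to_pretrain_model} \
-       --data_dir=${path_to_data}
- Set export CUDA\_VISIBLE\_DEVICES=0,1,2,3,4,5,6,7 to train on 8 GPUs.
- For the optional arguments, see:
-    python train.py --help
 **Download the pre-trained model:** This example provides a Resnet-50 pre-trained model converted from Caffe, with the parameters of the batch normalization layers fused. Download the pre-trained model with: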
    sh ./pretrained/download.sh
@@ -71,6 +59,18 @@ Faster RCNN 目标检测模型
    # not to install the COCO API into global site-packages
    python2 setup.py install --user
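+After data preparation, training can be started as follows:
+    python train.py \
+       --model_save_dir=output/ \
+       --pretrained_model=${path_to_pretrain_model} \
+       --data_dir=${path_to_data}
+- Set export CUDA\_VISIBLE\_DEVICES=0,1,2,3,4,5,6,7 to train on 8 GPUs.
+- For the optional arguments, see:
+    python train.py --help
 **Data reader introduction:** The data reader is defined in reader.py. Every image is rescaled, preserving its aspect ratio, so that its shorter side equals `scales`; if the longer side then exceeds `max_size`, the image is rescaled again so that the longer side equals `max_size`. During training, images are flipped horizontally. Images within the same batch can be padded to the same size.
 **Model settings:**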
@@ -124,7 +124,7 @@ Faster RCNN 目标检测模型
    python infer.py \
       --dataset=coco2017 \
        --pretrained_model=${path_to_pretrain_model}  \
-        --image_path=data/COCO17/val2017/  \
+        --image_path=dataset/coco/val2017/  \
        --image_name=000000000139.jpg \
        --draw_threshold=0.6

--- a/fluid/PaddleCV/faster_rcnn/__init__.py
+++ b/fluid/PaddleCV/faster_rcnn/__init__.py
--- a/fluid/PaddleCV/faster_rcnn/_ce.py
+++ b/fluid/PaddleCV/faster_rcnn/_ce.py
+# this file is only used for continuous evaluation test!
+import os
+import sys
+sys.path.append(os.environ['ceroot'])
+from kpi import CostKpi
+from kpi import DurationKpi
+each_pass_duration_card1_kpi = DurationKpi('each_pass_duration_card1', 0.08, 0, actived=True)
+train_loss_card1_kpi = CostKpi('train_loss_card1', 0.08, 0)
+each_pass_duration_card4_kpi = DurationKpi('each_pass_duration_card4', 0.08, 0, actived=True)
+train_loss_card4_kpi = CostKpi('train_loss_card4', 0.08, 0)
+tracking_kpis = [
+        each_pass_duration_card1_kpi,
+        train_loss_card1_kpi,
+        each_pass_duration_card4_kpi,
+        train_loss_card4_kpi,
+        ]
+def parse_log(log):
+    '''
+    This method should be implemented by model developers.
+    The suggestion:
+    each line in the log should be key, value, for example:
+    "
+    train_cost\t1.0
+    test_cost\t1.0
+    train_cost\t1.0
+    train_cost\t1.0
+    train_acc\t1.2
+    "
+    '''
+    for line in log.split('\n'):
+        fs = line.strip().split('\t')
+        print(fs)
+        if len(fs) == 3 and fs[0] == 'kpis':
+            kpi_name = fs[1]
+            kpi_value = float(fs[2])
+            yield kpi_name, kpi_value
+def log_to_ce(log):
+    kpi_tracker = {}
+    for kpi in tracking_kpis:
+        kpi_tracker[kpi.name] = kpi
+    for (kpi_name, kpi_value) in parse_log(log):
+        print(kpi_name, kpi_value)
+        kpi_tracker[kpi_name].add_record(kpi_value)
+        kpi_tracker[kpi_name].persist()
+if __name__ == '__main__':
+    log = sys.stdin.read()
+    log_to_ce(log)
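
Review note on `_ce.py`: `parse_log` picks out lines of the form `kpis<TAB>name<TAB>value` from the training output that is piped in. The block below is a minimal standalone sketch of that filtering, assuming the three-field tab-separated format shown in the diff. `parse_kpi_lines` is a hypothetical name, and skipping non-numeric values is an added assumption rather than the PR's behavior.

```python
def parse_kpi_lines(log):
    """Yield (name, value) for each 'kpis\\tname\\tvalue' line in `log`.

    Hypothetical helper for illustration; every other line is ignored.
    """
    for line in log.splitlines():
        fields = line.strip().split('\t')
        if len(fields) == 3 and fields[0] == 'kpis':
            try:
                value = float(fields[2])
            except ValueError:
                # Assumption: skip malformed values instead of aborting.
                continue
            yield fields[1], value


sample = (
    "Iter 0, lr 0.001000, loss 2.300000, time 0.10000\n"
    "kpis\ttrain_loss_card1\t0.5\n"
    "kpis\teach_pass_duration_card1\t1.25\n"
)
records = list(parse_kpi_lines(sample))
```

Because `log_to_ce` indexes `kpi_tracker` by the parsed name, every KPI name printed by the training script needs a matching entry in `tracking_kpis`.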
--- a/fluid/PaddleCV/faster_rcnn/data_utils.py
+++ b/fluid/PaddleCV/faster_rcnn/data_utils.py
@@ -28,6 +28,7 @@ from __future__ import unicode_literals
 import cv2
 import numpy as np
 from config import cfg
+import os
 def get_image_blob(roidb, mode):
@@ -43,8 +44,11 @@ def get_image_blob(roidb, mode):
        target_size = cfg.TEST.scales[0]
        max_size = cfg.TEST.max_size
    im = cv2.imread(roidb['image'])
-    assert im is not None, \
+    try:
-        'Failed to read image \'{}\''.format(roidb['image'])
+        assert im is not None
+    except AssertionError as e:
+        print('Failed to read image \'{}\''.format(roidb['image']))
+        os._exit(0)
    if roidb['flipped']:
        im = im[:, ::-1, :]
    im, im_scale = prep_im_for_blob(im, cfg.pixel_means, target_size, max_size)

--- a/fluid/PaddleCV/faster_rcnn/train.py
+++ b/fluid/PaddleCV/faster_rcnn/train.py
@@ -35,7 +35,7 @@ def train():
    learning_rate = cfg.learning_rate
    image_shape = [3, cfg.TRAIN.max_size, cfg.TRAIN.max_size]
-    if cfg.debug:
+    if cfg.debug or cfg.enable_ce:
        fluid.default_startup_program().random_seed = 1000
        fluid.default_main_program().random_seed = 1000
        import random
@@ -46,11 +46,14 @@ def train():
    devices_num = len(devices.split(","))
    total_batch_size = devices_num * cfg.TRAIN.im_per_batch
+    use_random = True
+    if cfg.enable_ce:
+        use_random = False
    model = model_builder.FasterRCNN(
        add_conv_body_func=resnet.add_ResNet50_conv4_body,
        add_roi_box_head_func=resnet.add_ResNet_roi_conv5_head,
        use_pyreader=cfg.use_pyreader,
-        use_random=True)
+        use_random=use_random)
    model.build_model(image_shape)
    loss_cls, loss_bbox, rpn_cls_loss, rpn_reg_loss = model.loss()
    loss_cls.persistable = True
@@ -92,16 +95,19 @@ def train():
        train_exe = fluid.ParallelExecutor(
            use_cuda=bool(cfg.use_gpu), loss_name=loss.name)
+    shuffle = True
+    if cfg.enable_ce:
+        shuffle = False
    if cfg.use_pyreader:
        train_reader = reader.train(
            batch_size=cfg.TRAIN.im_per_batch,
            total_batch_size=total_batch_size,
            padding_total=cfg.TRAIN.padding_minibatch,
-            shuffle=True)
+            shuffle=shuffle)
        py_reader = model.py_reader
        py_reader.decorate_paddle_reader(train_reader)
    else:
-        train_reader = reader.train(batch_size=total_batch_size, shuffle=True)
+        train_reader = reader.train(batch_size=total_batch_size, shuffle=shuffle)
        feeder = fluid.DataFeeder(place=place, feed_list=model.feeds())
    def save_model(postfix):
@@ -118,6 +124,8 @@ def train():
        try:
            start_time = time.time()
            prev_start_time = start_time
+            total_time = 0
+            last_loss = 0
            every_pass_loss = []
            for iter_id in range(cfg.max_iter):
                prev_start_time = start_time
@@ -131,9 +139,23 @@ def train():
                    iter_id, lr[0],
                    smoothed_loss.get_median_value(
                    ), start_time - prev_start_time))
+                end_time = time.time()
+                total_time += end_time - start_time
+                last_loss = np.mean(np.array(losses[0]))
                sys.stdout.flush()
                if (iter_id + 1) % cfg.TRAIN.snapshot_iter == 0:
                    save_model("model_iter{}".format(iter_id))
+            # only for ce
+            if cfg.enable_ce:
+                gpu_num = devices_num
+                epoch_idx = iter_id + 1
+                loss = last_loss
+                print("kpis\teach_pass_duration_card%s\t%s" %
+                        (gpu_num, total_time / epoch_idx))
+                print("kpis\ttrain_loss_card%s\t%s" %
+                        (gpu_num, loss))
        except fluid.core.EOFException:
            py_reader.reset()
        return np.mean(every_pass_loss)
@@ -142,6 +164,8 @@ def train():
        start_time = time.time()
        prev_start_time = start_time
        start = start_time
+        total_time = 0
+        last_loss = 0
        every_pass_loss = []
        smoothed_loss = SmoothedValue(cfg.log_window)
        for iter_id, data in enumerate(train_reader()):
@@ -154,6 +178,9 @@ def train():
            smoothed_loss.add_value(loss_v)
            lr = np.array(fluid.global_scope().find_var('learning_rate')
                          .get_tensor())
+            end_time = time.time()
+            total_time += end_time - start_time
+            last_loss = loss_v
            print("Iter {:d}, lr {:.6f}, loss {:.6f}, time {:.5f}".format(
                iter_id, lr[0],
                smoothed_loss.get_median_value(), start_time - prev_start_time))
@@ -162,6 +189,16 @@ def train():
                save_model("model_iter{}".format(iter_id))
            if (iter_id + 1) == cfg.max_iter:
                break
+        # only for ce
+        if cfg.enable_ce:
+            gpu_num = devices_num
+            epoch_idx = iter_id + 1
+            loss = last_loss
+            print("kpis\teach_pass_duration_card%s\t%s" %
+                    (gpu_num, total_time / epoch_idx))
+            print("kpis\ttrain_loss_card%s\t%s" %
+                    (gpu_num, loss))
        return np.mean(every_pass_loss)
    if cfg.use_pyreader:
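
Review note on the CE logging added to `train.py`: both training loops accumulate per-iteration wall time and print the average together with the last loss as `kpis` lines. The block below is a hedged, framework-free sketch of that pattern. `run_timed`, `format_kpis`, `step_fn` and `clock` are hypothetical names introduced for illustration, and the real loops run Paddle executors rather than a plain callable.

```python
import time


def run_timed(step_fn, num_iters, clock=time.time):
    """Run `step_fn` num_iters times; return (average seconds, last result)."""
    total_time = 0.0
    last_loss = None
    for _ in range(num_iters):
        start = clock()
        last_loss = step_fn()
        total_time += clock() - start
    # Guard against division by zero when no iteration ran.
    return total_time / max(num_iters, 1), last_loss


def format_kpis(gpu_num, avg_duration, loss):
    """Build the two tab-separated KPI lines in the format the diff prints."""
    return [
        "kpis\teach_pass_duration_card%s\t%s" % (gpu_num, avg_duration),
        "kpis\ttrain_loss_card%s\t%s" % (gpu_num, loss),
    ]


lines = format_kpis(4, 0.5, 1.25)
```

Injecting `clock` makes the timing logic checkable with a fake clock, which is the only reason it is a parameter here.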

--- a/fluid/PaddleCV/faster_rcnn/utility.py
+++ b/fluid/PaddleCV/faster_rcnn/utility.py
@@ -98,7 +98,7 @@ def parse_args():
    add_arg('pretrained_model', str,    'imagenet_resnet50_fusebn', "The init model path.")
    add_arg('dataset',          str,   'coco2017',  "coco2014, coco2017.")
    add_arg('class_num',        int,   81,          "Class number.")
-    add_arg('data_dir',         str,   'data/COCO17',        "The data root path.")
+    add_arg('data_dir',         str,   'dataset/coco',        "The data root path.")
    add_arg('use_pyreader',     bool,   True,           "Use pyreader.")
    add_arg('use_profile',         bool,   False,       "Whether use profiler.")
    add_arg('padding_minibatch',bool,   False,
@@ -127,8 +127,11 @@ def parse_args():
    add_arg('debug',            bool,   False,   "Debug mode")
    # SINGLE EVAL AND DRAW
    add_arg('draw_threshold',  float, 0.8,    "Confidence threshold to draw bbox.")
-    add_arg('image_path',       str,   'data/COCO17/val2017',  "The image path used to inference and visualize.")
+    add_arg('image_path',       str,   'dataset/coco/val2017',  "The image path used to inference and visualize.")
    add_arg('image_name',        str,    '',       "The single image used to inference and visualize.")
+    # ce
+    parser.add_argument(
+            '--enable_ce', action='store_true', help='If set, run the task with continuous evaluation logs.')
    # yapf: enable
    args = parser.parse_args()
    file_name = sys.argv[0]

--- a/fluid/PaddleCV/gan/c_gan/.run_ce.sh
+++ b/fluid/PaddleCV/gan/c_gan/.run_ce.sh
@@ -3,7 +3,7 @@
 # This file is only used for continuous evaluation.
 export FLAGS_cudnn_deterministic=True
 export ce_mode=1
-(CUDA_VISIBLE_DEVICES=6 python c_gan.py --batch_size=121 --epoch=1 --run_ce=True --use_gpu=True & \
+(CUDA_VISIBLE_DEVICES=2 python c_gan.py --batch_size=121 --epoch=1 --run_ce=True --use_gpu=True & \
-CUDA_VISIBLE_DEVICES=7 python dc_gan.py --batch_size=121 --epoch=1 --run_ce=True --use_gpu=True) | python _ce.py
+CUDA_VISIBLE_DEVICES=3 python dc_gan.py --batch_size=121 --epoch=1 --run_ce=True --use_gpu=True) | python _ce.py
--- a/fluid/PaddleCV/gan/c_gan/c_gan.py
+++ b/fluid/PaddleCV/gan/c_gan/c_gan.py
@@ -165,7 +165,8 @@ def train(args):
                          'conditions': conditions_data},
                    fetch_list={dg_loss})[0][0]
                losses[1].append(dg_loss_n)
-            t_time += (time.time() - s_time)
+            batch_time = time.time() - s_time
+            t_time += batch_time
@@ -180,8 +181,9 @@ def train(args):
                    fetch_list={g_img})[0]
                total_images = np.concatenate([real_image, generated_images])
                fig = plot(total_images)
-                msg = "Epoch ID={0}\n Batch ID={1}\n D-Loss={2}\n DG-Loss={3}\n gen={4}".format(
+                msg = "Epoch ID={0}\n Batch ID={1}\n D-Loss={2}\n DG-Loss={3}\n gen={4}\n " \
-                    pass_id, batch_id, d_loss_n, dg_loss_n, check(generated_images))
+                      "Batch_time_cost={5:.2f}".format(
+                    pass_id, batch_id, d_loss_n, dg_loss_n, check(generated_images), batch_time)
                print(msg)
                plt.title(msg)
                plt.savefig(

--- a/fluid/PaddleCV/gan/cycle_gan/train.py
+++ b/fluid/PaddleCV/gan/cycle_gan/train.py
@@ -187,10 +187,12 @@ def train(args):
                fetch_list=[d_A_trainer.d_loss_A],
                feed={"input_A": tensor_A,
                      "fake_pool_A": fake_pool_A})[0]
-            t_time += (time.time() - s_time)
+            batch_time = time.time() - s_time
-            print("epoch{}; batch{}; g_A_loss: {}; d_B_loss: {}; g_B_loss: {}; d_A_loss: {};".format(
+            t_time += batch_time
+            print("epoch{}; batch{}; g_A_loss: {}; d_B_loss: {}; g_B_loss: {}; d_A_loss: {}; "
+                  "Batch_time_cost: {:.2f}".format(
                epoch, batch_id, g_A_loss[0], d_B_loss[0], g_B_loss[0],
-                d_A_loss[0]))
+                d_A_loss[0], batch_time))
            losses[0].append(g_A_loss[0])
            losses[1].append(d_A_loss[0])
            sys.stdout.flush()

--- a/fluid/PaddleCV/image_classification/.run_ce.sh
+++ b/fluid/PaddleCV/image_classification/.run_ce.sh
@@ -7,6 +7,7 @@ cudaid=${object_detection_cudaid:=0}
 export CUDA_VISIBLE_DEVICES=$cudaid
 python train.py --batch_size=${BATCH_SIZE} --num_epochs=5 --enable_ce=True --lr_strategy=cosine_decay | python _ce.py
+BATCH_SIZE=224
 cudaid=${object_detection_cudaid_m:=0, 1, 2, 3}
 export CUDA_VISIBLE_DEVICES=$cudaid
 python train.py --batch_size=${BATCH_SIZE} --num_epochs=5 --enable_ce=True --lr_strategy=cosine_decay | python _ce.py
--- a/fluid/PaddleCV/image_classification/README.md
+++ b/fluid/PaddleCV/image_classification/README.md
@@ -6,6 +6,7 @@ Image classification, which is an important field of computer vision, is to clas
 - [Installation](#installation)
 - [Data preparation](#data-preparation)
 - [Training a model with flexible parameters](#training-a-model)
+- [Using Mixed-Precision Training](#using-mixed-precision-training)
 - [Finetuning](#finetuning)
 - [Evaluation](#evaluation)
 - [Inference](#inference)
@@ -112,6 +113,13 @@ The error rate curves of AlexNet, ResNet50 and SE-ResNeXt-50 are shown in the fi
 Training and validation Curves
 </p>
+## Using Mixed-Precision Training
+You may add `--fp16 1` to start training with mixed precision: the training process uses float16, and the output model ("master" parameters) is saved as float32. You may also need to pass `--scale_loss` to overcome accuracy issues; `--scale_loss 8.0` usually suffices.
+Note that `--fp16` currently cannot be used together with `--with_mem_opt`, so pass `--with_mem_opt 0` to disable the memory optimization pass.
 ## Finetuning
 Finetuning is to finetune model weights in a specific task by loading pretrained weights. After initializing ```path_to_pretrain_model```, one can finetune a model as:
@@ -196,10 +204,19 @@ Models are trained by starting with learning rate ```0.1``` and decaying it by `
 |model | top-1/top-5 accuracy(PIL)| top-1/top-5 accuracy(CV2) |
 |- |:-: |:-:|
 |[AlexNet](http://paddle-imagenet-models-name.bj.bcebos.com/AlexNet_pretrained.zip) | 56.71%/79.18% | 55.88%/78.65% |
-|[VGG11](http://paddle-imagenet-models-name.bj.bcebos.com/VGG11_pretained.zip) | 68.92%/88.66% | 68.61%/88.60% |
+|[VGG11](https://paddle-imagenet-models-name.bj.bcebos.com/VGG11_pretrained.zip) | 69.22%/89.09% | 69.01%/88.90% |
+|[VGG13](https://paddle-imagenet-models-name.bj.bcebos.com/VGG13_pretrained.zip) | 70.14%/89.48% | 69.83%/89.13% |
+|[VGG16](https://paddle-imagenet-models-name.bj.bcebos.com/VGG16_pretrained.zip) | 72.08%/90.63% | 71.65%/90.57% |
+|[VGG19](https://paddle-imagenet-models-name.bj.bcebos.com/VGG19_pretrained.zip) | 72.56%/90.83% | 72.32%/90.98% |
 |[MobileNetV1](http://paddle-imagenet-models-name.bj.bcebos.com/MobileNetV1_pretrained.zip) | 70.91%/89.54% | 70.51%/89.35% |
 |[ResNet50](http://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_pretrained.zip) | 76.35%/92.80% | 76.22%/92.92% |
 |[ResNet101](http://paddle-imagenet-models-name.bj.bcebos.com/ResNet101_pretrained.zip) | 77.49%/93.57% | 77.56%/93.64% |
+|[ResNet152](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet152_pretrained.zip) | 78.12%/93.93% | 77.92%/93.87% |
+|[SE_ResNeXt50_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/SE_ResNext50_32x4d_pretrained.zip) | 78.50%/94.01% | 78.44%/93.96% |
+|[SE_ResNeXt101_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/SE_ResNeXt101_32x4d_pretrained.zip) | 79.26%/94.22% | 79.12%/94.20% |
 - Released models: parameter names are not specified

--- a/fluid/PaddleCV/image_classification/README_cn.md
+++ b/fluid/PaddleCV/image_classification/README_cn.md
@@ -109,6 +109,11 @@ End pass 9, train_loss 3.3745200634, train_acc1 0.303871691227, train_acc5 0.545
 Error rate curves on the training and validation sets
 </p>
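+## Mixed-precision training
+Mixed-precision training can be enabled with `--fp16 1`; training then uses float16 data, and the output model parameters (the "master" parameters) are saved as float32. You may also need to pass `--scale_loss` to address accuracy issues in fp16 training; `--scale_loss 8.0` is usually sufficient.
+Note that mixed-precision training currently cannot be used together with memory optimization, so pass `--with_mem_opt 0` to disable memory optimization.
 ## Finetuning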
@@ -194,10 +199,16 @@ Models包括两种模型：带有参数名字的模型，和不带有参数名
 |model | top-1/top-5 accuracy(PIL)| top-1/top-5 accuracy(CV2) |
 |- |:-: |:-:|
 |[AlexNet](http://paddle-imagenet-models-name.bj.bcebos.com/AlexNet_pretrained.zip) | 56.71%/79.18% | 55.88%/78.65% |
-|[VGG11](http://paddle-imagenet-models-name.bj.bcebos.com/VGG11_pretained.zip) | 68.92%/88.66% | 68.61%/88.60% |
+|[VGG11](https://paddle-imagenet-models-name.bj.bcebos.com/VGG11_pretrained.zip) | 69.22%/89.09% | 69.01%/88.90% |
+|[VGG13](https://paddle-imagenet-models-name.bj.bcebos.com/VGG13_pretrained.zip) | 70.14%/89.48% | 69.83%/89.13% |
+|[VGG16](https://paddle-imagenet-models-name.bj.bcebos.com/VGG16_pretrained.zip) | 72.08%/90.63% | 71.65%/90.57% |
+|[VGG19](https://paddle-imagenet-models-name.bj.bcebos.com/VGG19_pretrained.zip) | 72.56%/90.83% | 72.32%/90.98% |
 |[MobileNetV1](http://paddle-imagenet-models-name.bj.bcebos.com/MobileNetV1_pretrained.zip) | 70.91%/89.54% | 70.51%/89.35% |
 |[ResNet50](http://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_pretrained.zip) | 76.35%/92.80% | 76.22%/92.92% |
 |[ResNet101](http://paddle-imagenet-models-name.bj.bcebos.com/ResNet101_pretrained.zip) | 77.49%/93.57% | 77.56%/93.64% |
+|[ResNet152](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet152_pretrained.zip) | 78.12%/93.93% | 77.92%/93.87% |
+|[SE_ResNeXt50_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/SE_ResNext50_32x4d_pretrained.zip) | 78.50%/94.01% | 78.44%/93.96% |
+|[SE_ResNeXt101_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/SE_ResNeXt101_32x4d_pretrained.zip) | 79.26%/94.22% | 79.12%/94.20% |
 - Released models: parameter names are not specified

--- a/fluid/PaddleCV/image_classification/dist_train/README.md
+++ b/fluid/PaddleCV/image_classification/dist_train/README.md
@@ -7,13 +7,15 @@ large-scaled distributed training with two distributed mode: parameter server mo
 Before getting started, please make sure you have gone through the imagenet [Data Preparation](../README.md#data-preparation).
-1. The entrypoint file is `dist_train.py`, some important flags are as follows:
+1. The entrypoint file is `dist_train.py`. Its command-line arguments are almost the same as those of the original `train.py`, plus the following arguments specific to distributed training:
-    - `model`, the model to run with, default is the fine tune model `DistResnet`.
-    - `batch_size`, the batch_size per device.
    - `update_method`, specify the update method, can choose from local, pserver or nccl2.
-    - `device`, use CPU or GPU device.
+    - `multi_batch_repeat`, set this greater than 1 to merge batches before pushing gradients to pservers.
-    - `gpus`, the GPU device count that the process used.
+    - `start_test_pass`, when to start running tests.
+    - `num_threads`, how many threads will be used for ParallelExecutor.
+    - `split_var`, in pserver mode, whether to split one parameter to several pservers, default True.
+    - `async_mode`, do async training, default False.
+    - `reduce_strategy`, choose from "reduce", "allreduce".
    you can check out more details of the flags by `python dist_train.py --help`.
@@ -21,66 +23,27 @@ Before getting started, please make sure you have go throught the imagenet [Data
    We use environment variables to distinguish the training roles of a distributed training job.
-    - `PADDLE_TRAINING_ROLE`, the current training role, should be in [PSERVER, TRAINER].
+    - General envs:
-    - `PADDLE_TRAINERS`, the trainer count of a job.
+        - `PADDLE_TRAINER_ID`, the unique trainer ID of a job, in the range [0, PADDLE_TRAINERS_NUM).
-    - `PADDLE_CURRENT_IP`, the current instance IP.
+        - `PADDLE_TRAINERS_NUM`, the trainer count of a distributed job.
-    - `PADDLE_PSERVER_IPS`, the parameter server IP list, separated by ","  only be used with update_method is pserver.
+        - `PADDLE_CURRENT_ENDPOINT`, current process endpoint.
-    - `PADDLE_TRAINER_ID`, the unique trainer ID of a job, the ranging is [0, PADDLE_TRAINERS).
+    - Pserver mode:
-    - `PADDLE_PSERVER_PORT`, the port of the parameter pserver listened on.
+        - `PADDLE_TRAINING_ROLE`, the current training role, should be in [PSERVER, TRAINER].
-    - `PADDLE_TRAINER_IPS`, the trainer IP list, separated by ",", only be used with upadte_method is nccl2.
+        - `PADDLE_PSERVER_ENDPOINTS`, the parameter server endpoint list, separated by ",".
+    - NCCL2 mode:
-### Parameter Server Mode
+        - `PADDLE_TRAINER_ENDPOINTS`, endpoint list for each worker, separated by ",".
-In this example, we launched 4 parameter server instances and 4 trainer instances in the cluster:
+### Try Out Different Distributed Training Modes
-1. launch parameter server process
+You can test if distributed training works on a single node before deploying to the "real" cluster.
-    ``` bash
+***NOTE: for best performance, we recommend using multi-process mode (see No. 4) together with fp16.***
-    PADDLE_TRAINING_ROLE=PSERVER \
-    PADDLE_TRAINERS=4 \
+1. simply run `python dist_train.py` to start local training with default configurations.
-    PADDLE_PSERVER_IPS=192.168.0.100,192.168.0.101,192.168.0.102,192.168.0.103 \
+2. for pserver mode, run `bash run_ps_mode.sh` to start 2 pservers and 2 trainers, these 2 trainers
-    PADDLE_CURRENT_IP=192.168.0.100 \
+   will use GPU 0 and 1 to simulate 2 workers.
-    PADDLE_PSERVER_PORT=7164 \
+3. for nccl2 mode, run `bash run_nccl2_mode.sh` to start 2 workers.
-    python dist_train.py \
+4. for local/distributed multi-process mode, run `bash run_mp_mode.sh` (this test uses 4 GPUs).
-        --model=DistResnet \
-        --batch_size=32 \
-        --update_method=pserver \
-        --device=CPU \
-        --data_dir=../data/ILSVRC2012
-    ```
-1. launch trainer process
-    ``` bash
-    PADDLE_TRAINING_ROLE=TRAINER \
-    PADDLE_TRAINERS=4 \
-    PADDLE_PSERVER_IPS=192.168.0.100,192.168.0.101,192.168.0.102,192.168.0.103 \
-    PADDLE_TRAINER_ID=0 \
-    PADDLE_PSERVER_PORT=7164 \
-    python dist_train.py \
-        --model=DistResnet \
-        --batch_size=32 \
-        --update_method=pserver \
-        --device=GPU \
-        --data_dir=../data/ILSVRC2012
-    ```
-### NCCL2 Collective Mode
-1. launch trainer process
-    ``` bash
-    PADDLE_TRAINING_ROLE=TRAINER \
-    PADDLE_TRAINERS=4 \
-    PADDLE_TRAINER_IPS=192.168.0.100,192.168.0.101,192.168.0.102,192.168.0.103 \
-    PADDLE_TRAINER_ID=0 \
-    python dist_train.py \
-        --model=DistResnet \
-        --batch_size=32 \
-        --update_method=nccl2 \
-        --device=GPU \
-        --data_dir=../data/ILSVRC2012
-    ```
 ### Visualize the Training Process
@@ -88,16 +51,10 @@ It's easy to draw the learning curve accroding to the training logs, for example
 the logs of ResNet50 are as follows:
 ``` text
-Pass 0, batch 0, loss 7.0336914, accucacys: [0.0, 0.00390625]
+Pass 0, batch 30, loss 7.569439, acc1: 0.0125, acc5: 0.0125, avg batch time 0.1720
-Pass 0, batch 1, loss 7.094781, accucacys: [0.0, 0.0]
+Pass 0, batch 60, loss 7.027379, acc1: 0.0, acc5: 0.0, avg batch time 0.1551
-Pass 0, batch 2, loss 7.007068, accucacys: [0.0, 0.0078125]
+Pass 0, batch 90, loss 6.819984, acc1: 0.0, acc5: 0.0125, avg batch time 0.1492
-Pass 0, batch 3, loss 7.1056547, accucacys: [0.00390625, 0.00390625]
+Pass 0, batch 120, loss 6.9076853, acc1: 0.0, acc5: 0.0125, avg batch time 0.1464
-Pass 0, batch 4, loss 7.133543, accucacys: [0.0, 0.0078125]
-Pass 0, batch 5, loss 7.3055463, accucacys: [0.0078125, 0.01171875]
-Pass 0, batch 6, loss 7.341838, accucacys: [0.0078125, 0.01171875]
-Pass 0, batch 7, loss 7.290557, accucacys: [0.0, 0.0]
-Pass 0, batch 8, loss 7.264951, accucacys: [0.0, 0.00390625]
-Pass 0, batch 9, loss 7.43522, accucacys: [0.00390625, 0.00390625]
 ```
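To draw a learning curve from logs in the new format shown above, the lines first need to be parsed. The following is a minimal sketch; the regular expression is derived only from the four sample lines, and the names `LOG_PATTERN` and `parse_log_line` are hypothetical helpers, not part of the repository.

```python
import re

# Pattern inferred from the sample log lines; other log variants may need changes.
LOG_PATTERN = re.compile(
    r"Pass (?P<pass_id>\d+), batch (?P<batch>\d+), loss (?P<loss>[\d.]+), "
    r"acc1: (?P<acc1>[\d.]+), acc5: (?P<acc5>[\d.]+), "
    r"avg batch time (?P<avg_batch_time>[\d.]+)")


def parse_log_line(line):
    """Return a dict of numeric fields, or None if the line does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "pass_id": int(fields["pass_id"]),
        "batch": int(fields["batch"]),
        "loss": float(fields["loss"]),
        "acc1": float(fields["acc1"]),
        "acc5": float(fields["acc5"]),
        "avg_batch_time": float(fields["avg_batch_time"]),
    }


sample = "Pass 0, batch 30, loss 7.569439, acc1: 0.0125, acc5: 0.0125, avg batch time 0.1720"
record = parse_log_line(sample)
print(record["batch"], record["loss"], record["acc1"])
```

The resulting records can be collected per trainer log file and passed to any plotting library.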
 The below figure shows top 1 train accuracy for local training with 8 GPUs and distributed training

--- a/fluid/PaddleCV/image_classification/dist_train/batch_merge.py
+++ b/fluid/PaddleCV/image_classification/dist_train/batch_merge.py
+import numpy as np
+import paddle.fluid as fluid
+def copyback_repeat_bn_params(main_prog):
+    repeat_vars = set()
+    for op in main_prog.global_block().ops:
+        if op.type == "batch_norm":
+            repeat_vars.add(op.input("Mean")[0])
+            repeat_vars.add(op.input("Variance")[0])
+    for vname in repeat_vars:
+        real_var = fluid.global_scope().find_var("%s.repeat.0" % vname).get_tensor()
+        orig_var = fluid.global_scope().find_var(vname).get_tensor()
+        orig_var.set(np.array(real_var), fluid.CUDAPlace(0)) # test on GPU0
+def append_bn_repeat_init_op(main_prog, startup_prog, num_repeats):
+    repeat_vars = set()
+    for op in main_prog.global_block().ops:
+        if op.type == "batch_norm":
+            repeat_vars.add(op.input("Mean")[0])
+            repeat_vars.add(op.input("Variance")[0])
+    for i in range(num_repeats):
+        for op in startup_prog.global_block().ops:
+            if op.type == "fill_constant":
+                for oname in op.output_arg_names:
+                    if oname in repeat_vars:
+                        var = startup_prog.global_block().var(oname)
+                        repeat_var_name = "%s.repeat.%d" % (oname, i)
+                        repeat_var = startup_prog.global_block().create_var(
+                            name=repeat_var_name,
+                            type=var.type,
+                            dtype=var.dtype,
+                            shape=var.shape,
+                            persistable=var.persistable
+                        )
+                        main_prog.global_block()._clone_variable(repeat_var)
+                        startup_prog.global_block().append_op(
+                            type="fill_constant",
+                            inputs={},
+                            outputs={"Out": repeat_var},
+                            attrs=op.all_attrs()
+                        )
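The helpers above support the `multi_batch_repeat` option, which merges several batches before gradients are pushed. As a rough illustration of why merging is sound for a loss averaged over samples, the NumPy sketch below checks that averaging the gradients of equal-sized sub-batches reproduces the gradient of the merged batch. The linear least-squares model and all names here are assumptions for illustration; the sketch deliberately ignores batch normalization statistics, which is the part the code above handles with per-repeat `.repeat.N` variables.

```python
import numpy as np


def mse_grad(w, x, y):
    """Gradient of 0.5 * mean((x @ w - y) ** 2) with respect to w."""
    residual = x @ w - y
    return x.T @ residual / len(y)


rng = np.random.RandomState(0)
x = rng.randn(64, 3)
y = rng.randn(64)
w = rng.randn(3)

num_repeats = 4  # plays the role of multi_batch_repeat
sub_batches = zip(np.split(x, num_repeats), np.split(y, num_repeats))

# Accumulate per-sub-batch gradients, then average once before the "push".
merged_grad = sum(mse_grad(w, xs, ys) for xs, ys in sub_batches) / num_repeats
full_grad = mse_grad(w, x, y)
print(np.allclose(merged_grad, full_grad))
```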
--- a/fluid/PaddleCV/image_classification/dist_train/dist_train.py
+++ b/fluid/PaddleCV/image_classification/dist_train/dist_train.py
--- a/fluid/PaddleCV/image_classification/dist_train/dist_utils.py
+++ b/fluid/PaddleCV/image_classification/dist_train/dist_utils.py
+import os
+import paddle.fluid as fluid
+def nccl2_prepare(args, startup_prog):
+    config = fluid.DistributeTranspilerConfig()
+    config.mode = "nccl2"
+    t = fluid.DistributeTranspiler(config=config)
+    envs = args.dist_env
+    t.transpile(envs["trainer_id"],
+        trainers=','.join(envs["trainer_endpoints"]),
+        current_endpoint=envs["current_endpoint"],
+        startup_program=startup_prog)
+def pserver_prepare(args, train_prog, startup_prog):
+    config = fluid.DistributeTranspilerConfig()
+    config.slice_var_up = args.split_var
+    t = fluid.DistributeTranspiler(config=config)
+    envs = args.dist_env
+    training_role = envs["training_role"]
+    t.transpile(
+        envs["trainer_id"],
+        program=train_prog,
+        pservers=envs["pserver_endpoints"],
+        trainers=envs["num_trainers"],
+        sync_mode=not args.async_mode,
+        startup_program=startup_prog)
+    if training_role == "PSERVER":
+        pserver_program = t.get_pserver_program(envs["current_endpoint"])
+        pserver_startup_program = t.get_startup_program(
+            envs["current_endpoint"], pserver_program, startup_program=startup_prog)
+        return pserver_program, pserver_startup_program
+    elif training_role == "TRAINER":
+        train_program = t.get_trainer_program()
+        return train_program, startup_prog
+    else:
+        raise ValueError(
+            'PADDLE_TRAINING_ROLE environment variable must be either TRAINER or PSERVER'
+        )
--- a/fluid/PaddleCV/image_classification/dist_train/env.py
+++ b/fluid/PaddleCV/image_classification/dist_train/env.py
+import os
+def dist_env():
+    """
+    Return a dict of all variables that distributed training may use.
+    NOTE: you may rewrite this function to suit your cluster environments.
+    """
+    trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
+    num_trainers = 1
+    training_role = os.getenv("PADDLE_TRAINING_ROLE", "TRAINER")
+    assert(training_role == "PSERVER" or training_role == "TRAINER")
+    # - PADDLE_TRAINER_ENDPOINTS means nccl2 mode.
+    # - PADDLE_PSERVER_ENDPOINTS means pserver mode.
+    # - PADDLE_CURRENT_ENDPOINT means current process endpoint.
+    trainer_endpoints = os.getenv("PADDLE_TRAINER_ENDPOINTS")
+    pserver_endpoints = os.getenv("PADDLE_PSERVER_ENDPOINTS")
+    current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT")
+    if trainer_endpoints:
+        trainer_endpoints = trainer_endpoints.split(",")
+        num_trainers = len(trainer_endpoints)
+    elif pserver_endpoints:
+        num_trainers = int(os.getenv("PADDLE_TRAINERS_NUM"))
+    return {
+        "trainer_id": trainer_id,
+        "num_trainers": num_trainers,
+        "current_endpoint": current_endpoint,
+        "training_role": training_role,
+        "pserver_endpoints": pserver_endpoints,
+        "trainer_endpoints": trainer_endpoints
+    }
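To see how `dist_env()` derives the trainer count in each mode, here is a self-contained variant that reads from an explicit mapping instead of `os.environ`, which makes it easy to try out. The function name `parse_dist_env` and the mapping argument are assumptions for illustration; the branching mirrors the code above.

```python
def parse_dist_env(environ):
    """Mirror of dist_env() that reads from a plain dict (hypothetical helper)."""
    training_role = environ.get("PADDLE_TRAINING_ROLE", "TRAINER")
    if training_role not in ("PSERVER", "TRAINER"):
        raise ValueError("PADDLE_TRAINING_ROLE must be PSERVER or TRAINER")

    trainer_endpoints = environ.get("PADDLE_TRAINER_ENDPOINTS")
    pserver_endpoints = environ.get("PADDLE_PSERVER_ENDPOINTS")
    num_trainers = 1
    if trainer_endpoints:
        # nccl2 mode: the trainer count is the number of endpoints.
        trainer_endpoints = trainer_endpoints.split(",")
        num_trainers = len(trainer_endpoints)
    elif pserver_endpoints:
        # pserver mode: the trainer count must be given explicitly.
        num_trainers = int(environ["PADDLE_TRAINERS_NUM"])

    return {
        "trainer_id": int(environ.get("PADDLE_TRAINER_ID", "0")),
        "num_trainers": num_trainers,
        "current_endpoint": environ.get("PADDLE_CURRENT_ENDPOINT"),
        "training_role": training_role,
        "pserver_endpoints": pserver_endpoints,
        "trainer_endpoints": trainer_endpoints,
    }


nccl2_env = parse_dist_env({
    "PADDLE_TRAINER_ENDPOINTS": "127.0.0.1:7160,127.0.0.1:7161",
    "PADDLE_CURRENT_ENDPOINT": "127.0.0.1:7161",
    "PADDLE_TRAINER_ID": "1",
})
ps_env = parse_dist_env({
    "PADDLE_TRAINING_ROLE": "PSERVER",
    "PADDLE_PSERVER_ENDPOINTS": "127.0.0.1:7160,127.0.0.1:7161",
    "PADDLE_TRAINERS_NUM": "2",
    "PADDLE_CURRENT_ENDPOINT": "127.0.0.1:7160",
})
print(nccl2_env["num_trainers"], ps_env["num_trainers"])
```

Note that in pserver mode `pserver_endpoints` stays a comma-separated string, while in nccl2 mode `trainer_endpoints` becomes a list, matching the original function.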
--- a/fluid/PaddleCV/image_classification/dist_train/run_mp_mode.sh
+++ b/fluid/PaddleCV/image_classification/dist_train/run_mp_mode.sh
+#!/bin/bash
+# Test using 4 GPUs
+export CUDA_VISIBLE_DEVICES="0,1,2,3"
+export MODEL="DistResNet"
+export PADDLE_TRAINER_ENDPOINTS="127.0.0.1:7160,127.0.0.1:7161,127.0.0.1:7162,127.0.0.1:7163"
+# In nccl2 mode, PADDLE_TRAINERS_NUM is used only by the reader
+export PADDLE_TRAINERS_NUM="4"
+mkdir -p logs
+for i in {0..3}
+do
+PADDLE_TRAINING_ROLE="TRAINER" \
+PADDLE_CURRENT_ENDPOINT="127.0.0.1:716${i}" \
+PADDLE_TRAINER_ID="${i}" \
+FLAGS_selected_gpus="${i}" \
+python dist_train.py --model $MODEL --update_method nccl2 --batch_size 32 --fp16 1 --scale_loss 8 &> logs/tr$i.log &
+done
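The shell loop above sets one group of environment variables per worker. The same layout can be generated programmatically; the sketch below only builds the per-worker environment dictionaries and does not launch anything. The function name `build_worker_envs` and its parameters are hypothetical, while the variable names and values follow the script.

```python
def build_worker_envs(num_workers, host="127.0.0.1", base_port=7160):
    """Build one environment dict per local nccl2 worker (hypothetical helper)."""
    endpoints = ["%s:%d" % (host, base_port + i) for i in range(num_workers)]
    envs = []
    for worker_id, endpoint in enumerate(endpoints):
        envs.append({
            "PADDLE_TRAINING_ROLE": "TRAINER",
            "PADDLE_TRAINER_ENDPOINTS": ",".join(endpoints),
            "PADDLE_TRAINERS_NUM": str(num_workers),
            "PADDLE_CURRENT_ENDPOINT": endpoint,
            "PADDLE_TRAINER_ID": str(worker_id),
            # One GPU per process, as in the script above.
            "FLAGS_selected_gpus": str(worker_id),
        })
    return envs


worker_envs = build_worker_envs(4)
for env in worker_envs:
    print(env["PADDLE_TRAINER_ID"], env["PADDLE_CURRENT_ENDPOINT"])
```

Each dictionary could then be merged with a copy of `os.environ` and passed as the `env` argument of `subprocess.Popen` when starting `dist_train.py`.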
--- a/fluid/PaddleCV/image_classification/dist_train/run_nccl2_mode.sh
+++ b/fluid/PaddleCV/image_classification/dist_train/run_nccl2_mode.sh
+#!/bin/bash
+export MODEL="DistResNet"
+export PADDLE_TRAINER_ENDPOINTS="127.0.0.1:7160,127.0.0.1:7161"
+# In nccl2 mode, PADDLE_TRAINERS_NUM is used only by the reader
+export PADDLE_TRAINERS_NUM="2"
+mkdir -p logs
+# NOTE: set NCCL_P2P_DISABLE so that nccl2 distributed training can run on a single node.
+PADDLE_TRAINING_ROLE="TRAINER" \
+PADDLE_CURRENT_ENDPOINT="127.0.0.1:7160" \
+PADDLE_TRAINER_ID="0" \
+CUDA_VISIBLE_DEVICES="0" \
+NCCL_P2P_DISABLE="1" \
+python dist_train.py --model $MODEL --update_method nccl2 --batch_size 32 &> logs/tr0.log &
+PADDLE_TRAINING_ROLE="TRAINER" \
+PADDLE_CURRENT_ENDPOINT="127.0.0.1:7161" \
+PADDLE_TRAINER_ID="1" \
+CUDA_VISIBLE_DEVICES="1" \
+NCCL_P2P_DISABLE="1" \
+python dist_train.py --model $MODEL --update_method nccl2 --batch_size 32 &> logs/tr1.log &
--- a/fluid/PaddleCV/image_classification/dist_train/run_ps_mode.sh
+++ b/fluid/PaddleCV/image_classification/dist_train/run_ps_mode.sh
+#!/bin/bash
+export MODEL="DistResNet"
+export PADDLE_PSERVER_ENDPOINTS="127.0.0.1:7160,127.0.0.1:7161"
+export PADDLE_TRAINERS_NUM="2"
+mkdir -p logs
+PADDLE_TRAINING_ROLE="PSERVER" \
+PADDLE_CURRENT_ENDPOINT="127.0.0.1:7160" \
+python dist_train.py --model $MODEL --update_method pserver --batch_size 32 &> logs/ps0.log &
+PADDLE_TRAINING_ROLE="PSERVER" \
+PADDLE_CURRENT_ENDPOINT="127.0.0.1:7161" \
+python dist_train.py --model $MODEL --update_method pserver --batch_size 32 &> logs/ps1.log &
+PADDLE_TRAINING_ROLE="TRAINER" \
+PADDLE_CURRENT_ENDPOINT="127.0.0.1:7160" \
+PADDLE_TRAINER_ID="0" \
+CUDA_VISIBLE_DEVICES="0" \
+python dist_train.py --model $MODEL --update_method pserver --batch_size 32 &> logs/tr0.log &
+PADDLE_TRAINING_ROLE="TRAINER" \
+PADDLE_CURRENT_ENDPOINT="127.0.0.1:7161" \
+PADDLE_TRAINER_ID="1" \
+CUDA_VISIBLE_DEVICES="1" \
+python dist_train.py --model $MODEL --update_method pserver --batch_size 32 &> logs/tr1.log &
--- a/fluid/PaddleCV/image_classification/eval.py
+++ b/fluid/PaddleCV/image_classification/eval.py
@@ -7,12 +7,13 @@ import time
 import sys
 import paddle
 import paddle.fluid as fluid
-import models
+#import models
+import models_name as models
 #import reader_cv2 as reader
 import reader as reader
 import argparse
 import functools
-from models.learning_rate import cosine_decay
+from utils.learning_rate import cosine_decay
 from utility import add_arguments, print_arguments
 import math
@@ -48,7 +49,7 @@ def eval(args):
    # model definition
    model = models.__dict__[model_name]()
-    if model_name is "GoogleNet":
+    if model_name == "GoogleNet":
        out0, out1, out2 = model.net(input=image, class_dim=class_dim)
        cost0 = fluid.layers.cross_entropy(input=out0, label=label)
        cost1 = fluid.layers.cross_entropy(input=out1, label=label)
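The hunk above replaces `is` with `==` when comparing the model name. A short sketch of the difference: `==` compares string contents, whereas `is` compares object identity, which for strings depends on interpreter interning and is not guaranteed. The variable names are illustrative only.

```python
literal = "GoogleNet"
# A string assembled at runtime, as a command-line argument would be.
built = "".join(["Google", "Net"])

same_value = (built == literal)   # compares contents
same_object = (built is literal)  # compares identity; implementation-dependent

print("equal by value:", same_value)
print("same object:", same_object)
```

Only the value comparison is reliable for selecting the GoogleNet branch.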
@@ -70,8 +71,10 @@ def eval(args):
    test_program = fluid.default_main_program().clone(for_test=True)
+    fetch_list = [avg_cost.name, acc_top1.name, acc_top5.name]
    if with_memory_optimization:
-        fluid.memory_optimize(fluid.default_main_program())
+        fluid.memory_optimize(
+            fluid.default_main_program(), skip_opt_set=set(fetch_list))
    place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace()
    exe = fluid.Executor(place)
@@ -84,11 +87,9 @@ def eval(args):
        fluid.io.load_vars(exe, pretrained_model, predicate=if_exist)
-    val_reader = paddle.batch(reader.val(""), batch_size=args.batch_size)
+    val_reader = paddle.batch(reader.val(), batch_size=args.batch_size)
    feeder = fluid.DataFeeder(place=place, feed_list=[image, label])
-    fetch_list = [avg_cost.name, acc_top1.name, acc_top5.name]
    test_info = [[], [], []]
    cnt = 0
    for batch_id, data in enumerate(val_reader()):

--- a/fluid/PaddleCV/image_classification/infer.py
+++ b/fluid/PaddleCV/image_classification/infer.py
@@ -11,7 +11,6 @@ import models
 import reader
 import argparse
 import functools
-from models.learning_rate import cosine_decay
 from utility import add_arguments, print_arguments
 import math
@@ -44,7 +43,6 @@ def infer(args):
    # model definition
    model = models.__dict__[model_name]()
    if model_name == "GoogleNet":
        out, _, _ = model.net(input=image, class_dim=class_dim)
    else:
@@ -52,8 +50,10 @@ def infer(args):
    test_program = fluid.default_main_program().clone(for_test=True)
+    fetch_list = [out.name]
    if with_memory_optimization:
-        fluid.memory_optimize(fluid.default_main_program())
+        fluid.memory_optimize(
+            fluid.default_main_program(), skip_opt_set=set(fetch_list))
    place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace()
    exe = fluid.Executor(place)
@@ -70,8 +70,6 @@ def infer(args):
    test_reader = paddle.batch(reader.test(), batch_size=test_batch_size)
    feeder = fluid.DataFeeder(place=place, feed_list=[image])
-    fetch_list = [out.name]
    TOPK = 1
    for batch_id, data in enumerate(test_reader()):
        result = exe.run(test_program,

--- a/fluid/PaddleCV/image_classification/models/alexnet.py
+++ b/fluid/PaddleCV/image_classification/models/alexnet.py
@@ -142,7 +142,6 @@ class AlexNet():
        out = fluid.layers.fc(
            input=fc7,
            size=class_dim,
-            act='softmax',
            bias_attr=fluid.param_attr.ParamAttr(
                initializer=fluid.initializer.Uniform(-stdv, stdv)),
            param_attr=fluid.param_attr.ParamAttr(

--- a/fluid/PaddleCV/image_classification/models/dpn.py
+++ b/fluid/PaddleCV/image_classification/models/dpn.py
@@ -94,7 +94,6 @@ class DPN(object):
            initializer=fluid.initializer.Uniform(-stdv, stdv))
        fc6 = fluid.layers.fc(input=pool5,
                              size=class_dim,
-                              act='softmax',
                              param_attr=param_attr)
        return fc6

--- a/fluid/PaddleCV/image_classification/models/inception_v4.py
+++ b/fluid/PaddleCV/image_classification/models/inception_v4.py
@@ -47,7 +47,6 @@ class InceptionV4():
        out = fluid.layers.fc(
            input=drop,
            size=class_dim,
-            act='softmax',
            param_attr=fluid.param_attr.ParamAttr(
                initializer=fluid.initializer.Uniform(-stdv, stdv)))
        return out

--- a/fluid/PaddleCV/image_classification/models/mobilenet.py
+++ b/fluid/PaddleCV/image_classification/models/mobilenet.py
@@ -120,7 +120,6 @@ class MobileNet():
        output = fluid.layers.fc(input=input,
                                 size=class_dim,
-                                 act='softmax',
                                 param_attr=ParamAttr(initializer=MSRA()))
        return output

--- a/fluid/PaddleCV/image_classification/models/mobilenet_v2.py
+++ b/fluid/PaddleCV/image_classification/models/mobilenet_v2.py
@@ -73,7 +73,6 @@ class MobileNetV2():
        output = fluid.layers.fc(input=input,
                                 size=class_dim,
-                                 act='softmax',
                                 param_attr=ParamAttr(initializer=MSRA()))
        return output

--- a/fluid/PaddleCV/image_classification/models/resnet.py
+++ b/fluid/PaddleCV/image_classification/models/resnet.py
@@ -60,7 +60,6 @@ class ResNet():
        stdv = 1.0 / math.sqrt(pool.shape[1] * 1.0)
        out = fluid.layers.fc(input=pool,
                              size=class_dim,
-                              act='softmax',
                              param_attr=fluid.param_attr.ParamAttr(
                                  initializer=fluid.initializer.Uniform(-stdv,
                                                                        stdv)))

--- a/fluid/PaddleCV/image_classification/models/resnet_dist.py
+++ b/fluid/PaddleCV/image_classification/models/resnet_dist.py
@@ -14,8 +14,9 @@ train_parameters = {
    "learning_strategy": {
        "name": "piecewise_decay",
        "batch_size": 256,
-        "epochs": [30, 60, 90],
+        "epochs": [30, 60, 80],
-        "steps": [0.1, 0.01, 0.001, 0.0001]
+        "steps": [0.1, 0.01, 0.001, 0.0001],
+        "warmup_passes": 5
    }
 }
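The configuration above combines piecewise decay with a `warmup_passes` setting. How the warmup is applied is not shown in this diff, so the sketch below assumes a linear ramp up to the first step value during the warmup passes; the function `lr_at_pass` and that ramp shape are assumptions, while the boundaries and values are copied from the configuration.

```python
import bisect


def lr_at_pass(pass_id,
               epochs=(30, 60, 80),
               steps=(0.1, 0.01, 0.001, 0.0001),
               warmup_passes=5):
    """Learning rate for a given pass, assuming linear warmup (hypothetical)."""
    if pass_id < warmup_passes:
        # Assumed: ramp linearly so the last warmup pass reaches steps[0].
        return steps[0] * (pass_id + 1) / warmup_passes
    # Piecewise decay: pick the value for the interval containing pass_id.
    return steps[bisect.bisect_right(epochs, pass_id)]


for p in (0, 4, 10, 30, 60, 80):
    print(p, lr_at_pass(p))
```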
@@ -62,7 +63,6 @@ class DistResNet():
        stdv = 1.0 / math.sqrt(pool.shape[1] * 1.0)
        out = fluid.layers.fc(input=pool,
                              size=class_dim,
-                              act='softmax',
                              param_attr=fluid.param_attr.ParamAttr(
                                  initializer=fluid.initializer.Uniform(-stdv,
                                                                        stdv),
@@ -119,3 +119,4 @@ class DistResNet():
        short = self.shortcut(input, num_filters * 4, stride)
        return fluid.layers.elementwise_add(x=short, y=conv2, act='relu')
--- a/fluid/PaddleCV/image_classification/models/se_resnext.py
+++ b/fluid/PaddleCV/image_classification/models/se_resnext.py
@@ -110,7 +110,6 @@ class SE_ResNeXt():
        stdv = 1.0 / math.sqrt(drop.shape[1] * 1.0)
        out = fluid.layers.fc(input=drop,
                              size=class_dim,
-                              act='softmax',
                              param_attr=fluid.param_attr.ParamAttr(
                                  initializer=fluid.initializer.Uniform(-stdv,
                                                                        stdv)))

--- a/fluid/PaddleCV/image_classification/models/shufflenet_v2.py
+++ b/fluid/PaddleCV/image_classification/models/shufflenet_v2.py
@@ -93,7 +93,6 @@ class ShuffleNetV2():
        output = fluid.layers.fc(input=pool_last,
                                 size=class_dim,
-                                 act='softmax',
                                 param_attr=ParamAttr(initializer=MSRA()))
        return output

--- a/fluid/PaddleCV/image_classification/models/vgg.py
+++ b/fluid/PaddleCV/image_classification/models/vgg.py
@@ -64,7 +64,6 @@ class VGGNet():
        out = fluid.layers.fc(
            input=fc2,
            size=class_dim,
-            act='softmax',
            param_attr=fluid.param_attr.ParamAttr(
                initializer=fluid.initializer.Normal(scale=0.005)),
            bias_attr=fluid.param_attr.ParamAttr(

--- a/fluid/PaddleCV/image_classification/models_name/alexnet.py
+++ b/fluid/PaddleCV/image_classification/models_name/alexnet.py
@@ -159,7 +159,6 @@ class AlexNet():
        out = fluid.layers.fc(
            input=fc7,
            size=class_dim,
-            act='softmax',
            bias_attr=fluid.param_attr.ParamAttr(
                initializer=fluid.initializer.Uniform(-stdv, stdv),
                name=layer_name[7] + "_offset"),

--- a/fluid/PaddleCV/image_classification/models_name/dpn.py
+++ b/fluid/PaddleCV/image_classification/models_name/dpn.py
@@ -122,7 +122,6 @@ class DPN(object):
            initializer=fluid.initializer.Uniform(-stdv, stdv))
        fc6 = fluid.layers.fc(input=pool5,
                              size=class_dim,
-                              act='softmax',
                              param_attr=param_attr,
                              name="fc6")

--- a/fluid/PaddleCV/image_classification/models_name/inception_v4.py
+++ b/fluid/PaddleCV/image_classification/models_name/inception_v4.py
@@ -48,7 +48,6 @@ class InceptionV4():
        out = fluid.layers.fc(
            input=drop,
            size=class_dim,
-            act='softmax',
            param_attr=ParamAttr(
                initializer=fluid.initializer.Uniform(-stdv, stdv),
                name="final_fc_weights"),

--- a/fluid/PaddleCV/image_classification/models_name/mobilenet.py
+++ b/fluid/PaddleCV/image_classification/models_name/mobilenet.py
@@ -130,7 +130,6 @@ class MobileNet():
        output = fluid.layers.fc(input=input,
                                 size=class_dim,
-                                 act='softmax',
                                 param_attr=ParamAttr(
                                     initializer=MSRA(), name="fc7_weights"),
                                 bias_attr=ParamAttr(name="fc7_offset"))

--- a/fluid/PaddleCV/image_classification/models_name/mobilenet_v2.py
+++ b/fluid/PaddleCV/image_classification/models_name/mobilenet_v2.py
@@ -80,7 +80,6 @@ class MobileNetV2():
        output = fluid.layers.fc(input=input,
                                 size=class_dim,
-                                 act='softmax',
                                 param_attr=ParamAttr(name='fc10_weights'),
                                 bias_attr=ParamAttr(name='fc10_offset'))
        return output

--- a/fluid/PaddleCV/image_classification/models_name/resnet.py
+++ b/fluid/PaddleCV/image_classification/models_name/resnet.py
@@ -74,7 +74,6 @@ class ResNet():
        stdv = 1.0 / math.sqrt(pool.shape[1] * 1.0)
        out = fluid.layers.fc(input=pool,
                              size=class_dim,
-                              act='softmax',
                              param_attr=fluid.param_attr.ParamAttr(
                                  initializer=fluid.initializer.Uniform(-stdv,
                                                                        stdv)))

--- a/fluid/PaddleCV/image_classification/models_name/se_resnext.py
+++ b/fluid/PaddleCV/image_classification/models_name/se_resnext.py
@@ -123,7 +123,6 @@ class SE_ResNeXt():
        out = fluid.layers.fc(
            input=drop,
            size=class_dim,
-            act='softmax',
            param_attr=ParamAttr(
                initializer=fluid.initializer.Uniform(-stdv, stdv),
                name='fc6_weights'),

--- a/fluid/PaddleCV/image_classification/models_name/shufflenet_v2.py
+++ b/fluid/PaddleCV/image_classification/models_name/shufflenet_v2.py
@@ -97,7 +97,6 @@ class ShuffleNetV2():
        output = fluid.layers.fc(input=pool_last,
                                 size=class_dim,
-                                 act='softmax',
                                 param_attr=ParamAttr(
                                     initializer=MSRA(), name='fc6_weights'),
                                 bias_attr=ParamAttr(name='fc6_offset'))

--- a/fluid/PaddleCV/image_classification/models_name/vgg.py
+++ b/fluid/PaddleCV/image_classification/models_name/vgg.py
@@ -61,7 +61,6 @@ class VGGNet():
        out = fluid.layers.fc(
            input=fc2,
            size=class_dim,
-            act='softmax',
            param_attr=fluid.param_attr.ParamAttr(name=fc_name[2] + "_weights"),
            bias_attr=fluid.param_attr.ParamAttr(name=fc_name[2] + "_offset"))

--- a/fluid/PaddleCV/image_classification/reader.py
+++ b/fluid/PaddleCV/image_classification/reader.py
@@ -130,16 +130,19 @@ def _reader_creator(file_list,
                    shuffle=False,
                    color_jitter=False,
                    rotate=False,
-                    data_dir=DATA_DIR):
+                    data_dir=DATA_DIR,
+                    pass_id_as_seed=0):
    def reader():
        with open(file_list) as flist:
            full_lines = [line.strip() for line in flist]
            if shuffle:
+                if pass_id_as_seed:
+                    np.random.seed(pass_id_as_seed)
                np.random.shuffle(full_lines)
            if mode == 'train' and os.getenv('PADDLE_TRAINING_ROLE'):
                 # distributed mode if the env var `PADDLE_TRAINING_ROLE` exists
                trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
-                trainer_count = int(os.getenv("PADDLE_TRAINERS", "1"))
+                trainer_count = int(os.getenv("PADDLE_TRAINERS_NUM", "1"))
                per_node_lines = len(full_lines) // trainer_count
                lines = full_lines[trainer_id * per_node_lines:(trainer_id + 1)
                                   * per_node_lines]
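The reader change above seeds the shuffle with the pass id so that every trainer shuffles the file list identically before taking its own slice. The sketch below reproduces that logic on a plain list to show that the resulting shards are disjoint; the function name `shard_for_trainer` and the toy file list are assumptions for illustration.

```python
import numpy as np


def shard_for_trainer(lines, trainer_id, trainer_count, pass_id_as_seed=0):
    """Shuffle with a shared seed, then return this trainer's slice."""
    full_lines = list(lines)
    if pass_id_as_seed:
        # Same seed on every trainer gives the same permutation.
        np.random.seed(pass_id_as_seed)
    np.random.shuffle(full_lines)
    per_node_lines = len(full_lines) // trainer_count
    start = trainer_id * per_node_lines
    return full_lines[start:start + per_node_lines]


file_list = ["img_%02d.jpg 0" % i for i in range(8)]
shard0 = shard_for_trainer(file_list, 0, 2, pass_id_as_seed=3)
shard1 = shard_for_trainer(file_list, 1, 2, pass_id_as_seed=3)
print(sorted(shard0 + shard1) == sorted(file_list))
```

As in the original code, a seed of 0 is treated as "no seed", so unseeded trainers may shuffle differently and their shards can overlap.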
@@ -166,7 +169,7 @@ def _reader_creator(file_list,
    return paddle.reader.xmap_readers(mapper, reader, THREAD, BUF_SIZE)
-def train(data_dir=DATA_DIR):
+def train(data_dir=DATA_DIR, pass_id_as_seed=0):
    file_list = os.path.join(data_dir, 'train_list.txt')
    return _reader_creator(
        file_list,
@@ -174,7 +177,8 @@ def train(data_dir=DATA_DIR):
        shuffle=True,
        color_jitter=False,
        rotate=False,
-        data_dir=data_dir)
+        data_dir=data_dir,
+        pass_id_as_seed=pass_id_as_seed)
 def val(data_dir=DATA_DIR):

--- a/fluid/PaddleCV/image_classification/reader_cv2.py
+++ b/fluid/PaddleCV/image_classification/reader_cv2.py
@@ -101,8 +101,6 @@ def process_image(sample,
    std = [0.229, 0.224, 0.225] if std is None else std
    img_path = sample[0]
-    print('&' * 80)
-    print(img_path)
    img = cv2.imread(img_path)
    if mode == 'train':

--- a/fluid/PaddleCV/image_classification/run.sh
+++ b/fluid/PaddleCV/image_classification/run.sh
@@ -78,3 +78,58 @@ python train.py \
 #	--num_epochs=120 \
 #       --lr=0.1
+#ResNet152:
+#python train.py \
+#       --model=ResNet152 \
+#       --batch_size=256 \
+#       --total_images=1281167 \
+#       --image_shape=3,224,224 \
+#       --lr_strategy=piecewise_decay \
+#       --lr=0.1 \
+#       --num_epochs=120 \
+#       --l2_decay=1e-4 \(TODO)
+#SE_ResNeXt50:
+#python train.py \
+#       --model=SE_ResNeXt50 \
+#       --batch_size=400 \
+#       --total_images=1281167 \
+#       --image_shape=3,224,224 \
+#       --lr_strategy=cosine_decay \
+#       --lr=0.1 \
+#       --num_epochs=200 \
+#       --l2_decay=12e-5 \(TODO)
+#SE_ResNeXt101:
+#python train.py \
+#        --model=SE_ResNeXt101 \
+#        --batch_size=400 \
+#        --total_images=1281167 \
+#        --image_shape=3,224,224 \
+#        --lr_strategy=cosine_decay \
+#        --lr=0.1 \
+#        --num_epochs=200 \
+#        --l2_decay=15e-5 \(TODO)
+#VGG11:
+#python train.py \
+#        --model=VGG11 \
+#        --batch_size=512 \
+#        --total_images=1281167 \
+#        --image_shape=3,224,224 \
+#        --lr_strategy=cosine_decay \
+#        --lr=0.1 \
+#        --num_epochs=90 \
+#        --l2_decay=2e-4 \(TODO)
+#VGG13:
+#python train.py \
+#        --model=VGG13 \
+#        --batch_size=256 \
+#        --total_images=1281167 \
+#        --image_shape=3,224,224 \
+#        --lr_strategy=cosine_decay \
+#        --lr=0.01 \
+#        --num_epochs=90 \
+#        --l2_decay=3e-4 \(TODO)
--- a/fluid/PaddleCV/image_classification/train.py
+++ b/fluid/PaddleCV/image_classification/train.py
@@ -17,6 +17,7 @@ import functools
 import subprocess
 import utils
 from utils.learning_rate import cosine_decay
+from utils.fp16_utils import create_master_params_grads, master_param_to_train_param
 from utility import add_arguments, print_arguments
 import models
 import models_name
@@ -40,7 +41,9 @@ add_arg('model',            str,   "SE_ResNeXt50_32x4d", "Set the network to use
 add_arg('enable_ce',        bool,  False,                "If set True, enable continuous evaluation job.")
 add_arg('data_dir',         str,   "./data/ILSVRC2012",  "The ImageNet dataset root dir.")
 add_arg('model_category',   str,   "models",             "Whether to use models_name or not, valid value:'models','models_name'" )
-# yapf: enabl
+add_arg('fp16',             bool,  False,                "Enable half precision training with fp16." )
+add_arg('scale_loss',       float, 1.0,                  "Scale loss for fp16." )
+# yapf: enable
 def set_models(model):
@@ -145,12 +148,15 @@ def net_config(image, label, model, args):
        acc_top1 = fluid.layers.accuracy(input=out0, label=label, k=1)
        acc_top5 = fluid.layers.accuracy(input=out0, label=label, k=5)
    else:
-        out = model.net(input=image, class_dim=class_dim)
+        out = model.net(input=image, class_dim=class_dim)
-        cost = fluid.layers.cross_entropy(input=out, label=label)
+        cost, pred = fluid.layers.softmax_with_cross_entropy(out, label, return_softmax=True)
+        if args.scale_loss > 1:
+            avg_cost = fluid.layers.mean(x=cost) * float(args.scale_loss)
+        else:
+            avg_cost = fluid.layers.mean(x=cost)
-        avg_cost = fluid.layers.mean(x=cost)
+        acc_top1 = fluid.layers.accuracy(input=pred, label=label, k=1)
-        acc_top1 = fluid.layers.accuracy(input=out, label=label, k=1)
+        acc_top5 = fluid.layers.accuracy(input=pred, label=label, k=5)
-        acc_top5 = fluid.layers.accuracy(input=out, label=label, k=5)
    return avg_cost, acc_top1, acc_top5
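Illustrative note, not part of the patch: the hunk above switches to `softmax_with_cross_entropy(..., return_softmax=True)` and multiplies the mean loss by `scale_loss`. The NumPy sketch below shows the arithmetic this presumably corresponds to. The names `softmax_with_cross_entropy_np` and `scaled_mean_loss` are hypothetical, and the exact numerics of the Fluid op are an assumption.

```python
import numpy as np


def softmax_with_cross_entropy_np(logits, labels):
    """Return (per-sample loss, softmax) for integer class labels."""
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    softmax = exp / exp.sum(axis=1, keepdims=True)
    # Probability assigned to the correct class of each sample.
    picked = softmax[np.arange(len(labels)), labels]
    return -np.log(picked), softmax


def scaled_mean_loss(loss, scale_loss=1.0):
    """Mirror the patch: scale the mean loss only when scale_loss > 1."""
    mean = loss.mean()
    return mean * float(scale_loss) if scale_loss > 1 else mean


logits = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]], dtype=np.float32)
labels = np.array([0, 1])
loss, pred = softmax_with_cross_entropy_np(logits, labels)
```

Because softmax is monotonic within each row, top-k accuracy computed on `pred` matches top-k accuracy computed on the raw logits.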
@@ -171,6 +177,8 @@ def build_program(is_train, main_prog, startup_prog, args):
            use_double_buffer=True)
        with fluid.unique_name.guard():
            image, label = fluid.layers.read_file(py_reader)
+            if args.fp16:
+                image = fluid.layers.cast(image, "float16")
            avg_cost, acc_top1, acc_top5 = net_config(image, label, model, args)
            avg_cost.persistable = True
            acc_top1.persistable = True
@@ -184,7 +192,15 @@ def build_program(is_train, main_prog, startup_prog, args):
                params["learning_strategy"]["name"] = args.lr_strategy
                optimizer = optimizer_setting(params)
-                optimizer.minimize(avg_cost)
+                if args.fp16:
+                    params_grads = optimizer.backward(avg_cost)
+                    master_params_grads = create_master_params_grads(
+                        params_grads, main_prog, startup_prog, args.scale_loss)
+                    optimizer.apply_gradients(master_params_grads)
+                    master_param_to_train_param(master_params_grads, params_grads, main_prog)
+                else:
+                    optimizer.minimize(avg_cost)
    return py_reader, avg_cost, acc_top1, acc_top5
@@ -200,7 +216,6 @@ def train(args):
    startup_prog = fluid.Program()
    train_prog = fluid.Program()
    test_prog = fluid.Program()
    if args.enable_ce:
        startup_prog.random_seed = 1000
        train_prog.random_seed = 1000
@@ -240,10 +255,10 @@ def train(args):
    if visible_device:
        device_num = len(visible_device.split(','))
    else:
-        device_num = subprocess.check_output(['nvidia-smi', '-L']).count('\n')
+        device_num = subprocess.check_output(['nvidia-smi', '-L']).decode().count('\n')
    train_batch_size = args.batch_size / device_num
-    test_batch_size = 8
+    test_batch_size = 16
    if not args.enable_ce:
        train_reader = paddle.batch(
            reader.train(), batch_size=train_batch_size, drop_last=True)
@@ -307,7 +322,7 @@ def train(args):
        train_loss = np.array(train_info[0]).mean()
        train_acc1 = np.array(train_info[1]).mean()
        train_acc5 = np.array(train_info[2]).mean()
-        train_speed = np.array(train_time).mean() / train_batch_size
+        train_speed = np.array(train_time).mean() / (train_batch_size * device_num)
        test_py_reader.start()
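Illustrative note, not part of the patch: `subprocess.check_output` returns `bytes` on Python 3, which is why the hunk above adds `.decode()` before counting newlines. A minimal sketch of the device-count logic follows. `count_devices` and `per_device_batch_size` are hypothetical helpers, and the sample `nvidia-smi -L` output is made up for illustration. The sketch uses `//` for the per-device batch size, whereas the script uses `/`, which yields a float on Python 3.

```python
def count_devices(visible_devices, smi_output):
    """Count GPUs from CUDA_VISIBLE_DEVICES, else from `nvidia-smi -L` output.

    visible_devices: value of CUDA_VISIBLE_DEVICES (str) or None.
    smi_output: raw bytes as returned by subprocess.check_output.
    """
    if visible_devices:
        return len(visible_devices.split(','))
    # bytes.count('\n') raises TypeError on Python 3, so decode first.
    return smi_output.decode().count('\n')


def per_device_batch_size(batch_size, device_num):
    """Integer batch size per device (assumption: floor division is intended)."""
    return batch_size // device_num


# Made-up sample of what `nvidia-smi -L` might print for two GPUs.
sample_output = b"GPU 0: Example GPU (UUID: GPU-0000)\nGPU 1: Example GPU (UUID: GPU-0001)\n"
device_num = count_devices(None, sample_output)
```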

--- a/fluid/PaddleCV/image_classification/utils/__init__.py
+++ b/fluid/PaddleCV/image_classification/utils/__init__.py
 from .learning_rate import cosine_decay, lr_warmup
+from .fp16_utils import create_master_params_grads, master_param_to_train_param
--- a/fluid/PaddleCV/image_classification/utils/fp16_utils.py
+++ b/fluid/PaddleCV/image_classification/utils/fp16_utils.py
+from __future__ import print_function
+import paddle
+import paddle.fluid as fluid
+def cast_fp16_to_fp32(i, o, prog):
+    prog.global_block().append_op(
+        type="cast",
+        inputs={"X": i},
+        outputs={"Out": o},
+        attrs={
+            "in_dtype": fluid.core.VarDesc.VarType.FP16,
+            "out_dtype": fluid.core.VarDesc.VarType.FP32
+        }
+    )
+def cast_fp32_to_fp16(i, o, prog):
+    prog.global_block().append_op(
+        type="cast",
+        inputs={"X": i},
+        outputs={"Out": o},
+        attrs={
+            "in_dtype": fluid.core.VarDesc.VarType.FP32,
+            "out_dtype": fluid.core.VarDesc.VarType.FP16
+        }
+    )
+def copy_to_master_param(p, block):
+    v = block.vars.get(p.name, None)
+    if v is None:
+        raise ValueError("no param name %s found!" % p.name)
+    new_p = fluid.framework.Parameter(
+        block=block,
+        shape=v.shape,
+        dtype=fluid.core.VarDesc.VarType.FP32,
+        type=v.type,
+        lod_level=v.lod_level,
+        stop_gradient=p.stop_gradient,
+        trainable=p.trainable,
+        optimize_attr=p.optimize_attr,
+        regularizer=p.regularizer,
+        gradient_clip_attr=p.gradient_clip_attr,
+        error_clip=p.error_clip,
+        name=v.name + ".master")
+    return new_p
+def create_master_params_grads(params_grads, main_prog, startup_prog, scale_loss):
+    master_params_grads = []
+    tmp_role = main_prog._current_role
+    OpRole = fluid.core.op_proto_and_checker_maker.OpRole
+    main_prog._current_role = OpRole.Backward
+    for p, g in params_grads:
+        # create master parameters
+        master_param = copy_to_master_param(p, main_prog.global_block())
+        startup_master_param = startup_prog.global_block()._clone_variable(master_param)
+        startup_p = startup_prog.global_block().var(p.name)
+        cast_fp16_to_fp32(startup_p, startup_master_param, startup_prog)
+        # cast fp16 gradients to fp32 before apply gradients
+        if g.name.startswith("batch_norm"):
+            if scale_loss > 1:
+                scaled_g = g / float(scale_loss)
+            else:
+                scaled_g = g
+            master_params_grads.append([p, scaled_g])
+            continue
+        master_grad = fluid.layers.cast(g, "float32")
+        if scale_loss > 1:
+            master_grad = master_grad / float(scale_loss)
+        master_params_grads.append([master_param, master_grad])
+    main_prog._current_role = tmp_role
+    return master_params_grads
+def master_param_to_train_param(master_params_grads, params_grads, main_prog):
+    for idx, m_p_g in enumerate(master_params_grads):
+        train_p, _ = params_grads[idx]
+        if train_p.name.startswith("batch_norm"):
+            continue
+        with main_prog._optimized_guard([m_p_g[0], m_p_g[1]]):
+            cast_fp32_to_fp16(m_p_g[0], train_p, main_prog)
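Illustrative note, not part of the patch: `fp16_utils.py` keeps FP32 "master" copies of the parameters and divides the gradients by `scale_loss` after casting them to FP32. The NumPy sketch below shows why both steps matter, under the assumption that the goal is the usual mixed-precision recipe. The values `1e-8`, `1024.0` and `1e-4` are arbitrary illustration numbers, not taken from the patch.

```python
import numpy as np

scale_loss = 1024.0
true_grad = 1e-8  # a gradient too small to represent in float16

# Without loss scaling the gradient underflows to zero in float16.
unscaled_fp16 = np.float16(true_grad)

# With loss scaling the float16 gradient survives; unscale it in float32.
scaled_fp16 = np.float16(true_grad * scale_loss)
master_grad = np.float32(scaled_fp16) / np.float32(scale_loss)

# Small updates vanish when accumulated directly in float16 ...
step = 1e-4
fp16_only = np.float16(1.0)
master = np.float32(1.0)
for _ in range(10):
    fp16_only = np.float16(fp16_only - np.float16(step))
    # ... but accumulate correctly in the float32 master copy.
    master = np.float32(master - np.float32(step))

# The float16 training copy is refreshed from the master copy.
train_copy = np.float16(master)
```

After ten steps `fp16_only` has not moved from 1.0, while the copy cast back from the FP32 master has.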
--- a/fluid/PaddleCV/metric_learning/README.md
+++ b/fluid/PaddleCV/metric_learning/README.md
 # Deep Metric Learning
-Metric learning is a kind of methods to learn discriminative features for each sample, with the purpose that intra-class samples have smaller distances while inter-class samples have larger distances in the learned space. With the develop of deep learning technique, metric learning methods are combined with deep neural networks to boost the performance of traditional tasks, such as face recognition/verification, human re-identification, image retrieval and so on. In this page, we introduce the way to implement deep metric learning using PaddlePaddle Fluid, including [data preparation](#data-preparation), [training](#training-a-model), [finetuning](#finetuning), [evaluation](#evaluation) and [inference](#inference).
+Metric learning is a family of methods for learning discriminative features for each sample, so that intra-class samples are closer together and inter-class samples are farther apart in the learned space. With the development of deep learning techniques, metric learning methods have been combined with deep neural networks to boost performance on tasks such as face recognition/verification, person re-identification, and image retrieval. This page introduces how to implement deep metric learning with PaddlePaddle Fluid, including [data preparation](#data-preparation), [training](#training-metric-learning-models), [finetuning](#finetuning), [evaluation](#evaluation), [inference](#inference) and [performances](#performances).
 ---
 ## Table of Contents
 - [Installation](#installation)
 - [Data preparation](#data-preparation)
- [Training metric learning models](#training-a-model)
+- [Training metric learning models](#training-metric-learning-models)
 - [Finetuning](#finetuning)
 - [Evaluation](#evaluation)
 - [Inference](#inference)
- [Performances](#supported-models)
+- [Performances](#performances)
 ## Installation
@@ -17,7 +17,7 @@ Running sample code in this directory requires PaddelPaddle Fluid v0.14.0 and la
 ## Data preparation
-Stanford Online Product(SOP) dataset contains 120,053 images of 22,634 products downloaded from eBay.com. We use it to conduct the metric learning experiments. For training, 59,5511 out of 11,318 classes are used, and 11,316 classes(60,502 images) are held out for testing. First of all, preparation of SOP data can be done as:
+The Stanford Online Product (SOP) dataset contains 120,053 images of 22,634 products downloaded from eBay.com. We use it to conduct the metric learning experiments. For training, 59,551 images from 11,318 classes are used, and 11,316 classes (60,502 images) are held out for testing. The SOP data can be prepared as follows:
 ```
 cd data/
 sh download_sop.sh
@@ -25,7 +25,7 @@ sh download_sop.sh
 ## Training metric learning models
-To train a metric learning model, one need to set the neural network as backbone and the metric loss function to optimize. We train meiric learning model using softmax or [arcmargin](https://arxiv.org/abs/1801.07698) loss firstly, and then fine-turned the model using other metric learning loss, such as triplet, [quadruplet](https://arxiv.org/abs/1710.00478) and [eml](https://arxiv.org/abs/1212.6094) loss. One example of training using arcmargin loss is shown below:
+To train a metric learning model, one needs to choose a neural network as the backbone and a metric loss function to optimize. We first train the metric learning model with softmax or arcmargin loss, and then fine-tune it with another metric learning loss, such as triplet, quadruplet or eml loss. An example of training with arcmargin loss is shown below:
 ```
@@ -52,7 +52,7 @@ python train_elem.py  \
 * **use_gpu**: whether to use GPU or not. Default: True.
 * **pretrained_model**: model path for pretraining. Default: None.
 * **model_save_dir**: the directory to save trained model. Default: "output".
-* **loss_name**: loss fortraining model. Default: "softmax".
+* **loss_name**: loss for training model. Default: "softmax".
 * **arc_scale**: parameter of arcmargin loss. Default: 80.0.
 * **arc_margin**: parameter of arcmargin loss. Default: 0.15.
 * **arc_easy_margin**: parameter of arcmargin loss. Default: False.
@@ -103,3 +103,9 @@ For comparation, many metric learning models with different neural networks and
 |fine-tuned with triplet | 78.37% | 79.21%
 |fine-tuned with quadruplet | 78.10% | 79.59%
 |fine-tuned with eml | 79.32% | 80.11%
+## Reference
+- ArcFace: Additive Angular Margin Loss for Deep Face Recognition [link](https://arxiv.org/abs/1801.07698)
+- Margin Sample Mining Loss: A Deep Learning Based Method for Person Re-identification [link](https://arxiv.org/abs/1710.00478)
+- Large Scale Strongly Supervised Ensemble Metric Learning, with Applications to Face Verification and Retrieval [link](https://arxiv.org/abs/1212.6094)
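Illustrative note, not part of the patch: the performance table in this README reports Recall@Rank-1. A minimal NumPy sketch of that metric is given below, assuming squared Euclidean distance on the extracted features. The repository's `eval.py` may use a different distance or normalization, and `recall_at_1` is a hypothetical name.

```python
import numpy as np


def recall_at_1(features, labels):
    """Fraction of samples whose nearest neighbour (excluding itself) shares its label."""
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances via |a|^2 + |b|^2 - 2 a.b
    sq = (features ** 2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * features @ features.T
    # A sample must not retrieve itself.
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    return float((labels[nearest] == labels).mean())


feats = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0], [0.0, 2.1]]
labs = [0, 0, 1, 1, 1]
score = recall_at_1(feats, labs)
```

In this toy example the last point belongs to class 1 but sits next to a class 0 point, so four of the five queries retrieve a correct neighbour.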
--- a/fluid/PaddleCV/metric_learning/README_cn.md
+++ b/fluid/PaddleCV/metric_learning/README_cn.md
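+# Deep Metric Learning
+Metric learning is a family of methods for learning discriminative features for sample pairs, so that in the feature space samples of the same class have smaller distances and samples of different classes have larger distances. With the development of deep learning, metric learning methods based on deep neural networks have substantially improved performance on many vision tasks, such as face recognition, face verification, person re-identification, and image retrieval. This page introduces several metric learning methods implemented in PaddlePaddle Fluid and how to use them, including [data preparation](#data-preparation), [training](#training-metric-learning-models), [finetuning](#finetuning), [evaluation](#evaluation) and [inference](#inference).
+---
+## Table of Contents
+- [Installation](#installation)
+- [Data preparation](#data-preparation)
+- [Training metric learning models](#training-metric-learning-models)
+- [Finetuning](#finetuning)
+- [Evaluation](#evaluation)
+- [Inference](#inference)
+- [Performances](#performances)
+## Installation
+Running the sample code in this directory requires PaddlePaddle Fluid v0.14.0 or later. If the PaddlePaddle version on your device is lower than v0.14.0, please follow the [installation document](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html) to install or update it.
+## Data preparation
+The Stanford Online Product (SOP) dataset, downloaded from eBay, contains 120,053 product images in 22,634 classes. We use it for our experiments. For training, 59,551 images from 11,318 classes are used; for testing, 60,502 images from 11,316 classes are used. The SOP dataset can be downloaded with the following script:
+```
+cd data/
+sh download_sop.sh
+```
+## Training metric learning models
+To train a metric learning model, we need a neural network as the backbone (e.g. ResNet50) and a metric learning loss function to optimize. We first train with softmax or arcmargin, and then fine-tune with another loss function, such as triplet, quadruplet or eml. An example of training with arcmargin is shown below:
+```
+python train_elem.py  \
+        --model=ResNet50 \
+        --train_batch_size=256 \
+        --test_batch_size=50 \
+        --lr=0.01 \
+        --total_iter_num=30000 \
+        --use_gpu=True \
+        --pretrained_model=${path_to_pretrain_imagenet_model} \
+        --model_save_dir=${output_model_path} \
+        --loss_name=arcmargin \
+        --arc_scale=80.0 \
+        --arc_margin=0.15 \
+        --arc_easy_margin=False
+```
+**Parameters:**
+* **model**: name of the model to use. Default: "ResNet50".
+* **train_batch_size**: mini-batch size for training. Default: 256.
+* **test_batch_size**: mini-batch size for testing. Default: 50.
+* **lr**: initial learning rate. Default: 0.01.
+* **total_iter_num**: total number of training iterations. Default: 30000.
+* **use_gpu**: whether to use GPU or not. Default: True.
+* **pretrained_model**: path of the pretrained model. Default: None.
+* **model_save_dir**: directory in which to save the model. Default: "output".
+* **loss_name**: loss function to optimize. Default: "softmax".
+* **arc_scale**: parameter of arcmargin. Default: 80.0.
+* **arc_margin**: parameter of arcmargin. Default: 0.15.
+* **arc_easy_margin**: parameter of arcmargin. Default: False.
+## Finetuning
+Finetuning loads an existing model and fine-tunes the network on a specified task. After training the network with softmax or arcmargin, you can continue to fine-tune it with triplet, quadruplet or eml. An example of fine-tuning with eml is shown below:
+```
+python train_pair.py  \
+        --model=ResNet50 \
+        --train_batch_size=160 \
+        --test_batch_size=50 \
+        --lr=0.0001 \
+        --total_iter_num=100000 \
+        --use_gpu=True \
+        --pretrained_model=${path_to_pretrain_arcmargin_model} \
+        --model_save_dir=${output_model_path} \
+        --loss_name=eml \
+        --samples_each_class=2
+```
+## Evaluation
+Evaluation mainly measures the retrieval performance of the model. You need to set ```path_to_pretrain_model```. The following command computes Recall@Rank-1.
+```
+python eval.py \
+       --model=ResNet50 \
+       --batch_size=50 \
+       --pretrained_model=${path_to_pretrain_model}
+```
+## Inference
+Inference extracts features of image data with a trained network. An example is shown below:
+```
+python infer.py \
+       --model=ResNet50 \
+       --batch_size=1 \
+       --pretrained_model=${path_to_pretrain_model}
+```
+## Performances
+The retrieval results of several metric learning loss functions on the SOP dataset are listed below, evaluated by Recall@Rank-1.
+|pretrained model | softmax | arcmargin
+|- | - | -:
+|without fine-tuning | 77.42% | 78.11%
+|fine-tuned with triplet | 78.37% | 79.21%
+|fine-tuned with quadruplet | 78.10% | 79.59%
+|fine-tuned with eml | 79.32% | 80.11%
+## Reference
+- ArcFace: Additive Angular Margin Loss for Deep Face Recognition [link](https://arxiv.org/abs/1801.07698)
+- Margin Sample Mining Loss: A Deep Learning Based Method for Person Re-identification [link](https://arxiv.org/abs/1710.00478)
+- Large Scale Strongly Supervised Ensemble Metric Learning, with Applications to Face Verification and Retrieval [link](https://arxiv.org/abs/1212.6094)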
--- a/fluid/PaddleCV/metric_learning/reader.py
+++ b/fluid/PaddleCV/metric_learning/reader.py
@@ -63,6 +63,7 @@ def common_iterator(data, settings):
    assert (batch_size % samples_each_class == 0)
    class_num = batch_size // samples_each_class 
    def train_iterator():
+        count = 0
        labs = list(data.keys())
        lab_num = len(labs)
        ind = list(range(0, lab_num))
@@ -79,6 +80,9 @@ def common_iterator(data, settings):
                for anchor_ind_i in anchor_ind:
                    anchor_path = DATA_DIR + data_list[anchor_ind_i]
                    yield anchor_path, lab
+            count += 1
+            if count >= settings.total_iter_num + 1:
+                return
    return train_iterator
@@ -86,6 +90,8 @@ def triplet_iterator(data, settings):
    batch_size = settings.train_batch_size
    assert (batch_size % 3 == 0)
    def train_iterator():
+        total_count = settings.train_batch_size * (settings.total_iter_num + 1)
+        count = 0
        labs = list(data.keys())
        lab_num = len(labs)
        ind = list(range(0, lab_num))
@@ -108,16 +114,24 @@ def triplet_iterator(data, settings):
            yield pos_path, lab_pos
            neg_path = DATA_DIR + neg_data_list[neg_ind]
            yield neg_path, lab_neg
+            count += 3
+            if count >= total_count:
+                return
    return train_iterator
 def arcmargin_iterator(data, settings):
    def train_iterator():
+        total_count = settings.train_batch_size * (settings.total_iter_num + 1)
+        count = 0
        while True:
            for items in data:
                path, label = items
                path = DATA_DIR + path
                yield path, label
+                count += 1
+                if count >= total_count:
+                    return
    return train_iterator
 def image_iterator(data, mode):
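Illustrative note, not part of the patch: the hunks above turn the endless training iterators into bounded ones by counting yielded samples and returning once `train_batch_size * (total_iter_num + 1)` samples have been produced. A self-contained sketch of that pattern, modelled on `arcmargin_iterator`, is shown below. `bounded_cycle` is a hypothetical name, and `data` is assumed to be a non-empty list of `(path, label)` pairs.

```python
def bounded_cycle(data, batch_size, total_iter_num):
    """Cycle over data, stopping after batch_size * (total_iter_num + 1) samples."""
    total_count = batch_size * (total_iter_num + 1)

    def train_iterator():
        count = 0
        # Assumes data is non-empty; an empty list would loop forever.
        while True:
            for path, label in data:
                yield path, label
                count += 1
                if count >= total_count:
                    return

    return train_iterator


data = [("a.jpg", 0), ("b.jpg", 1), ("c.jpg", 2)]
samples = list(bounded_cycle(data, batch_size=2, total_iter_num=3)())
```

With a batch size of 2 and 3 iterations the iterator yields 2 * (3 + 1) = 8 samples, wrapping around the 3-item list as needed.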

--- a/fluid/PaddleCV/object_detection/README.md
+++ b/fluid/PaddleCV/object_detection/README.md
@@ -21,9 +21,7 @@ SSD is readily pluggable into a wide variant standard convolutional network, suc
 ### Data Preparation
-You can use [PASCAL VOC dataset](http://host.robots.ox.ac.uk/pascal/VOC/) or [MS-COCO dataset](http://cocodataset.org/#download).
+Please download the [PASCAL VOC dataset](http://host.robots.ox.ac.uk/pascal/VOC/) first; skip this step if you already have it.
-If you want to train a model on PASCAL VOC dataset, please download dataset at first, skip this step if you already have one.
 ```bash
 cd data/pascalvoc
@@ -32,30 +30,18 @@ cd data/pascalvoc
 The command `download.sh` also will create training and testing file lists.
-If you want to train a model on MS-COCO dataset, please download dataset at first, skip this step if you already have one.
-```
-cd data/coco
-./download.sh
-```
 ### Train
 #### Download the Pre-trained Model.
-We provide two pre-trained models. The one is MobileNet-v1 SSD trained on COCO dataset, but removed the convolutional predictors for COCO dataset. This model can be used to initialize the models when training other datasets, like PASCAL VOC. The other pre-trained model is MobileNet-v1 trained on ImageNet 2012 dataset but removed the last weights and bias in the Fully-Connected layer.
+We provide two pre-trained models. The first is MobileNet-v1 SSD trained on the COCO dataset, with the COCO-specific convolutional predictors removed. This model can be used to initialize models when training on other datasets, such as PASCAL VOC. The other is MobileNet-v1 trained on the ImageNet 2012 dataset, with the weights and bias of the last fully connected layer removed. Download MobileNet-v1 SSD:
-Declaration: the MobileNet-v1 SSD model is converted by [TensorFlow model](https://github.com/tensorflow/models/blob/f87a58cd96d45de73c9a8330a06b2ab56749a7fa/research/object_detection/g3doc/detection_model_zoo.md). The MobileNet-v1 model is converted from [Caffe](https://github.com/shicai/MobileNet-Caffe).
-We will release the pre-trained models by ourself in the upcoming soon.
-  - Download MobileNet-v1 SSD:
    ```bash
    ./pretrained/download_coco.sh
    ```
-  - Download MobileNet-v1:
-    ```bash
+Declaration: the MobileNet-v1 SSD model is converted from the [TensorFlow model](https://github.com/tensorflow/models/blob/f87a58cd96d45de73c9a8330a06b2ab56749a7fa/research/object_detection/g3doc/detection_model_zoo.md).
-    ./pretrained/download_imagenet.sh
-    ```
 #### Train on PASCAL VOC
@@ -64,7 +50,6 @@ We will release the pre-trained models by ourself in the upcoming soon.
  python -u train.py --batch_size=64 --dataset='pascalvoc' --pretrained_model='pretrained/ssd_mobilenet_v1_coco/'
  ```
   - Set ```export CUDA_VISIBLE_DEVICES=0,1``` to specifiy the number of GPU you want to use.
-   - Set ```--dataset='coco2014'``` or ```--dataset='coco2017'``` to train model on MS COCO dataset.
   - For more help on arguments:
  ```bash
@@ -88,19 +73,6 @@ You can evaluate your trained model in different metrics like 11point, integral
 python eval.py --dataset='pascalvoc' --model_dir='train_pascal_model/best_model' --data_dir='data/pascalvoc' --test_list='test.txt' --ap_version='11point' --nms_threshold=0.45
 ```
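Illustrative note, not part of the patch: the `--ap_version='11point'` option above refers to 11-point interpolated average precision. A minimal sketch of the textbook (PASCAL VOC 2007 style) definition follows. Whether Fluid's `DetectionMAP` matches it in every detail is an assumption, and `eleven_point_ap` is a hypothetical name.

```python
def eleven_point_ap(recalls, precisions):
    """11-point interpolated AP from matching recall/precision lists."""
    ap = 0.0
    for i in range(11):
        threshold = i / 10.0  # 0.0, 0.1, ..., 1.0
        # Interpolated precision: best precision at any recall >= threshold.
        candidates = [p for r, p in zip(recalls, precisions) if r >= threshold]
        ap += max(candidates) if candidates else 0.0
    return ap / 11.0


# Precision 1.0 up to recall 0.5, then 0.5 at recall 1.0.
ap = eleven_point_ap([0.5, 1.0], [1.0, 0.5])
```

Here six thresholds (0.0 to 0.5) see precision 1.0 and five (0.6 to 1.0) see 0.5, so the result is 8.5 / 11.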
-You can set ```--dataset``` to ```coco2014``` or ```coco2017``` to evaluate COCO dataset. Moreover, we provide `eval_coco_map.py` which uses a COCO-specific mAP metric defined by [COCO committee](http://cocodataset.org/#detections-eval). To use this eval_coco_map.py, [cocoapi](https://github.com/cocodataset/cocoapi) is needed.
-Install the cocoapi:
-```
-# COCOAPI=/path/to/clone/cocoapi
-git clone https://github.com/cocodataset/cocoapi.git $COCOAPI
-cd $COCOAPI/PythonAPI
-# Install into global site-packages
-make install
-# Alternatively, if you do not have permissions or prefer
-# not to install the COCO API into global site-packages
-python2 setup.py install --user
-```
 ### Infer and Visualize
 `infer.py` is the main caller of the inferring module. Examples of usage are shown below.
 ```bash

--- a/fluid/PaddleCV/object_detection/README_cn.md
+++ b/fluid/PaddleCV/object_detection/README_cn.md
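@@ -21,9 +21,8 @@ SSD can be easily plugged into any standard convolutional network, such as VGG, Res
 ### Data Preparation
-You can use the [PASCAL VOC dataset](http://host.robots.ox.ac.uk/pascal/VOC/) or the [MS-COCO dataset](http://cocodataset.org/#download).
-If you want to train on the PASCAL VOC dataset, first download the dataset with the commands below.
+First download the [PASCAL VOC dataset](http://host.robots.ox.ac.uk/pascal/VOC/) with the commands below:
 ```bash
 cd data/pascalvoc
@@ -32,29 +31,19 @@ cd data/pascalvoc
 The `download.sh` command automatically creates the training and testing file lists.
-If you want to train on the MS-COCO dataset, first download the dataset with the commands below.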
-```
-cd data/coco
-./download.sh
-```
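 ### Train
 #### Download the Pre-trained Model
-We provide two pre-trained models. The first is MobileNet-v1 SSD pre-trained on the COCO dataset; its prediction heads have been removed so that it can be trained on datasets other than COCO. The second is MobileNet-v1 pre-trained on the ImageNet 2012 dataset; its last fully connected layer has also been removed so that it can be used for object detection training.
+We provide two pre-trained models. The first is MobileNet-v1 SSD pre-trained on the COCO dataset; its prediction heads have been removed so that it can be trained on datasets other than COCO. The second is MobileNet-v1 pre-trained on the ImageNet 2012 dataset; its last fully connected layer has also been removed so that it can be used for object detection training. Download MobileNet-v1 SSD:
-Declaration: the MobileNet-v1 SSD model is converted from the [TensorFlow model](https://github.com/tensorflow/models/blob/f87a58cd96d45de73c9a8330a06b2ab56749a7fa/research/object_detection/g3doc/detection_model_zoo.md). The MobileNet-v1 model is converted from [Caffe](https://github.com/shicai/MobileNet-Caffe). We will also release our own pre-trained models soon.
-  - Download MobileNet-v1 SSD:
    ```bash
    ./pretrained/download_coco.sh
    ```
-  - Download MobileNet-v1:
-    ```bash
+Declaration: the MobileNet-v1 SSD model is converted from the [TensorFlow model](https://github.com/tensorflow/models/blob/f87a58cd96d45de73c9a8330a06b2ab56749a7fa/research/object_detection/g3doc/detection_model_zoo.md). The MobileNet-v1 model is converted from [Caffe](https://github.com/shicai/MobileNet-Caffe).
-    ./pretrained/download_imagenet.sh
-    ```
 #### Train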
@@ -63,7 +52,6 @@ cd data/coco
  python -u train.py --batch_size=64 --dataset='pascalvoc' --pretrained_model='pretrained/ssd_mobilenet_v1_coco/'
  ```
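   - Set ```export CUDA_VISIBLE_DEVICES=0,1``` to specify the number of GPUs you want to use.
-   - Set ```--dataset='coco2014'``` or ```--dataset='coco2017'``` to train on the MS-COCO dataset.
   - For more optional arguments, see:
  ```bash
@@ -80,25 +68,13 @@ cd data/coco
 ### Evaluate
-You can evaluate the trained model on the PASCAL VOC and COCO datasets with metrics such as 11point and integral. Without loss of generality, the sample code uses each dataset's test list by default; you can also set ```--test_list``` to specify your own test list.
+You can evaluate the trained model on the PASCAL VOC dataset with metrics such as 11point and integral. Without loss of generality, the sample code uses the dataset's test list by default; you can also set ```--test_list``` to specify your own test list.
 `eval.py` is the main program of the evaluation module. An example call is shown below:
 ```bash
 python eval.py --dataset='pascalvoc' --model_dir='train_pascal_model/best_model' --data_dir='data/pascalvoc' --test_list='test.txt' --ap_version='11point' --nms_threshold=0.45
 ```
-You can set ```--dataset``` to ```coco2014``` or ```coco2017``` to evaluate on the COCO dataset. We also provide `eval_coco_map.py` for the [official COCO evaluation](http://cocodataset.org/#detections-eval). To use eval_coco_map.py, first download [cocoapi](https://github.com/cocodataset/cocoapi):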
-```
-# COCOAPI=/path/to/clone/cocoapi
-git clone https://github.com/cocodataset/cocoapi.git $COCOAPI
-cd $COCOAPI/PythonAPI
-# Install into global site-packages
-make install
-# Alternatively, if you do not have permissions or prefer
-# not to install the COCO API into global site-packages
-python2 setup.py install --user
-```
 ### Infer and Visualize
 `infer.py` is the main program of the inference and visualization module. An example call is shown below:

--- a/fluid/PaddleCV/object_detection/README_quant.md
+++ b/fluid/PaddleCV/object_detection/README_quant.md
@@ -2,7 +2,7 @@
 ### Introduction
-The quantization-aware training used in this experiments is introduced in [fixed-point quantization desigin](https://gthub.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/quantization/fixed_point_quantization.md). Since quantization-aware training is still an active area of research and experimentation,
+The quantization-aware training used in these experiments is introduced in [fixed-point quantization design](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/quantization/fixed_point_quantization.md). Since quantization-aware training is still an active area of research and experimentation,
 here we only give a simple example of quantization training in Fluid based on the MobileNet-SSD model. More experiments are still needed, such as quantization training that considers fusing batch normalization with convolution/fully-connected layers, channel-wise quantization of weights, and so on.
@@ -130,6 +130,9 @@ A Python transpiler is used to rewrite Fluid training program or evaluation prog
  ```
  See 002271.jpg for the visualized image with bounding boxes.
+  **Note**: to convert the model to 8-bit, call `fluid.contrib.QuantizeTranspiler.convert_to_int8`. However, Paddle cannot currently load an 8-bit model for inference.
 ### Results
 Results of MobileNet-v1-SSD 300x300 model on PascalVOC dataset.

--- a/fluid/PaddleCV/object_detection/_ce.py
+++ b/fluid/PaddleCV/object_detection/_ce.py
@@ -9,10 +9,10 @@ from kpi import CostKpi, DurationKpi, AccKpi
 train_cost_kpi = CostKpi('train_cost', 0.02, 0, actived=True)
 test_acc_kpi = AccKpi('test_acc', 0.01, 0, actived=False)
-train_speed_kpi = AccKpi('train_speed', 0.1, 0, actived=True)
+train_speed_kpi = DurationKpi('train_speed', 0.1, 0, actived=True, unit_repr="s/epoch")
 train_cost_card4_kpi = CostKpi('train_cost_card4', 0.02, 0, actived=True)
 test_acc_card4_kpi = AccKpi('test_acc_card4', 0.01, 0, actived=False)
-train_speed_card4_kpi = AccKpi('train_speed_card4', 0.1, 0, actived=True)
+train_speed_card4_kpi = DurationKpi('train_speed_card4', 0.1, 0, actived=True, unit_repr="s/epoch")
 tracking_kpis = [
    train_cost_kpi,

--- a/fluid/PaddleCV/object_detection/data_util.py
+++ b/fluid/PaddleCV/object_detection/data_util.py
-"""
-This code is based on https://github.com/fchollet/keras/blob/master/keras/utils/data_utils.py
-"""
-import time
-import numpy as np
-import threading
-import multiprocessing
-try:
-    import queue
-except ImportError:
-    import Queue as queue
-class GeneratorEnqueuer(object):
-    """
-    Builds a queue out of a data generator.
-    Args:
-        generator: a generator function which endlessly yields data
-        use_multiprocessing (bool): use multiprocessing if True,
-            otherwise use threading.
-        wait_time (float): time to sleep in-between calls to `put()`.
-        random_seed (int): Initial seed for workers,
-            will be incremented by one for each workers.
-    """
-    def __init__(self,
-                 generator,
-                 use_multiprocessing=False,
-                 wait_time=0.05,
-                 random_seed=None):
-        self.wait_time = wait_time
-        self._generator = generator
-        self._use_multiprocessing = use_multiprocessing
-        self._threads = []
-        self._stop_event = None
-        self.queue = None
-        self._manager = None
-        self.seed = random_seed
-    def start(self, workers=1, max_queue_size=10):
-        """
-        Start worker threads which add data from the generator into the queue.
-        Args:
-            workers (int): number of worker threads
-            max_queue_size (int): queue size
-                (when full, threads could block on `put()`)
-        """
-        def data_generator_task():
-            """
-            Data generator task.
-            """
-            def task():
-                if (self.queue is not None and
-                        self.queue.qsize() < max_queue_size):
-                    generator_output = next(self._generator)
-                    self.queue.put((generator_output))
-                else:
-                    time.sleep(self.wait_time)
-            if not self._use_multiprocessing:
-                while not self._stop_event.is_set():
-                    with self.genlock:
-                        try:
-                            task()
-                        except Exception:
-                            self._stop_event.set()
-                            break
-            else:
-                while not self._stop_event.is_set():
-                    try:
-                        task()
-                    except Exception:
-                        self._stop_event.set()
-                        break
-        try:
-            if self._use_multiprocessing:
-                self._manager = multiprocessing.Manager()
-                self.queue = self._manager.Queue(maxsize=max_queue_size)
-                self._stop_event = multiprocessing.Event()
-            else:
-                self.genlock = threading.Lock()
-                self.queue = queue.Queue()
-                self._stop_event = threading.Event()
-            for _ in range(workers):
-                if self._use_multiprocessing:
-                    # Reset random seed else all children processes
-                    # share the same seed
-                    np.random.seed(self.seed)
-                    thread = multiprocessing.Process(target=data_generator_task)
-                    thread.daemon = True
-                    if self.seed is not None:
-                        self.seed += 1
-                else:
-                    thread = threading.Thread(target=data_generator_task)
-                self._threads.append(thread)
-                thread.start()
-        except:
-            self.stop()
-            raise
-    def is_running(self):
-        """
-        Returns:
-            bool: Whether the worker theads are running.
-        """
-        return self._stop_event is not None and not self._stop_event.is_set()
-    def stop(self, timeout=None):
-        """
-        Stops running threads and wait for them to exit, if necessary.
-        Should be called by the same thread which called `start()`.
-        Args:
-            timeout(int|None): maximum time to wait on `thread.join()`.
-        """
-        if self.is_running():
-            self._stop_event.set()
-        for thread in self._threads:
-            if self._use_multiprocessing:
-                if thread.is_alive():
-                    thread.terminate()
-            else:
-                thread.join(timeout)
-        if self._manager:
-            self._manager.shutdown()
-        self._threads = []
-        self._stop_event = None
-        self.queue = None
-    def get(self):
-        """
-        Creates a generator to extract data from the queue.
-        Skip the data if it is `None`.
-        # Yields
-            tuple of data in the queue.
-        """
-        while self.is_running():
-            if not self.queue.empty():
-                inputs = self.queue.get()
-                if inputs is not None:
-                    yield inputs
-            else:
-                time.sleep(self.wait_time)
--- a/fluid/PaddleCV/object_detection/eval.py
+++ b/fluid/PaddleCV/object_detection/eval.py
@@ -52,7 +52,7 @@ def build_program(main_prog, startup_prog, args, data_args):
            nmsed_out = fluid.layers.detection_output(
                locs, confs, box, box_var, nms_threshold=args.nms_threshold)
            with fluid.program_guard(main_prog):
-                map = fluid.evaluator.DetectionMAP(
+                map = fluid.metrics.DetectionMAP(
                    nmsed_out,
                    gt_label,
                    gt_box,

--- a/fluid/PaddleCV/object_detection/eval_coco_map.py
+++ b/fluid/PaddleCV/object_detection/eval_coco_map.py
@@ -47,7 +47,7 @@ def eval(args, data_args, test_list, batch_size, model_dir=None):
    gt_iscrowd = fluid.layers.data(
        name='gt_iscrowd', shape=[1], dtype='int32', lod_level=1)
    gt_image_info = fluid.layers.data(
-        name='gt_image_id', shape=[3], dtype='int32', lod_level=1)
+        name='gt_image_id', shape=[3], dtype='int32')
    locs, confs, box, box_var = mobile_net(num_classes, image, image_shape)
    nmsed_out = fluid.layers.detection_output(
@@ -57,14 +57,14 @@ def eval(args, data_args, test_list, batch_size, model_dir=None):
    place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace()
    exe = fluid.Executor(place)
+    exe.run(fluid.default_startup_program())
    # yapf: disable
    if model_dir:
        def if_exist(var):
            return os.path.exists(os.path.join(model_dir, var.name))
        fluid.io.load_vars(exe, model_dir, predicate=if_exist)
    # yapf: enable
-    test_reader = paddle.batch(
+    test_reader = reader.test(data_args, test_list, batch_size)
-        reader.test(data_args, test_list), batch_size=batch_size)
    feeder = fluid.DataFeeder(
        place=place,
        feed_list=[image, gt_box, gt_label, gt_iscrowd, gt_image_info])
@@ -146,8 +146,7 @@ if __name__ == '__main__':
        mean_value=[args.mean_value_B, args.mean_value_G, args.mean_value_R],
        apply_distort=False,
        apply_expand=False,
-        ap_version=args.ap_version,
+        ap_version=args.ap_version)
-        toy=0)
    eval(
        args,
        data_args=data_args,

--- a/fluid/PaddleCV/object_detection/main_quant.py
+++ b/fluid/PaddleCV/object_detection/main_quant.py
@@ -85,7 +85,6 @@ def train(args,
    batch_size = train_params['batch_size']
    batch_size_per_device = batch_size // devices_num
-    iters_per_epoc = train_params["train_images"] // batch_size
    num_workers = 4
    startup_prog = fluid.Program()
@@ -134,22 +133,22 @@ def train(args,
                                train_file_list,
                                batch_size_per_device,
                                shuffle=is_shuffle,
-                                use_multiprocessing=True,
+                                num_workers=num_workers)
-                                num_workers=num_workers,
-                                max_queue=24)
    test_reader = reader.test(data_args, val_file_list, batch_size)
    train_py_reader.decorate_paddle_reader(train_reader)
    test_py_reader.decorate_paddle_reader(test_reader)
    train_py_reader.start()
    best_map = 0.
-    try:
+    for epoc in range(epoc_num):
-        for epoc in range(epoc_num):
+        if epoc == 0:
-            if epoc == 0:
+            # test quantized model without quantization-aware training.
-                # test quantized model without quantization-aware training.
+            test_map = test(exe, test_prog, map_eval, test_py_reader)
-                test_map = test(exe, test_prog, map_eval, test_py_reader)
+        batch = 0
-            # train
+        train_py_reader.start()
-            for batch in range(iters_per_epoc):
+        while True:
+            try:
+                # train
                start_time = time.time()
                if parallel:
                    outs = train_exe.run(fetch_list=[loss.name])
@@ -157,18 +156,19 @@ def train(args,
                    outs = exe.run(train_prog, fetch_list=[loss])
                end_time = time.time()
                avg_loss = np.mean(np.array(outs[0]))
-                if batch % 20 == 0:
+                if batch % 10 == 0:
                    print("Epoc {:d}, batch {:d}, loss {:.6f}, time {:.5f}".format(
                        epoc , batch, avg_loss, end_time - start_time))
-            end_time = time.time()
+            except (fluid.core.EOFException, StopIteration):
-            test_map = test(exe, test_prog, map_eval, test_py_reader)
+                train_reader().close()
-            save_model(exe, train_prog, model_save_dir, str(epoc))
+                train_py_reader.reset()
-            if test_map > best_map:
+                break
-                best_map = test_map
+        test_map = test(exe, test_prog, map_eval, test_py_reader)
-                save_model(exe, train_prog, model_save_dir, 'best_map')
+        save_model(exe, train_prog, model_save_dir, str(epoc))
-            print("Best test map {0}".format(best_map))
+        if test_map > best_map:
-    except (fluid.core.EOFException, StopIteration):
+            best_map = test_map
-        train_py_reader.reset()
+            save_model(exe, train_prog, model_save_dir, 'best_map')
+        print("Best test map {0}".format(best_map))
 def eval(args, data_args, configs, val_file_list):
@@ -212,6 +212,9 @@ def eval(args, data_args, configs, val_file_list):
    test_map = test(exe, test_prog, map_eval, test_py_reader)
    print("Test model {0}, map {1}".format(init_model, test_map))
+    # Convert the model to 8-bit before saving. Disabled for now because
+    # Paddle cannot yet load an 8-bit model for inference.
+    # transpiler.convert_to_int8(test_prog, place)
    fluid.io.save_inference_model(model_save_dir, [image.name],
                                  [nmsed_out], exe, test_prog)

--- a/fluid/PaddleCV/object_detection/reader.py
+++ b/fluid/PaddleCV/object_detection/reader.py
@@ -12,17 +12,17 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-import image_util
-from paddle.utils.image_util import *
-from PIL import Image
-from PIL import ImageDraw
-import numpy as np
 import xml.etree.ElementTree
 import os
 import time
 import copy
 import six
-from data_util import GeneratorEnqueuer
+import math
+import numpy as np
+from PIL import Image
+from PIL import ImageDraw
+import image_util
+import paddle
 class Settings(object):
@@ -162,24 +162,19 @@ def preprocess(img, bbox_labels, mode, settings):
    return img, sampled_labels
-def coco(settings, file_list, mode, batch_size, shuffle):
+def coco(settings, coco_api, file_list, mode, batch_size, shuffle, data_dir):
-    # cocoapi
    from pycocotools.coco import COCO
-    from pycocotools.cocoeval import COCOeval
-    coco = COCO(file_list)
-    image_ids = coco.getImgIds()
-    images = coco.loadImgs(image_ids)
-    print("{} on {} with {} images".format(mode, settings.dataset, len(images)))
    def reader():
        if mode == 'train' and shuffle:
-            np.random.shuffle(images)
+            np.random.shuffle(file_list)
        batch_out = []
-        for image in images:
+        for image in file_list:
            image_name = image['file_name']
-            image_path = os.path.join(settings.data_dir, image_name)
+            image_path = os.path.join(data_dir, image_name)
+            if not os.path.exists(image_path):
+                raise ValueError("%s does not exist; you should specify "
+                                 "the data path correctly." % image_path)
            im = Image.open(image_path)
            if im.mode == 'L':
                im = im.convert('RGB')
@@ -188,8 +183,8 @@ def coco(settings, file_list, mode, batch_size, shuffle):
            # layout: category_id | xmin | ymin | xmax | ymax | iscrowd
            bbox_labels = []
-            annIds = coco.getAnnIds(imgIds=image['id'])
+            annIds = coco_api.getAnnIds(imgIds=image['id'])
-            anns = coco.loadAnns(annIds)
+            anns = coco_api.loadAnns(annIds)
            for ann in anns:
                bbox_sample = []
                # start from 1, leave 0 to background
@@ -229,20 +224,18 @@ def coco(settings, file_list, mode, batch_size, shuffle):
 def pascalvoc(settings, file_list, mode, batch_size, shuffle):
-    flist = open(file_list)
-    images = [line.strip() for line in flist]
-    print("{} on {} with {} images".format(mode, settings.dataset, len(images)))
    def reader():
        if mode == 'train' and shuffle:
-            np.random.shuffle(images)
+            np.random.shuffle(file_list)
        batch_out = []
        cnt = 0
-        for image in images:
+        for image in file_list:
            image_path, label_path = image.split()
            image_path = os.path.join(settings.data_dir, image_path)
            label_path = os.path.join(settings.data_dir, label_path)
+            if not os.path.exists(image_path):
+                raise ValueError("%s does not exist; you should specify "
+                                 "the data path correctly." % image_path)
            im = Image.open(image_path)
            if im.mode == 'L':
                im = im.convert('RGB')
@@ -290,57 +283,62 @@ def train(settings,
          file_list,
          batch_size,
          shuffle=True,
-          use_multiprocessing=True,
          num_workers=8,
-          max_queue=24,
          enable_ce=False):
-    file_list = os.path.join(settings.data_dir, file_list)
+    file_path = os.path.join(settings.data_dir, file_list)
+    readers = []
    if 'coco' in settings.dataset:
-        generator = coco(settings, file_list, "train", batch_size, shuffle)
+        # cocoapi
+        from pycocotools.coco import COCO
+        coco_api = COCO(file_path)
+        image_ids = coco_api.getImgIds()
+        images = coco_api.loadImgs(image_ids)
+        n = int(math.ceil(len(images) / float(num_workers)))
+        image_lists = [images[i:i + n] for i in range(0, len(images), n)]
+        if '2014' in file_list:
+            sub_dir = "train2014"
+        elif '2017' in file_list:
+            sub_dir = "train2017"
+        data_dir = os.path.join(settings.data_dir, sub_dir)
+        for l in image_lists:
+            readers.append(
+                coco(settings, coco_api, l, 'train', batch_size, shuffle,
+                     data_dir))
    else:
-        generator = pascalvoc(settings, file_list, "train", batch_size, shuffle)
+        images = [line.strip() for line in open(file_path)]
+        n = int(math.ceil(len(images) / float(num_workers)))
-    def infinite_reader():
+        image_lists = [images[i:i + n] for i in range(0, len(images), n)]
-        while True:
+        for l in image_lists:
-            for data in generator():
+            readers.append(pascalvoc(settings, l, 'train', batch_size, shuffle))
-                yield data
-    def reader():
+    return paddle.reader.multiprocess_reader(readers, False)
-        try:
-            enqueuer = GeneratorEnqueuer(
-                infinite_reader(), use_multiprocessing=use_multiprocessing)
-            enqueuer.start(max_queue_size=max_queue, workers=num_workers)
-            generator_output = None
-            while True:
-                while enqueuer.is_running():
-                    if not enqueuer.queue.empty():
-                        generator_output = enqueuer.queue.get()
-                        break
-                    else:
-                        time.sleep(0.02)
-                yield generator_output
-                generator_output = None
-        finally:
-            if enqueuer is not None:
-                enqueuer.stop()
-    if enable_ce:
-        return infinite_reader
-    else:
-        return reader
 def test(settings, file_list, batch_size):
    file_list = os.path.join(settings.data_dir, file_list)
    if 'coco' in settings.dataset:
-        return coco(settings, file_list, 'test', batch_size, False)
+        from pycocotools.coco import COCO
+        coco_api = COCO(file_list)
+        image_ids = coco_api.getImgIds()
+        images = coco_api.loadImgs(image_ids)
+        if '2014' in file_list:
+            sub_dir = "val2014"
+        elif '2017' in file_list:
+            sub_dir = "val2017"
+        data_dir = os.path.join(settings.data_dir, sub_dir)
+        return coco(settings, coco_api, images, 'test', batch_size, False,
+                    data_dir)
    else:
-        return pascalvoc(settings, file_list, 'test', batch_size, False)
+        image_list = [line.strip() for line in open(file_list)]
+        return pascalvoc(settings, image_list, 'test', batch_size, False)
 def infer(settings, image_path):
    def reader():
+        if not os.path.exists(image_path):
+            raise ValueError("%s does not exist; you should specify "
+                             "the data path correctly." % image_path)
        img = Image.open(image_path)
        if img.mode == 'L':
            img = img.convert('RGB')
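
The `train()` change above splits the image list into one chunk per worker before handing the chunks to `paddle.reader.multiprocess_reader`. A minimal sketch of that splitting step follows, assuming true division before `math.ceil` so the chunk size can never be zero. `split_into_shards` is a hypothetical helper name used only for illustration; it is not part of `reader.py`.

```python
import math


def split_into_shards(items, num_workers):
    """Split items into at most num_workers contiguous chunks."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    if not items:
        return []
    # True division before ceil keeps the chunk size at 1 or more,
    # even when there are fewer items than workers.
    n = int(math.ceil(len(items) / float(num_workers)))
    return [items[i:i + n] for i in range(0, len(items), n)]
```

With ten items and four workers this gives chunks of size 3, 3, 3 and 1; with two items and eight workers it gives two single-item chunks rather than failing on a zero step.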

--- a/fluid/PaddleCV/object_detection/train.py
+++ b/fluid/PaddleCV/object_detection/train.py
@@ -105,7 +105,7 @@ def build_program(main_prog, startup_prog, train_params, is_train):
                with fluid.unique_name.guard("inference"):
                    nmsed_out = fluid.layers.detection_output(
                        locs, confs, box, box_var, nms_threshold=0.45)
-                    map_eval = fluid.evaluator.DetectionMAP(
+                    map_eval = fluid.metrics.DetectionMAP(
                        nmsed_out,
                        gt_label,
                        gt_box,
@@ -141,7 +141,6 @@ def train(args,
    batch_size = train_params['batch_size']
    epoc_num = train_params['epoc_num']
    batch_size_per_device = batch_size // devices_num
-    iters_per_epoc = train_params["train_images"] // batch_size
    num_workers = 8
    startup_prog = fluid.Program()
@@ -186,9 +185,7 @@ def train(args,
                                train_file_list,
                                batch_size_per_device,
                                shuffle=is_shuffle,
-                                use_multiprocessing=True,
                                num_workers=num_workers,
-                                max_queue=24,
                                enable_ce=enable_ce)
    test_reader = reader.test(data_args, val_file_list, batch_size)
    train_py_reader.decorate_paddle_reader(train_reader)
@@ -205,7 +202,7 @@ def train(args,
    def test(epoc_id, best_map):
        _, accum_map = map_eval.get_map_var()
        map_eval.reset(exe)
-        every_epoc_map=[]
+        every_epoc_map=[] # for CE
        test_py_reader.start()
        try:
            batch_id = 0
@@ -218,22 +215,23 @@ def train(args,
        except fluid.core.EOFException:
            test_py_reader.reset()
        mean_map = np.mean(every_epoc_map)
-        print("Epoc {0}, test map {1}".format(epoc_id, test_map))
+        print("Epoc {0}, test map {1}".format(epoc_id, test_map[0]))
        if test_map[0] > best_map:
            best_map = test_map[0]
            save_model('best_model', test_prog)
        return best_map, mean_map
-    train_py_reader.start()
    total_time = 0.0
-    try:
+    for epoc_id in range(epoc_num):
-        for epoc_id in range(epoc_num):
+        epoch_idx = epoc_id + 1
-            epoch_idx = epoc_id + 1
+        start_time = time.time()
-            start_time = time.time()
+        prev_start_time = start_time
-            prev_start_time = start_time
+        every_epoc_loss = []
-            every_epoc_loss = []
+        batch_id = 0
-            for batch_id in range(iters_per_epoc):
+        train_py_reader.start()
+        while True:
+            try:
                prev_start_time = start_time
                start_time = time.time()
                if parallel:
@@ -242,34 +240,35 @@ def train(args,
                    loss_v, = exe.run(train_prog, fetch_list=[loss])
                loss_v = np.mean(np.array(loss_v))
                every_epoc_loss.append(loss_v)
-                if batch_id % 20 == 0:
+                if batch_id % 10 == 0:
                    print("Epoc {:d}, batch {:d}, loss {:.6f}, time {:.5f}".format(
                        epoc_id, batch_id, loss_v, start_time - prev_start_time))
-            end_time = time.time()
+                batch_id += 1
-            total_time += end_time - start_time
+            except (fluid.core.EOFException, StopIteration):
+                train_reader().close()
-            best_map, mean_map = test(epoc_id, best_map)
+                train_py_reader.reset()
-            print("Best test map {0}".format(best_map))
+                break
-            if epoc_id % 10 == 0 or epoc_id == epoc_num - 1:
-                save_model(str(epoc_id), train_prog)
+        end_time = time.time()
+        total_time += end_time - start_time
-            if enable_ce and epoc_id == epoc_num - 1:
+        best_map, mean_map = test(epoc_id, best_map)
-                train_avg_loss = np.mean(every_epoc_loss)
+        print("Best test map {0}".format(best_map))
-                if devices_num == 1:
+        if epoc_id % 10 == 0 or epoc_id == epoc_num - 1:
-                    print("kpis	train_cost	%s" % train_avg_loss)
+            save_model(str(epoc_id), train_prog)
-                    print("kpis	test_acc	%s" % mean_map)
-                    print("kpis	train_speed	%s" % (total_time / epoch_idx))
+    if enable_ce:
-                else:
+        train_avg_loss = np.mean(every_epoc_loss)
-                    print("kpis	train_cost_card%s	%s" %
+        if devices_num == 1:
-                           (devices_num, train_avg_loss))
+            print("kpis	train_cost	%s" % train_avg_loss)
-                    print("kpis	test_acc_card%s	%s" %
+            print("kpis	test_acc	%s" % mean_map)
-                           (devices_num, mean_map))
+            print("kpis	train_speed	%s" % (total_time / epoch_idx))
-                    print("kpis	train_speed_card%s	%f" %
+        else:
-                           (devices_num, total_time / epoch_idx))
+            print("kpis	train_cost_card%s	%s" %
+                   (devices_num, train_avg_loss))
-    except (fluid.core.EOFException, StopIteration):
+            print("kpis	test_acc_card%s	%s" %
-        train_reader().close()
+                   (devices_num, mean_map))
-        train_py_reader.reset()
+            print("kpis	train_speed_card%s	%f" %
+                   (devices_num, total_time / epoch_idx))
 if __name__ == '__main__':
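
The training loop above no longer relies on a precomputed `iters_per_epoc`; each epoch starts the reader, runs until the reader signals end of data, then resets it. The control flow can be sketched without Paddle as below. Here a plain Python iterator stands in for `train_py_reader` and `StopIteration` stands in for `fluid.core.EOFException`; `run_epochs`, `make_batches` and `train_step` are hypothetical names for illustration only.

```python
def run_epochs(make_batches, epoch_num, train_step):
    """Run train_step on every batch for epoch_num epochs.

    Returns the number of batches processed in each epoch.
    """
    counts = []
    for epoch_id in range(epoch_num):
        batches = make_batches()      # stand-in for train_py_reader.start()
        batch_id = 0
        while True:
            try:
                batch = next(batches)
            except StopIteration:     # stand-in for fluid.core.EOFException
                break                 # stand-in for train_py_reader.reset()
            train_step(batch)
            batch_id += 1
        counts.append(batch_id)
    return counts
```

Because the loop ends on the end-of-data signal, the number of batches per epoch follows the data actually produced rather than a configured image count.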

--- a/fluid/PaddleCV/ocr_recognition/README.md
+++ b/fluid/PaddleCV/ocr_recognition/README.md
@@ -80,7 +80,7 @@
 During training, the options `--train_images` and `--train_list` point to the prepared `train_images` and `train_list`, respectively.
->**Note:** If neither `--train_images` nor `--train_list` is set, or both are set to None, ctc_reader.py automatically downloads the [sample data](http://paddle-ocr-data.bj.bcebos.com/data.tar.gz) and caches it under `$HOME/.cache/paddle/dataset/ctc_data/data/`.
+>**Note:** If neither `--train_images` nor `--train_list` is set, or both are set to None, reader.py automatically downloads the [sample data](http://paddle-ocr-data.bj.bcebos.com/data.tar.gz) and caches it under `$HOME/.cache/paddle/dataset/ctc_data/data/`.
 **B. Test and evaluation sets**
@@ -119,17 +119,17 @@ data/test_images/00003.jpg
 Train on a single GPU with the default data:
 ```
-env CUDA_VISIBLE_DEVICES=0 python ctc_train.py
+env CUDA_VISIBLE_DEVICES=0 python train.py
 ```
 Train on CPU with the default data:
 ```
-env OMP_NUM_THREADS=<num_of_physical_cores> python ctc_train.py --use_gpu False --parallel=False
+env OMP_NUM_THREADS=<num_of_physical_cores> python train.py --use_gpu False --parallel=False
 ```
 Train on multiple GPUs with the default data:
 ```
-env CUDA_VISIBLE_DEVICES=0,1,2,3 python ctc_train.py --parallel=True
+env CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --parallel=True
 ```
 The `CTC model` is used by default; pass `--model="attention"` to switch to the `attention model`.

--- a/fluid/PaddleNLP/chinese_ner/README.md
+++ b/fluid/PaddleNLP/chinese_ner/README.md
@@ -15,7 +15,14 @@
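 The data directory contains two folders: train_files holds the training data and test_files holds the test data. As an example, two files are placed in each folder. For real training, put your own data in the corresponding directory and adapt the data-reading functions in reader.py to your data format.
 ## Training
-Modify the `main` function of [train.py](./train.py) to specify the data path, then run `python train.py` to start training.
+Run
+```
+python train.py --help
+```
+to see help for the command-line arguments. After setting the data paths and other arguments correctly, run `train.py` to start training.
 The training log looks like this: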
 ```txt
@@ -31,7 +38,7 @@ pass_id:2, time_cost:0.740842103958s
 ```
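 ## Inference
-Modify the `infer` function of [infer.py](./infer.py) to specify the path of the model to test, the test data, and the path of the prediction label file, then run `python infer.py` to start inference.
+As with training, specify the path of the model to test, the test data, and the path of the prediction label file, then run `infer.py` to start inference.
 The prediction results look like this: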
 ```txt

--- a/fluid/PaddleNLP/chinese_ner/infer.py
+++ b/fluid/PaddleNLP/chinese_ner/infer.py
@@ -52,7 +52,7 @@ def parse_args():
 def print_arguments(args):
    print('-----------  Configuration Arguments -----------')
-    for arg, value in sorted(vars(args).iteritems()):
+    for arg, value in sorted(vars(args).items()):
        print('%s: %s' % (arg, value))
    print('------------------------------------------------')
@@ -61,6 +61,7 @@ def load_reverse_dict(dict_path):
    return dict((idx, line.strip().split("\t")[0])
                for idx, line in enumerate(open(dict_path, "r").readlines()))
 def to_lodtensor(data, place):
    seq_lens = [len(seq) for seq in data]
    cur_len = 0
@@ -76,7 +77,6 @@ def to_lodtensor(data, place):
    return res
 def infer(args):
    word = fluid.layers.data(name='word', shape=[1], dtype='int64', lod_level=1)
    mention = fluid.layers.data(
@@ -108,8 +108,8 @@ def infer(args):
                profiler.reset_profiler()
            iters = 0
            for data in test_data():
-                word = to_lodtensor(map(lambda x: x[0], data), place)
+                word = to_lodtensor(list(map(lambda x: x[0], data)), place)
-                mention = to_lodtensor(map(lambda x: x[1], data), place)
+                mention = to_lodtensor(list(map(lambda x: x[1], data)), place)
                start = time.time()
                crf_decode = exe.run(inference_program,
@@ -122,12 +122,12 @@ def infer(args):
                np_data = np.array(crf_decode[0])
                word_count = 0
                assert len(data) == len(lod_info) - 1
-                for sen_index in xrange(len(data)):
+                for sen_index in range(len(data)):
                    assert len(data[sen_index][0]) == lod_info[
                        sen_index + 1] - lod_info[sen_index]
                    word_index = 0
-                    for tag_index in xrange(lod_info[sen_index],
+                    for tag_index in range(lod_info[sen_index],
-                                            lod_info[sen_index + 1]):
+                                           lod_info[sen_index + 1]):
                        word = str(data[sen_index][0][word_index])
                        gold_tag = label_reverse_dict[data[sen_index][2][
                            word_index]]

--- a/fluid/PaddleNLP/chinese_ner/train.py
+++ b/fluid/PaddleNLP/chinese_ner/train.py
@@ -12,7 +12,7 @@ import reader
 def parse_args():
-    parser = argparse.ArgumentParser("Run inference.")
+    parser = argparse.ArgumentParser("Run training.")
    parser.add_argument(
        '--batch_size',
        type=int,
@@ -65,7 +65,7 @@ def parse_args():
 def print_arguments(args):
    print('-----------  Configuration Arguments -----------')
-    for arg, value in sorted(vars(args).iteritems()):
+    for arg, value in sorted(vars(args).items()):
        print('%s: %s' % (arg, value))
    print('------------------------------------------------')
@@ -220,9 +220,9 @@ def test2(exe, chunk_evaluator, inference_program, test_data, place,
          cur_fetch_list):
    chunk_evaluator.reset()
    for data in test_data():
-        word = to_lodtensor(map(lambda x: x[0], data), place)
+        word = to_lodtensor(list(map(lambda x: x[0], data)), place)
-        mention = to_lodtensor(map(lambda x: x[1], data), place)
+        mention = to_lodtensor(list(map(lambda x: x[1], data)), place)
-        target = to_lodtensor(map(lambda x: x[2], data), place)
+        target = to_lodtensor(list(map(lambda x: x[2], data)), place)
        result_list = exe.run(
            inference_program,
            feed={"word": word,
@@ -232,8 +232,9 @@ def test2(exe, chunk_evaluator, inference_program, test_data, place,
        number_infer = np.array(result_list[0])
        number_label = np.array(result_list[1])
        number_correct = np.array(result_list[2])
-        chunk_evaluator.update(number_infer[0], number_label[0],
+        chunk_evaluator.update(number_infer[0].astype('int64'),
-                               number_correct[0])
+                               number_label[0].astype('int64'),
+                               number_correct[0].astype('int64'))
    return chunk_evaluator.eval()
@@ -241,9 +242,9 @@ def test(test_exe, chunk_evaluator, inference_program, test_data, place,
         cur_fetch_list):
    chunk_evaluator.reset()
    for data in test_data():
-        word = to_lodtensor(map(lambda x: x[0], data), place)
+        word = to_lodtensor(list(map(lambda x: x[0], data)), place)
-        mention = to_lodtensor(map(lambda x: x[1], data), place)
+        mention = to_lodtensor(list(map(lambda x: x[1], data)), place)
-        target = to_lodtensor(map(lambda x: x[2], data), place)
+        target = to_lodtensor(list(map(lambda x: x[2], data)), place)
        result_list = test_exe.run(
            fetch_list=cur_fetch_list,
            feed={"word": word,
@@ -252,8 +253,9 @@ def test(test_exe, chunk_evaluator, inference_program, test_data, place,
        number_infer = np.array(result_list[0])
        number_label = np.array(result_list[1])
        number_correct = np.array(result_list[2])
-        chunk_evaluator.update(number_infer.sum(),
+        chunk_evaluator.update(number_infer.sum().astype('int64'),
-                               number_label.sum(), number_correct.sum())
+                               number_label.sum().astype('int64'),
+                               number_correct.sum().astype('int64'))
    return chunk_evaluator.eval()
@@ -270,11 +272,6 @@ def main(args):
        crf_decode = fluid.layers.crf_decoding(
            input=feature_out, param_attr=fluid.ParamAttr(name='crfw'))
-        inference_program = fluid.default_main_program().clone(for_test=True)
-        sgd_optimizer = fluid.optimizer.SGD(learning_rate=1e-3)
-        sgd_optimizer.minimize(avg_cost)
        (precision, recall, f1_score, num_infer_chunks, num_label_chunks,
         num_correct_chunks) = fluid.layers.chunk_eval(
             input=crf_decode,
@@ -282,6 +279,11 @@ def main(args):
             chunk_scheme="IOB",
             num_chunk_types=int(math.ceil((args.label_dict_len - 1) / 2.0)))
+        inference_program = fluid.default_main_program().clone(for_test=True)
+        sgd_optimizer = fluid.optimizer.SGD(learning_rate=1e-3)
+        sgd_optimizer.minimize(avg_cost)
        chunk_evaluator = fluid.metrics.ChunkEvaluator()
        train_reader = paddle.batch(
@@ -312,7 +314,7 @@ def main(args):
            test_exe = exe
        batch_id = 0
-        for pass_id in xrange(args.num_passes):
+        for pass_id in range(args.num_passes):
            chunk_evaluator.reset()
            train_reader_iter = train_reader()
            start_time = time.time()
@@ -326,9 +328,9 @@ def main(args):
                        ],
                        feed=feeder.feed(cur_batch))
                    chunk_evaluator.update(
-                        np.array(nums_infer).sum(),
+                        np.array(nums_infer).sum().astype("int64"),
-                        np.array(nums_label).sum(),
+                        np.array(nums_label).sum().astype("int64"),
-                        np.array(nums_correct).sum())
+                        np.array(nums_correct).sum().astype("int64"))
                    cost_list = np.array(cost)
                    batch_id += 1
                except StopIteration:
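
The `list(map(...))` changes above matter because `map` returns a one-shot iterator in Python 3, whereas it returned a list in Python 2. A minimal sketch of the difference follows, assuming the consumer makes more than one pass over its input (as a sequence-length pass followed by a flattening pass would). `seq_lens_and_flat` is a hypothetical stand-in, not the repository's `to_lodtensor`.

```python
data = [([1, 2], [0, 1]), ([5], [1])]


def seq_lens_and_flat(seqs):
    """Make two passes over seqs: sequence lengths, then flattened items."""
    lens = [len(s) for s in seqs]          # first pass
    flat = [x for s in seqs for x in s]    # second pass
    return lens, flat


# A bare map object is exhausted after the first pass in Python 3.
lazy_result = seq_lens_and_flat(map(lambda x: x[0], data))

# Materializing it as a list allows repeated iteration.
eager_result = seq_lens_and_flat(list(map(lambda x: x[0], data)))
```

Under Python 3 the lazy variant yields the right lengths but an empty flattened list, which is the kind of silent failure the `list(...)` wrapper avoids.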

--- a/fluid/PaddleNLP/deep_attention_matching_net/_ce.py
+++ b/fluid/PaddleNLP/deep_attention_matching_net/_ce.py
@@ -7,8 +7,8 @@ from kpi import CostKpi, DurationKpi, AccKpi
 #### NOTE: kpi.py should be shared across models in some way!
-train_cost_kpi = CostKpi('train_cost', 0.02, actived=True)
+train_cost_kpi = CostKpi('train_cost', 0.02, 0, actived=True)
-train_duration_kpi = DurationKpi('train_duration', 0.05, actived=True)
+train_duration_kpi = DurationKpi('train_duration', 0.05, 0, actived=True)
 tracking_kpis = [
    train_cost_kpi,

--- a/fluid/PaddleNLP/deep_attention_matching_net/train_and_evaluate.py
+++ b/fluid/PaddleNLP/deep_attention_matching_net/train_and_evaluate.py
@@ -248,8 +248,9 @@ def train(args):
    print("device count %d" % dev_count)
    print("theoretical memory usage: ")
-    print(fluid.contrib.memory_usage(
+    print(
-        program=train_program, batch_size=args.batch_size))
+        fluid.contrib.memory_usage(
+            program=train_program, batch_size=args.batch_size))
    exe = fluid.Executor(place)
    exe.run(train_startup)
@@ -318,8 +319,9 @@ def train(args):
            if (args.save_path is not None) and (step % save_step == 0):
                save_path = os.path.join(args.save_path, "step_" + str(step))
                print("Save model at step %d ... " % step)
-                print(time.strftime('%Y-%m-%d %H:%M:%S',
+                print(
-                                    time.localtime(time.time())))
+                    time.strftime('%Y-%m-%d %H:%M:%S',
+                                  time.localtime(time.time())))
                fluid.io.save_persistables(exe, save_path, train_program)
                score_path = os.path.join(args.save_path, 'score.' + str(step))
@@ -358,8 +360,9 @@ def train(args):
                    save_path = os.path.join(args.save_path,
                                             "step_" + str(step))
                    print("Save model at step %d ... " % step)
-                    print(time.strftime('%Y-%m-%d %H:%M:%S',
+                    print(
-                                        time.localtime(time.time())))
+                        time.strftime('%Y-%m-%d %H:%M:%S',
+                                      time.localtime(time.time())))
                    fluid.io.save_persistables(exe, save_path, train_program)
                    score_path = os.path.join(args.save_path,
@@ -389,7 +392,11 @@ def train(args):
            global_step, last_cost = train_with_pyreader(global_step)
        else:
            global_step, last_cost = train_with_feed(global_step)
-        train_time += time.time() - begin_time
+        pass_time_cost = time.time() - begin_time
+        train_time += pass_time_cost
+        print("Pass {0}, pass_time_cost {1}"
+              .format(epoch, "%2.2f sec" % pass_time_cost))
    # For internal continuous evaluation
    if "CE_MODE_X" in os.environ:
        print("kpis	train_cost	%f" % last_cost)

--- a/fluid/PaddleNLP/machine_reading_comprehension/_ce.py
+++ b/fluid/PaddleNLP/machine_reading_comprehension/_ce.py
@@ -3,6 +3,7 @@
 import os
 import sys
 #sys.path.insert(0, os.environ['ceroot'])
+sys.path.append(os.environ['ceroot'])
 from kpi import CostKpi, DurationKpi, AccKpi
 #### NOTE: kpi.py should be shared across models in some way!

--- a/fluid/PaddleNLP/machine_reading_comprehension/dataset.py
+++ b/fluid/PaddleNLP/machine_reading_comprehension/dataset.py
@@ -23,6 +23,7 @@ import json
 import logging
 import numpy as np
 from collections import Counter
+import io
 class BRCDataset(object):
@@ -67,7 +68,7 @@ class BRCDataset(object):
        Args:
            data_path: the data file to load
        """
-        with open(data_path) as fin:
+        with io.open(data_path, 'r', encoding='utf-8') as fin:
            data_set = []
            for lidx, line in enumerate(fin):
                sample = json.loads(line.strip())
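
The switch to `io.open(..., encoding='utf-8')` above makes the file decoding explicit instead of depending on the platform default, and it behaves the same way on Python 2 and Python 3. A minimal sketch of reading a JSON-lines file this way follows; `load_json_lines` is a hypothetical helper for illustration and is not the `BRCDataset` loader itself.

```python
import io
import json


def load_json_lines(path):
    """Read one JSON object per non-empty line, decoding the file as UTF-8."""
    samples = []
    with io.open(path, 'r', encoding='utf-8') as fin:
        for line in fin:
            line = line.strip()
            if line:  # skip blank lines
                samples.append(json.loads(line))
    return samples
```

This keeps non-ASCII text such as Chinese questions and answers intact regardless of the locale the script runs under.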

--- a/fluid/PaddleNLP/machine_reading_comprehension/run.py
+++ b/fluid/PaddleNLP/machine_reading_comprehension/run.py
@@ -22,6 +22,7 @@ import os
 import random
 import json
 import six
+import multiprocessing
 import paddle
 import paddle.fluid as fluid
@@ -445,7 +446,9 @@ def train(logger, args):
                            logger.info('Dev eval result: {}'.format(
                                bleu_rouge))
                pass_end_time = time.time()
+                time_consumed = pass_end_time - pass_start_time
+                logger.info('epoch: {0}, epoch_time_cost: {1:.2f}'.format(
+                    pass_id, time_consumed))
                logger.info('Evaluating the model after epoch {}'.format(
                    pass_id))
                if brc_data.dev_set is not None:
@@ -458,7 +461,7 @@ def train(logger, args):
                else:
                    logger.warning(
                        'No dev set is loaded for evaluation in the dataset!')
-                time_consumed = pass_end_time - pass_start_time
                logger.info('Average train loss for epoch {} is {}'.format(
                    pass_id, "%.10f" % (1.0 * total_loss / total_num)))

--- a/fluid/PaddleNLP/neural_machine_translation/transformer/train.py
+++ b/fluid/PaddleNLP/neural_machine_translation/transformer/train.py
@@ -408,10 +408,19 @@ def test_context(exe, train_exe, dev_count):
    test_data = prepare_data_generator(
        args, is_test=True, count=dev_count, pyreader=pyreader)
-    exe.run(startup_prog)
+    exe.run(startup_prog)  # to init pyreader for testing
+    if TrainTaskConfig.ckpt_path:
+        fluid.io.load_persistables(
+            exe, TrainTaskConfig.ckpt_path, main_program=test_prog)
+    exec_strategy = fluid.ExecutionStrategy()
+    exec_strategy.use_experimental_executor = True
+    build_strategy = fluid.BuildStrategy()
    test_exe = fluid.ParallelExecutor(
        use_cuda=TrainTaskConfig.use_gpu,
        main_program=test_prog,
+        build_strategy=build_strategy,
+        exec_strategy=exec_strategy,
        share_vars_from=train_exe)
    def test(exe=test_exe, pyreader=pyreader):
@@ -457,7 +466,11 @@ def train_loop(exe,
               nccl2_trainer_id=0):
    # Initialize the parameters.
    if TrainTaskConfig.ckpt_path:
-        fluid.io.load_persistables(exe, TrainTaskConfig.ckpt_path)
+        exe.run(startup_prog)  # to init pyreader for training
+        logging.info("load checkpoint from {}".format(
+            TrainTaskConfig.ckpt_path))
+        fluid.io.load_persistables(
+            exe, TrainTaskConfig.ckpt_path, main_program=train_prog)
    else:
        logging.info("init fluid.framework.default_startup_program")
        exe.run(startup_prog)
@@ -469,7 +482,7 @@ def train_loop(exe,
    # For faster executor
    exec_strategy = fluid.ExecutionStrategy()
    exec_strategy.use_experimental_executor = True
-    # exec_strategy.num_iteration_per_drop_scope = 5
+    exec_strategy.num_iteration_per_drop_scope = int(args.fetch_steps)
    build_strategy = fluid.BuildStrategy()
    # Since the token number differs among devices, customize gradient scale to
    # use token average cost among multi-devices. and the gradient scale is
@@ -741,6 +754,7 @@ if __name__ == "__main__":
    LOG_FORMAT = "[%(asctime)s %(levelname)s %(filename)s:%(lineno)d] %(message)s"
    logging.basicConfig(
        stream=sys.stdout, level=logging.DEBUG, format=LOG_FORMAT)
+    logging.getLogger().setLevel(logging.INFO)
    args = parse_args()
    train(args)
--- a/fluid/PaddleNLP/sequence_tagging_for_ner/infer.py
+++ b/fluid/PaddleNLP/sequence_tagging_for_ner/infer.py
@@ -38,12 +38,10 @@ def infer(model_path, batch_size, test_data_file, vocab_file, target_file,
        for data in test_data():
            word = to_lodtensor([x[0] for x in data], place)
            mark = to_lodtensor([x[1] for x in data], place)
-            target = to_lodtensor([x[2] for x in data], place)
            crf_decode = exe.run(
                inference_program,
                feed={"word": word,
-                      "mark": mark,
+                      "mark": mark},
-                      "target": target},
                fetch_list=fetch_targets,
                return_numpy=False)
            lod_info = (crf_decode[0].lod())[0]

--- a/fluid/PaddleNLP/sequence_tagging_for_ner/train.py
+++ b/fluid/PaddleNLP/sequence_tagging_for_ner/train.py
@@ -30,7 +30,9 @@ def test(exe, chunk_evaluator, inference_program, test_data, test_fetch_list,
        num_infer = np.array(rets[0])
        num_label = np.array(rets[1])
        num_correct = np.array(rets[2])
-        chunk_evaluator.update(num_infer[0], num_label[0], num_correct[0])
+        chunk_evaluator.update(num_infer[0].astype('int64'),
+                               num_label[0].astype('int64'),
+                               num_correct[0].astype('int64'))
    return chunk_evaluator.eval()
@@ -61,9 +63,6 @@ def main(train_data_file,
    avg_cost, feature_out, word, mark, target = ner_net(
        word_dict_len, label_dict_len, parallel)
-    sgd_optimizer = fluid.optimizer.SGD(learning_rate=1e-3)
-    sgd_optimizer.minimize(avg_cost)
    crf_decode = fluid.layers.crf_decoding(
        input=feature_out, param_attr=fluid.ParamAttr(name='crfw'))
@@ -77,6 +76,8 @@ def main(train_data_file,
    inference_program = fluid.default_main_program().clone(for_test=True)
    test_fetch_list = [num_infer_chunks, num_label_chunks, num_correct_chunks]
+    sgd_optimizer = fluid.optimizer.SGD(learning_rate=1e-3)
+    sgd_optimizer.minimize(avg_cost)
    if "CE_MODE_X" not in os.environ:
        train_reader = paddle.batch(
@@ -135,7 +136,7 @@ def main(train_data_file,
              " pass_f1_score:" + str(test_pass_f1_score))
        save_dirname = os.path.join(model_save_dir, "params_pass_%d" % pass_id)
-        fluid.io.save_inference_model(save_dirname, ['word', 'mark', 'target'],
+        fluid.io.save_inference_model(save_dirname, ['word', 'mark'],
                                      crf_decode, exe)
    if "CE_MODE_X" in os.environ:

--- a/fluid/PaddleNLP/text_classification/README.md
+++ b/fluid/PaddleNLP/text_classification/README.md
@@ -14,7 +14,7 @@
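 ## Introduction and model details
-The PaddlePaddle v2 [text classification](https://github.com/PaddlePaddle/models/blob/develop/text/README.md) example describes the text classification task in detail, so that description is not repeated here.
+The PaddlePaddle v2 [text classification](https://github.com/PaddlePaddle/models/blob/develop/legacy/text_classification/README.md) example describes the text classification task in detail, so that description is not repeated here.
 For the models, we use four common text classification models: bow, cnn, lstm, and gru.
 ## Training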

--- a/fluid/PaddleNLP/text_classification/async_executor/README.md
+++ b/fluid/PaddleNLP/text_classification/async_executor/README.md
+# Text classification
+The directory structure of this example is outlined below:
+```text
+.
+|-- README.md               # README
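+|-- data_generator          # IMDB dataset generation tools
+|   |-- IMDB.py             # Extends data_generator.py with IMDB-specific processing logic
+|   |-- build_raw_data.py   # Preprocesses IMDB data; its output is read by splitfile.py. Format: word word ... | label
+|   |-- data_generator.py   # Data generation framework designed to work with AsyncExecutor
+|   `-- splitfile.py        # Splits the file produced by build_raw_data.py; its output is read by IMDB.py
+|-- data_generator.sh       # Entry point for the IMDB dataset generation tools
+|-- data_reader.py          # Data reading utility used by the inference script
+|-- infer.py                # Inference script
+`-- train.py                # Training script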
+```
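The raw-data format noted for build_raw_data.py in the tree above (`word word ... | label`) can be illustrated with a small standalone sketch. It mirrors the tokenization that `IMDB.py` in this same change performs in `process`, but it is not the repository's code: `parse_raw_line` and `toy_vocab` are hypothetical names introduced here, and the four-word vocabulary is made up for the example.

```python
import re

# Same delimiter pattern that IMDB.py compiles in load_resource().
PATTERN = re.compile(r'(;|,|\.|\?|!|\s|\(|\))')


def parse_raw_line(line, vocab):
    """Turn one 'word word ... | label' line into (word ids, label)."""
    unk_id = len(vocab)  # every out-of-vocabulary token shares one extra id
    # Everything before the last '|' is text; the part after it is the label.
    text, _, label = line.rpartition('|')
    text = text.lower().replace("<br />", " ").strip()
    # The capturing group keeps the delimiters, so punctuation becomes a token;
    # empty strings and single spaces are dropped.
    words = [w for w in PATTERN.split(text) if w and w != " "]
    ids = [vocab.get(w, unk_id) for w in words]
    return ids, int(label)


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"the": 0, "movie": 1, "was": 2, "great": 3}
ids, label = parse_raw_line("The movie was great! | 1", toy_vocab)
print(ids, label)
```

Because the pattern uses a capturing group, punctuation such as `!` survives as its own token and maps to the unknown id unless it appears in the vocabulary.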
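+## Introduction
+This directory contains scripts that train a text classification task with fluid.AsyncExecutor. The network definitions are reused from nets.py in the parent directory.
+## Training
+1. Run `sh data_generator.sh` to download the IMDB dataset and convert it into training data that AsyncExecutor can read.
+2. Run `python train.py bow` to start training the model.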
+    ```shell
+    python train.py bow    # bow selects the network structure; cnn, lstm, or gru can be used instead
+    ```
+3. (Optional) To use a custom network structure, add it to [nets.py](../nets.py) and set the corresponding parameters in [train.py](./train.py).
+    ```python
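+    def train(train_reader,     # training data
+        word_dict,              # word dictionary
+        network,                # model configuration
+        use_cuda,               # whether to use the GPU
+        parallel,               # whether to run in parallel
+        save_dirname,           # path for saving the model
+        lr=0.2,                 # learning rate
+        batch_size=128,         # number of samples per batch
+        pass_num=30):           # number of training passes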
+    ```
+## Sample training output
+```text
+pass_id: 0 pass_time_cost 4.723438
+pass_id: 1 pass_time_cost 3.867186
+pass_id: 2 pass_time_cost 4.490111
+pass_id: 3 pass_time_cost 4.573296
+pass_id: 4 pass_time_cost 4.180547
+pass_id: 5 pass_time_cost 4.214476
+pass_id: 6 pass_time_cost 4.520387
+pass_id: 7 pass_time_cost 4.149485
+pass_id: 8 pass_time_cost 3.821354
+pass_id: 9 pass_time_cost 5.136178
+pass_id: 10 pass_time_cost 4.137318
+pass_id: 11 pass_time_cost 3.943429
+pass_id: 12 pass_time_cost 3.766478
+pass_id: 13 pass_time_cost 4.235983
+pass_id: 14 pass_time_cost 4.796462
+pass_id: 15 pass_time_cost 4.668116
+pass_id: 16 pass_time_cost 4.373798
+pass_id: 17 pass_time_cost 4.298131
+pass_id: 18 pass_time_cost 4.260021
+pass_id: 19 pass_time_cost 4.244411
+pass_id: 20 pass_time_cost 3.705138
+pass_id: 21 pass_time_cost 3.728070
+pass_id: 22 pass_time_cost 3.817919
+pass_id: 23 pass_time_cost 4.698598
+pass_id: 24 pass_time_cost 4.859262
+pass_id: 25 pass_time_cost 5.725732
+pass_id: 26 pass_time_cost 5.102599
+pass_id: 27 pass_time_cost 3.876582
+pass_id: 28 pass_time_cost 4.762538
+pass_id: 29 pass_time_cost 3.797759
+```
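A log in the `pass_id: N pass_time_cost T` form shown above is easy to summarize after the fact. The following is a minimal sketch, assuming the log has been captured as text; `summarize_pass_times` is a hypothetical helper and not part of the repository, and the sample uses only the first three lines of the output above.

```python
import re

# Matches lines such as "pass_id: 3 pass_time_cost 4.573296".
LINE_RE = re.compile(r"pass_id:\s*(\d+)\s+pass_time_cost\s+([0-9.]+)")


def summarize_pass_times(log_text):
    """Return (number of passes, mean seconds per pass, id of the slowest pass)."""
    times = {}
    for match in LINE_RE.finditer(log_text):
        times[int(match.group(1))] = float(match.group(2))
    if not times:
        raise ValueError("no pass_time_cost lines found")
    mean = sum(times.values()) / len(times)
    slowest = max(times, key=times.get)
    return len(times), mean, slowest


sample_log = """\
pass_id: 0 pass_time_cost 4.723438
pass_id: 1 pass_time_cost 3.867186
pass_id: 2 pass_time_cost 4.490111
"""
count, mean_seconds, slowest_pass = summarize_pass_times(sample_log)
print(count, mean_seconds, slowest_pass)
```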
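+Unlike fluid.Executor, AsyncExecutor does not print the accuracy at the end of each pass. To monitor training, set the `debug` argument of fluid.AsyncExecutor.run() to True; the fetch variables passed to the call are then printed at the end of each pass: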
+```
+async_executor.run(
+    main_program,
+    dataset,
+    filelist,
+    thread_num,
+    [acc],
+    debug=True)
+```
+## Inference
+1. Run `python infer.py bow_model` to start inference.
+    ```shell
+    python infer.py bow_model     # bow_model is the model directory to load
+    ```
+## Sample inference output
+```text
+model_path: bow_model/epoch0.model, avg_acc: 0.882600
+model_path: bow_model/epoch1.model, avg_acc: 0.887920
+model_path: bow_model/epoch2.model, avg_acc: 0.886920
+model_path: bow_model/epoch3.model, avg_acc: 0.884720
+model_path: bow_model/epoch4.model, avg_acc: 0.879760
+model_path: bow_model/epoch5.model, avg_acc: 0.876920
+model_path: bow_model/epoch6.model, avg_acc: 0.874160
+model_path: bow_model/epoch7.model, avg_acc: 0.872000
+model_path: bow_model/epoch8.model, avg_acc: 0.870360
+model_path: bow_model/epoch9.model, avg_acc: 0.868480
+model_path: bow_model/epoch10.model, avg_acc: 0.867240
+model_path: bow_model/epoch11.model, avg_acc: 0.866200
+model_path: bow_model/epoch12.model, avg_acc: 0.865560
+model_path: bow_model/epoch13.model, avg_acc: 0.865160
+model_path: bow_model/epoch14.model, avg_acc: 0.864480
+model_path: bow_model/epoch15.model, avg_acc: 0.864240
+model_path: bow_model/epoch16.model, avg_acc: 0.863800
+model_path: bow_model/epoch17.model, avg_acc: 0.863520
+model_path: bow_model/epoch18.model, avg_acc: 0.862760
+model_path: bow_model/epoch19.model, avg_acc: 0.862680
+model_path: bow_model/epoch20.model, avg_acc: 0.862240
+model_path: bow_model/epoch21.model, avg_acc: 0.862280
+model_path: bow_model/epoch22.model, avg_acc: 0.862080
+model_path: bow_model/epoch23.model, avg_acc: 0.861560
+model_path: bow_model/epoch24.model, avg_acc: 0.861280
+model_path: bow_model/epoch25.model, avg_acc: 0.861160
+model_path: bow_model/epoch26.model, avg_acc: 0.861080
+model_path: bow_model/epoch27.model, avg_acc: 0.860920
+model_path: bow_model/epoch28.model, avg_acc: 0.860800
+model_path: bow_model/epoch29.model, avg_acc: 0.860760
+```
+Note: overfitting causes the accuracy to decline steadily; please ignore this.
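Since the accuracy in the output above peaks early and then declines, one practical use of such a log is to pick the checkpoint with the highest `avg_acc`. The sketch below assumes the `model_path: ..., avg_acc: ...` line format shown above; `best_checkpoint` is a hypothetical helper, not something provided by `infer.py`, and the sample uses only the first three lines of that output.

```python
import re

# Matches lines such as "model_path: bow_model/epoch1.model, avg_acc: 0.887920".
LINE_RE = re.compile(r"model_path:\s*(\S+),\s*avg_acc:\s*([0-9.]+)")


def best_checkpoint(log_text):
    """Return (model_path, avg_acc) for the highest-accuracy line in the log."""
    results = [(float(acc), path) for path, acc in LINE_RE.findall(log_text)]
    if not results:
        raise ValueError("no avg_acc lines found")
    acc, path = max(results)  # compares by accuracy first
    return path, acc


sample_log = """\
model_path: bow_model/epoch0.model, avg_acc: 0.882600
model_path: bow_model/epoch1.model, avg_acc: 0.887920
model_path: bow_model/epoch2.model, avg_acc: 0.886920
"""
best_path, best_acc = best_checkpoint(sample_log)
print(best_path, best_acc)
```

Selecting on the same data that produced the reported accuracy is optimistic; a separate validation split would be the more careful choice if one is available.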
--- a/fluid/PaddleNLP/text_classification/async_executor/data_generator.sh
+++ b/fluid/PaddleNLP/text_classification/async_executor/data_generator.sh
+#!/usr/bin/env bash
+# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+pushd .
+cd ./data_generator
+# wget "http://ai.stanford.edu/%7Eamaas/data/sentiment/aclImdb_v1.tar.gz"
+if [ ! -f aclImdb_v1.tar.gz ]; then
+    wget "http://10.64.74.104:8080/paddle/dataset/imdb/aclImdb_v1.tar.gz"
+fi
+tar zxvf aclImdb_v1.tar.gz
+mkdir train_data
+python build_raw_data.py train | python splitfile.py 12 train_data
+mkdir test_data
+python build_raw_data.py test | python splitfile.py 12 test_data
+/opt/python27/bin/python IMDB.py train_data
+/opt/python27/bin/python IMDB.py test_data
+mv ./output_dataset/train_data ../
+mv ./output_dataset/test_data ../
+cp aclImdb/imdb.vocab ../
+rm -rf ./output_dataset
+rm -rf train_data
+rm -rf test_data
+rm -rf aclImdb
+popd
--- a/fluid/PaddleNLP/text_classification/async_executor/data_generator/IMDB.py
+++ b/fluid/PaddleNLP/text_classification/async_executor/data_generator/IMDB.py
+#   Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import re
+import os, sys
+sys.path.append(os.path.abspath(os.path.join('..')))
+from data_generator import MultiSlotDataGenerator
+class IMDbDataGenerator(MultiSlotDataGenerator):
+    def load_resource(self, dictfile):
+        self._vocab = {}
+        wid = 0
+        with open(dictfile) as f:
+            for line in f:
+                self._vocab[line.strip()] = wid
+                wid += 1
+        self._unk_id = len(self._vocab)
+        self._pattern = re.compile(r'(;|,|\.|\?|!|\s|\(|\))')
+    def process(self, line):
+        send = '|'.join(line.split('|')[:-1]).lower().replace("<br />",
+                                                              " ").strip()
+        label = [int(line.split('|')[-1])]
+        words = [x for x in self._pattern.split(send) if x and x != " "]
+        feas = [
+            self._vocab[x] if x in self._vocab else self._unk_id for x in words
+        ]
+        return ("words", feas), ("label", label)
+imdb = IMDbDataGenerator()
+imdb.load_resource("aclImdb/imdb.vocab")
+# data from files
+file_names = os.listdir(sys.argv[1])
+filelist = []
+for i in range(0, len(file_names)):
+    filelist.append(os.path.join(sys.argv[1], file_names[i]))
+line_limit = 2500
+process_num = 24
+imdb.run_from_files(
+    filelist=filelist,
+    line_limit=line_limit,
+    process_num=process_num,
+    output_dir=('output_dataset/%s' % (sys.argv[1])))
--- a/fluid/PaddleNLP/text_classification/async_executor/data_generator/data_generator.py
+++ b/fluid/PaddleNLP/text_classification/async_executor/data_generator/data_generator.py
--- a/fluid/PaddleNLP/text_classification/async_executor/data_generator/splitfile.py
+++ b/fluid/PaddleNLP/text_classification/async_executor/data_generator/splitfile.py
+#   Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Split file into parts
+"""
+import sys
+import os
+block = int(sys.argv[1])
+datadir = sys.argv[2]
+file_list = []
+for i in range(block):
+    file_list.append(open(datadir + "/part-" + str(i), "w"))
+id_ = 0
+for line in sys.stdin:
+    file_list[id_ % block].write(line)
+    id_ += 1
+for f in file_list:
+    f.close()
--- a/fluid/PaddleNLP/text_classification/async_executor/data_reader.py
+++ b/fluid/PaddleNLP/text_classification/async_executor/data_reader.py
--- a/fluid/PaddleNLP/text_classification/async_executor/infer.py
+++ b/fluid/PaddleNLP/text_classification/async_executor/infer.py
--- a/fluid/PaddleNLP/text_classification/async_executor/train.py
+++ b/fluid/PaddleNLP/text_classification/async_executor/train.py
--- a/fluid/PaddleNLP/text_classification/train.py
+++ b/fluid/PaddleNLP/text_classification/train.py
--- a/fluid/PaddleNLP/text_matching_on_quora/.run_ce.sh
+++ b/fluid/PaddleNLP/text_matching_on_quora/.run_ce.sh
--- a/fluid/PaddleNLP/text_matching_on_quora/__init__.py
+++ b/fluid/PaddleNLP/text_matching_on_quora/__init__.py
--- a/fluid/PaddleNLP/text_matching_on_quora/_ce.py
+++ b/fluid/PaddleNLP/text_matching_on_quora/_ce.py
--- a/fluid/PaddleNLP/text_matching_on_quora/pretrained_word2vec.py
+++ b/fluid/PaddleNLP/text_matching_on_quora/pretrained_word2vec.py
--- a/fluid/PaddleNLP/text_matching_on_quora/quora_question_pairs.py
+++ b/fluid/PaddleNLP/text_matching_on_quora/quora_question_pairs.py
--- a/fluid/PaddleNLP/text_matching_on_quora/train_and_evaluate.py
+++ b/fluid/PaddleNLP/text_matching_on_quora/train_and_evaluate.py
--- a/fluid/PaddleRec/ctr/data/download.sh
+++ b/fluid/PaddleRec/ctr/data/download.sh
--- a/fluid/PaddleRec/ctr/network_conf.py
+++ b/fluid/PaddleRec/ctr/network_conf.py
--- a/fluid/PaddleRec/ctr/preprocess.py
+++ b/fluid/PaddleRec/ctr/preprocess.py
--- a/fluid/PaddleRec/gru4rec/README.md
+++ b/fluid/PaddleRec/gru4rec/README.md
--- a/fluid/PaddleRec/gru4rec/infer_sample_neg.py
+++ b/fluid/PaddleRec/gru4rec/infer_sample_neg.py
--- a/fluid/PaddleRec/gru4rec/net.py
+++ b/fluid/PaddleRec/gru4rec/net.py
--- a/fluid/PaddleRec/ssr/README.md
+++ b/fluid/PaddleRec/ssr/README.md
--- a/fluid/PaddleRec/ssr/infer.py
+++ b/fluid/PaddleRec/ssr/infer.py
--- a/fluid/PaddleRec/ssr/nets.py
+++ b/fluid/PaddleRec/ssr/nets.py
--- a/fluid/PaddleRec/ssr/reader.py
+++ b/fluid/PaddleRec/ssr/reader.py
--- a/fluid/PaddleRec/ssr/test_data/small_test.txt
+++ b/fluid/PaddleRec/ssr/test_data/small_test.txt
--- a/fluid/PaddleRec/ssr/train.py
+++ b/fluid/PaddleRec/ssr/train.py
--- a/fluid/PaddleRec/ssr/train_data/small_train.txt
+++ b/fluid/PaddleRec/ssr/train_data/small_train.txt
--- a/fluid/PaddleRec/ssr/utils.py
+++ b/fluid/PaddleRec/ssr/utils.py
--- a/fluid/PaddleRec/ssr/vocab.txt
+++ b/fluid/PaddleRec/ssr/vocab.txt
--- a/fluid/PaddleRec/tagspace/train.py
+++ b/fluid/PaddleRec/tagspace/train.py
--- a/fluid/PaddleRec/word2vec/README.cn.md
+++ b/fluid/PaddleRec/word2vec/README.cn.md
--- a/fluid/PaddleRec/word2vec/README.md
+++ b/fluid/PaddleRec/word2vec/README.md
--- a/fluid/PaddleRec/word2vec/data/download.sh
+++ b/fluid/PaddleRec/word2vec/data/download.sh
--- a/fluid/PaddleRec/word2vec/infer.py
+++ b/fluid/PaddleRec/word2vec/infer.py
--- a/fluid/PaddleRec/word2vec/network_conf.py
+++ b/fluid/PaddleRec/word2vec/network_conf.py
--- a/fluid/PaddleRec/word2vec/preprocess.py
+++ b/fluid/PaddleRec/word2vec/preprocess.py
--- a/fluid/PaddleRec/word2vec/reader.py
+++ b/fluid/PaddleRec/word2vec/reader.py
--- a/fluid/PaddleRec/word2vec/train.py
+++ b/fluid/PaddleRec/word2vec/train.py
--- a/legacy/README.md
+++ b/legacy/README.md