Merge branch 'develop' of https://github.com/PaddlePaddle/PaddleSeg into develop

2213cff9 · chulutao · c8d22fae · d6eeca8c
36 changed files
--- a/README.md
+++ b/README.md
@@ -94,6 +94,7 @@ pip install -r requirements.txt
 * [ICNet model tutorial](./turtorial/finetune_icnet.md)
 * [PSPNet model tutorial](./turtorial/finetune_pspnet.md)
 * [HRNet model tutorial](./turtorial/finetune_hrnet.md)
+* [Fast-SCNN model tutorial](./turtorial/finetune_fast_scnn.md)
 ### Prediction and deployment
@@ -109,7 +110,7 @@ pip install -r requirements.txt
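 * [How to handle class imbalance in binary segmentation](./docs/loss_select.md)
 * [Using the featured vertical-domain models](./contrib)
 * [Multi-process training and mixed-precision training](./docs/multiple_gpus_train_and_mixed_precision_train.md)
+* Compressing segmentation models with PaddleSlim ([quantization](./slim/quantization/README.md), [distillation](./slim/distillation/README.md), [pruning](./slim/prune/README.md), [search](./slim/nas/README.md))
 ## Online demo
 We provide online tutorials on the AI Studio platform; you are welcome to try them: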

--- a/configs/cityscape_fast_scnn.yaml
+++ b/configs/cityscape_fast_scnn.yaml
+EVAL_CROP_SIZE: (2048, 1024) # (width, height), for unpadding rangescaling and stepscaling
+TRAIN_CROP_SIZE: (1024, 1024) # (width, height), for unpadding rangescaling and stepscaling
+AUG:
+    AUG_METHOD: "stepscaling" # one of: unpadding, rangescaling, stepscaling
+    FIX_RESIZE_SIZE: (640, 640) # (width, height), for unpadding
+    INF_RESIZE_VALUE: 500  # for rangescaling
+    MAX_RESIZE_VALUE: 600  # for rangescaling
+    MIN_RESIZE_VALUE: 400  # for rangescaling
+    MAX_SCALE_FACTOR: 2.0  # for stepscaling
+    MIN_SCALE_FACTOR: 0.5  # for stepscaling
+    SCALE_STEP_SIZE: 0.25  # for stepscaling
+    MIRROR: True
+    FLIP: False
+    FLIP_RATIO: 0.2
+    RICH_CROP:
+        ENABLE: True
+        ASPECT_RATIO: 0.0
+        BLUR: False
+        BLUR_RATIO: 0.1
+        MAX_ROTATION: 0
+        MIN_AREA_RATIO: 0.0
+        BRIGHTNESS_JITTER_RATIO: 0.4
+        CONTRAST_JITTER_RATIO: 0.4
+        SATURATION_JITTER_RATIO: 0.4
+BATCH_SIZE: 12
+MEAN: [0.5, 0.5, 0.5]
+STD: [0.5, 0.5, 0.5]
+DATASET:
+    DATA_DIR: "./dataset/cityscapes/"
+    IMAGE_TYPE: "rgb"  # one of: rgb, rgba
+    NUM_CLASSES: 19
+    TEST_FILE_LIST: "dataset/cityscapes/val.list"
+    TRAIN_FILE_LIST: "dataset/cityscapes/train.list"
+    VAL_FILE_LIST: "dataset/cityscapes/val.list"
+    IGNORE_INDEX: 255
+FREEZE:
+    MODEL_FILENAME: "model"
+    PARAMS_FILENAME: "params"
+MODEL:
+    DEFAULT_NORM_TYPE: "bn"
+    MODEL_NAME: "fast_scnn"
+TEST:
+    TEST_MODEL: "snapshots/cityscape_fast_scnn/final/"
+TRAIN:
+    MODEL_SAVE_DIR: "snapshots/cityscape_fast_scnn/"
+    SNAPSHOT_EPOCH: 10
+SOLVER:
+    LR: 0.001
+    LR_POLICY: "poly"
+    OPTIMIZER: "sgd"
+    NUM_EPOCHS: 100
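A note on the `stepscaling` settings in this config. The following is a minimal sketch, assuming the augmentation draws one scale factor uniformly from an evenly spaced grid between `MIN_SCALE_FACTOR` and `MAX_SCALE_FACTOR` with spacing `SCALE_STEP_SIZE`. The helper names `candidate_scales` and `pick_scale` are hypothetical and are not PaddleSeg functions.

```python
import random

import numpy as np


def candidate_scales(min_scale, max_scale, step):
    """Evenly spaced scale factors from min_scale to max_scale, inclusive."""
    num_steps = int(round((max_scale - min_scale) / step)) + 1
    return np.linspace(min_scale, max_scale, num_steps).tolist()


def pick_scale(min_scale, max_scale, step, rng=random):
    """Draw one scale factor uniformly from the grid (assumed behaviour)."""
    return rng.choice(candidate_scales(min_scale, max_scale, step))


# Values taken from configs/cityscape_fast_scnn.yaml
scales = candidate_scales(0.5, 2.0, 0.25)
print(scales)
```

With these values the grid has seven entries, so each training image would be rescaled by one of seven factors before the 1024x1024 crop is taken.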
--- a/configs/fast_scnn_pet.yaml
+++ b/configs/fast_scnn_pet.yaml
+TRAIN_CROP_SIZE: (512, 512) # (width, height), for unpadding rangescaling and stepscaling
+EVAL_CROP_SIZE: (512, 512) # (width, height), for unpadding rangescaling and stepscaling
+AUG:
+    AUG_METHOD: "unpadding" # one of: unpadding, rangescaling, stepscaling
+    FIX_RESIZE_SIZE: (512, 512) # (width, height), for unpadding
+    INF_RESIZE_VALUE: 500  # for rangescaling
+    MAX_RESIZE_VALUE: 600  # for rangescaling
+    MIN_RESIZE_VALUE: 400  # for rangescaling
+    MAX_SCALE_FACTOR: 1.25  # for stepscaling
+    MIN_SCALE_FACTOR: 0.75  # for stepscaling
+    SCALE_STEP_SIZE: 0.25  # for stepscaling
+    MIRROR: True
+BATCH_SIZE: 4
+DATASET:
+    DATA_DIR: "./dataset/mini_pet/"
+    IMAGE_TYPE: "rgb"  # one of: rgb, rgba
+    NUM_CLASSES: 3
+    TEST_FILE_LIST: "./dataset/mini_pet/file_list/test_list.txt"
+    TRAIN_FILE_LIST: "./dataset/mini_pet/file_list/train_list.txt"
+    VAL_FILE_LIST: "./dataset/mini_pet/file_list/val_list.txt"
+    VIS_FILE_LIST: "./dataset/mini_pet/file_list/test_list.txt"
+    IGNORE_INDEX: 255
+    SEPARATOR: " "
+FREEZE:
+    MODEL_FILENAME: "__model__"
+    PARAMS_FILENAME: "__params__"
+MODEL:
+    MODEL_NAME: "fast_scnn"
+    DEFAULT_NORM_TYPE: "bn"
+TRAIN:
+    PRETRAINED_MODEL_DIR: "./pretrained_model/fast_scnn_cityscape/"
+    MODEL_SAVE_DIR: "./saved_model/fast_scnn_pet/"
+    SNAPSHOT_EPOCH: 10
+TEST:
+    TEST_MODEL: "./saved_model/fast_scnn_pet/final"
+SOLVER:
+    NUM_EPOCHS: 100
+    LR: 0.005
+    LR_POLICY: "poly"
+    OPTIMIZER: "sgd"
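Both Fast-SCNN configs select `LR_POLICY: "poly"`. As a rough sketch of what polynomial decay does to the learning rate, assuming a decay power of 0.9 and a final rate of 0 (neither value appears in this config; both are illustrative), and with `poly_lr` as a hypothetical helper name:

```python
def poly_lr(base_lr, step, total_steps, power=0.9, end_lr=0.0):
    """Polynomial learning-rate decay from base_lr down to end_lr.

    `power` and `end_lr` are assumptions for illustration only.
    """
    frac = min(step, total_steps) / float(total_steps)
    return (base_lr - end_lr) * (1.0 - frac) ** power + end_lr


# LR taken from configs/fast_scnn_pet.yaml; total_steps is made up.
base_lr = 0.005
total_steps = 1000
schedule = [poly_lr(base_lr, s, total_steps) for s in (0, 250, 500, 750, 1000)]
print(schedule)
```

The rate starts at `LR`, decreases monotonically, and reaches the final rate at the last step.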
--- a/docs/model_zoo.md
+++ b/docs/model_zoo.md
@@ -63,3 +63,6 @@ The training set is the Cityscapes training set; testing uses the Cityscapes validation set
 | PSPNet/bn | Cityscapes |[pspnet50_cityscapes.tgz](https://paddleseg.bj.bcebos.com/models/pspnet50_cityscapes.tgz) |16|false| 0.7013 |
 | PSPNet/bn | Cityscapes |[pspnet101_cityscapes.tgz](https://paddleseg.bj.bcebos.com/models/pspnet101_cityscapes.tgz) |16|false| 0.7734 |
 | HRNet_W18/bn | Cityscapes |[hrnet_w18_bn_cityscapes.tgz](https://paddleseg.bj.bcebos.com/models/hrnet_w18_bn_cityscapes.tgz) | 4 | false | 0.7936 |
+| Fast-SCNN/bn | Cityscapes |[fast_scnn_cityscapes.tar](https://paddleseg.bj.bcebos.com/models/fast_scnn_cityscape.tar) | 32 | false | 0.6964 |
+Test environment: Python 3.7.3, V100, cuDNN 7.6.2.
--- a/docs/multiple_gpus_train_and_mixed_precision_train.md
+++ b/docs/multiple_gpus_train_and_mixed_precision_train.md
@@ -4,7 +4,7 @@
 * PaddlePaddle >= 1.6.1
 * NVIDIA NCCL >= 2.4.7
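-For environment setup, data, and pretrained model preparation, see the [installation guide](./installation.md) and the [PaddleSeg usage guide](./usage.md)
+For environment setup, data, and pretrained model preparation, see the [PaddleSeg usage guide](./usage.md)
 ### Multi-process training example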

--- a/pdseg/__init__.py
+++ b/pdseg/__init__.py
@@ -14,4 +14,4 @@
 # limitations under the License.
 import models
 import utils
-import tools
+from . import tools
\ No newline at end of file
--- a/pdseg/loss.py
+++ b/pdseg/loss.py
@@ -71,6 +71,7 @@ def softmax_with_loss(logit, label, ignore_mask=None, num_classes=2, weight=None
    ignore_mask.stop_gradient = True
    return avg_loss
 # to change: how to apply ignore index and ignore mask
 def dice_loss(logit, label, ignore_mask=None, epsilon=0.00001):
    if logit.shape[1] != 1 or label.shape[1] != 1 or ignore_mask.shape[1] != 1:
@@ -93,6 +94,7 @@ def dice_loss(logit, label, ignore_mask=None, epsilon=0.00001):
    ignore_mask.stop_gradient = True
    return fluid.layers.reduce_mean(dice_score)
 def bce_loss(logit, label, ignore_mask=None):
    if logit.shape[1] != 1 or label.shape[1] != 1 or ignore_mask.shape[1] != 1:
        raise Exception("bce loss is only applicable to binary classification")
@@ -112,16 +114,18 @@ def multi_softmax_with_loss(logits, label, ignore_mask=None, num_classes=2, weig
    if isinstance(logits, tuple):
        avg_loss = 0
        for i, logit in enumerate(logits):
-            logit_label = fluid.layers.resize_nearest(label, logit.shape[2:])
-            logit_mask = (logit_label.astype('int32') !=
+            if label.shape[2] != logit.shape[2] or label.shape[3] != logit.shape[3]:
+                label = fluid.layers.resize_nearest(label, logit.shape[2:])
+            logit_mask = (label.astype('int32') !=
                          cfg.DATASET.IGNORE_INDEX).astype('int32')
-            loss = softmax_with_loss(logit, logit_label, logit_mask,
+            loss = softmax_with_loss(logit, label, logit_mask,
                                     num_classes)
            avg_loss += cfg.MODEL.MULTI_LOSS_WEIGHT[i] * loss
    else:
        avg_loss = softmax_with_loss(logits, label, ignore_mask, num_classes, weight=weight)
    return avg_loss
 def multi_dice_loss(logits, label, ignore_mask=None):
    if isinstance(logits, tuple):
        avg_loss = 0
@@ -135,6 +139,7 @@ def multi_dice_loss(logits, label, ignore_mask=None):
        avg_loss = dice_loss(logits, label, ignore_mask)
    return avg_loss
 def multi_bce_loss(logits, label, ignore_mask=None):
    if isinstance(logits, tuple):
        avg_loss = 0
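A remark on the `multi_softmax_with_loss` change in this file: the new loop rebinds `label` whenever a logit has a different spatial size, so any later logit in the tuple would be compared against the already-resized label rather than the original one. Whether that matters depends on the logit sizes a model returns; Fast-SCNN resizes all of its logits to the input size, so the resize branch should not trigger there. The NumPy sketch below illustrates one possible alternative that always resizes from the original label. `resize_nearest` here is a hypothetical stand-in written for this example, not `fluid.layers.resize_nearest`, and `labels_for_logits` is likewise an illustrative name.

```python
import numpy as np


def resize_nearest(label, out_hw):
    """Nearest-neighbour resize of an (N, 1, H, W) integer label map."""
    h, w = label.shape[2:]
    out_h, out_w = out_hw
    rows = np.arange(out_h) * h // out_h  # source row for each output row
    cols = np.arange(out_w) * w // out_w  # source column for each output column
    return label[:, :, rows[:, None], cols[None, :]]


def labels_for_logits(label, logit_shapes):
    """Return one label map per logit, each resized from the ORIGINAL label."""
    resized = []
    for shape in logit_shapes:
        hw = tuple(shape[2:])
        # Keep `label` untouched so a later, larger logit is not matched
        # against a label that was already downsampled.
        resized.append(label if label.shape[2:] == hw else resize_nearest(label, hw))
    return resized


label = np.arange(64).reshape(1, 1, 8, 8)
labels = labels_for_logits(label, [(1, 3, 4, 4), (1, 3, 8, 8)])
print([l.shape for l in labels])
```

The second entry is the untouched full-resolution label, which is what a full-size logit would need.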

--- a/pdseg/models/libs/model_libs.py
+++ b/pdseg/models/libs/model_libs.py
@@ -164,3 +164,37 @@ def separate_conv(input, channel, stride, filter, dilation=1, act=None):
        input = bn(input)
        if act: input = act(input)
    return input
+def conv_bn_layer(input,
+                  filter_size,
+                  num_filters,
+                  stride,
+                  padding,
+                  channels=None,
+                  num_groups=1,
+                  if_act=True,
+                  name=None,
+                  use_cudnn=True):
+    conv = fluid.layers.conv2d(
+        input=input,
+        num_filters=num_filters,
+        filter_size=filter_size,
+        stride=stride,
+        padding=padding,
+        groups=num_groups,
+        act=None,
+        use_cudnn=use_cudnn,
+        param_attr=fluid.ParamAttr(name=name + '_weights'),
+        bias_attr=False)
+    bn_name = name + '_bn'
+    bn = fluid.layers.batch_norm(
+        input=conv,
+        param_attr=fluid.ParamAttr(name=bn_name + "_scale"),
+        bias_attr=fluid.ParamAttr(name=bn_name + "_offset"),
+        moving_mean_name=bn_name + '_mean',
+        moving_variance_name=bn_name + '_variance')
+    if if_act:
+        return fluid.layers.relu6(bn)
+    else:
+        return bn
\ No newline at end of file
--- a/pdseg/models/model_builder.py
+++ b/pdseg/models/model_builder.py
@@ -24,7 +24,7 @@ from utils.config import cfg
 from loss import multi_softmax_with_loss
 from loss import multi_dice_loss
 from loss import multi_bce_loss
-from models.modeling import deeplab, unet, icnet, pspnet, hrnet
+from models.modeling import deeplab, unet, icnet, pspnet, hrnet, fast_scnn
 class ModelPhase(object):
@@ -81,6 +81,8 @@ def seg_model(image, class_num):
        logits = pspnet.pspnet(image, class_num)
    elif model_name == 'hrnet':
        logits = hrnet.hrnet(image, class_num)
+    elif model_name == 'fast_scnn':
+        logits = fast_scnn.fast_scnn(image, class_num)
    else:
        raise Exception(
            "unknow model name, only support unet, deeplabv3p, icnet, pspnet, hrnet"

--- a/pdseg/models/modeling/deeplab.py
+++ b/pdseg/models/modeling/deeplab.py
@@ -27,6 +27,7 @@ from models.libs.model_libs import separate_conv
 from models.backbone.mobilenet_v2 import MobileNetV2 as mobilenet_backbone
 from models.backbone.xception import Xception as xception_backbone
 def encoder(input):
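    # Encoder configuration: uses the ASPP architecture, with pooling + 1x1 conv + three atrous convolutions at different rates in parallel, followed by a 1x1 conv after concat
    # ASPP_WITH_SEP_CONV: defaults to True and uses depthwise separable convolution; otherwise regular convolution is used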
@@ -47,8 +48,7 @@ def encoder(input):
    with scope('encoder'):
        channel = 256
        with scope("image_pool"):
-            image_avg = fluid.layers.reduce_mean(
-                input, [2, 3], keep_dim=True)
+            image_avg = fluid.layers.reduce_mean(input, [2, 3], keep_dim=True)
            image_avg = bn_relu(
                conv(
                    image_avg,
@@ -250,14 +250,15 @@ def deeplabv3p(img, num_classes):
            regularization_coeff=0.0),
        initializer=fluid.initializer.TruncatedNormal(loc=0.0, scale=0.01))
    with scope('logit'):
-        logit = conv(
-            data,
-            num_classes,
-            1,
-            stride=1,
-            padding=0,
-            bias_attr=True,
-            param_attr=param_attr)
+        with fluid.name_scope('last_conv'):
+            logit = conv(
+                data,
+                num_classes,
+                1,
+                stride=1,
+                padding=0,
+                bias_attr=True,
+                param_attr=param_attr)
        logit = fluid.layers.resize_bilinear(logit, img.shape[2:])
    return logit
--- a/pdseg/models/modeling/fast_scnn.py
+++ b/pdseg/models/modeling/fast_scnn.py
+# coding: utf8
+# copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import paddle.fluid as fluid
+from models.libs.model_libs import scope
+from models.libs.model_libs import bn, bn_relu, relu, conv_bn_layer
+from models.libs.model_libs import conv, avg_pool
+from models.libs.model_libs import separate_conv
+from utils.config import cfg
+def learning_to_downsample(x, dw_channels1=32, dw_channels2=48, out_channels=64):
+    x = relu(bn(conv(x, dw_channels1, 3, 2)))
+    with scope('dsconv1'):
+        x = separate_conv(x, dw_channels2, stride=2, filter=3, act=fluid.layers.relu)
+    with scope('dsconv2'):
+        x = separate_conv(x, out_channels, stride=2, filter=3, act=fluid.layers.relu)
+    return x
+def shortcut(input, data_residual):
+    return fluid.layers.elementwise_add(input, data_residual)
+def dropout2d(input, prob, is_train=False):
+    if not is_train:
+        return input
+    channels = input.shape[1]
+    keep_prob = 1.0 - prob
+    random_tensor = keep_prob + fluid.layers.uniform_random_batch_size_like(input, [-1, channels, 1, 1], min=0., max=1.)
+    binary_tensor = fluid.layers.floor(random_tensor)
+    output = input / keep_prob * binary_tensor
+    return output
+def inverted_residual_unit(input,
+                           num_in_filter,
+                           num_filters,
+                           ifshortcut,
+                           stride,
+                           filter_size,
+                           padding,
+                           expansion_factor,
+                           name=None):
+    num_expfilter = int(round(num_in_filter * expansion_factor))
+    channel_expand = conv_bn_layer(
+        input=input,
+        num_filters=num_expfilter,
+        filter_size=1,
+        stride=1,
+        padding=0,
+        num_groups=1,
+        if_act=True,
+        name=name + '_expand')
+    bottleneck_conv = conv_bn_layer(
+        input=channel_expand,
+        num_filters=num_expfilter,
+        filter_size=filter_size,
+        stride=stride,
+        padding=padding,
+        num_groups=num_expfilter,
+        if_act=True,
+        name=name + '_dwise',
+        use_cudnn=False)
+    depthwise_output = bottleneck_conv
+    linear_out = conv_bn_layer(
+        input=bottleneck_conv,
+        num_filters=num_filters,
+        filter_size=1,
+        stride=1,
+        padding=0,
+        num_groups=1,
+        if_act=False,
+        name=name + '_linear')
+    if ifshortcut:
+        out = shortcut(input=input, data_residual=linear_out)
+        return out, depthwise_output
+    else:
+        return linear_out, depthwise_output
+def inverted_blocks(input, in_c, t, c, n, s, name=None):
+    first_block, depthwise_output = inverted_residual_unit(
+        input=input,
+        num_in_filter=in_c,
+        num_filters=c,
+        ifshortcut=False,
+        stride=s,
+        filter_size=3,
+        padding=1,
+        expansion_factor=t,
+        name=name + '_1')
+    last_residual_block = first_block
+    last_c = c
+    for i in range(1, n):
+        last_residual_block, depthwise_output = inverted_residual_unit(
+            input=last_residual_block,
+            num_in_filter=last_c,
+            num_filters=c,
+            ifshortcut=True,
+            stride=1,
+            filter_size=3,
+            padding=1,
+            expansion_factor=t,
+            name=name + '_' + str(i + 1))
+    return last_residual_block, depthwise_output
+def psp_module(input, out_features):
+    cat_layers = []
+    sizes = (1, 2, 3, 6)
+    for size in sizes:
+        psp_name = "psp" + str(size)
+        with scope(psp_name):
+            pool = fluid.layers.adaptive_pool2d(input,
+                                                pool_size=[size, size],
+                                                pool_type='avg',
+                                                name=psp_name + '_adapool')
+            data = conv(pool, out_features,
+                        filter_size=1,
+                        bias_attr=False,
+                        name=psp_name + '_conv')
+            data_bn = bn(data, act='relu')
+            interp = fluid.layers.resize_bilinear(data_bn,
+                                                  out_shape=input.shape[2:],
+                                                  name=psp_name + '_interp', align_mode=0)
+        cat_layers.append(interp)
+    cat_layers = [input] + cat_layers
+    out = fluid.layers.concat(cat_layers, axis=1, name='psp_cat')
+    return out
+class FeatureFusionModule:
+    """Feature fusion module"""
+    def __init__(self, higher_in_channels, lower_in_channels, out_channels, scale_factor=4):
+        self.higher_in_channels = higher_in_channels
+        self.lower_in_channels = lower_in_channels
+        self.out_channels = out_channels
+        self.scale_factor = scale_factor
+    def net(self, higher_res_feature, lower_res_feature):
+        h, w = higher_res_feature.shape[2:]
+        lower_res_feature = fluid.layers.resize_bilinear(lower_res_feature, [h, w], align_mode=0)
+        with scope('dwconv'):
+            lower_res_feature = relu(bn(conv(lower_res_feature, self.out_channels, 1)))#(lower_res_feature)
+        with scope('conv_lower_res'):
+            lower_res_feature = bn(conv(lower_res_feature, self.out_channels, 1, bias_attr=True))
+        with scope('conv_higher_res'):
+            higher_res_feature = bn(conv(higher_res_feature, self.out_channels, 1, bias_attr=True))
+        out = higher_res_feature + lower_res_feature
+        return relu(out)
+class GlobalFeatureExtractor():
+    """Global feature extractor module"""
+    def __init__(self, in_channels=64, block_channels=(64, 96, 128), out_channels=128,
+                 t=6, num_blocks=(3, 3, 3)):
+        self.in_channels = in_channels
+        self.block_channels = block_channels
+        self.out_channels = out_channels
+        self.t = t
+        self.num_blocks = num_blocks
+    def net(self, x):
+        x, _ = inverted_blocks(x, self.in_channels, self.t, self.block_channels[0],
+                               self.num_blocks[0], 2, 'inverted_block_1')
+        x, _ = inverted_blocks(x, self.block_channels[0], self.t, self.block_channels[1],
+                               self.num_blocks[1], 2, 'inverted_block_2')
+        x, _ = inverted_blocks(x, self.block_channels[1], self.t, self.block_channels[2],
+                               self.num_blocks[2], 1, 'inverted_block_3')
+        x = psp_module(x, self.block_channels[2] // 4)
+        with scope('out'):
+            x = relu(bn(conv(x, self.out_channels, 1)))
+        return x
+class Classifier:
+    """Classifier"""
+    def __init__(self, dw_channels, num_classes, stride=1):
+        self.dw_channels = dw_channels
+        self.num_classes = num_classes
+        self.stride = stride
+    def net(self, x):
+        with scope('dsconv1'):
+            x = separate_conv(x, self.dw_channels, stride=self.stride, filter=3, act=fluid.layers.relu)
+        with scope('dsconv2'):
+            x = separate_conv(x, self.dw_channels, stride=self.stride, filter=3, act=fluid.layers.relu)
+        x = dropout2d(x, 0.1, is_train=cfg.PHASE=='train')
+        x = conv(x, self.num_classes, 1, bias_attr=True)
+        return x
+def aux_layer(x, num_classes):
+    x = relu(bn(conv(x, 32, 3, padding=1)))
+    x = dropout2d(x, 0.1, is_train=(cfg.PHASE == 'train'))
+    with scope('logit'):
+        x = conv(x, num_classes, 1, bias_attr=True)
+    return x
+def fast_scnn(img, num_classes):
+    size = img.shape[2:]
+    classifier = Classifier(128, num_classes)
+    global_feature_extractor = GlobalFeatureExtractor(64, [64, 96, 128], 128, 6, [3, 3, 3])
+    feature_fusion = FeatureFusionModule(64, 128, 128)
+    with scope('learning_to_downsample'):
+        higher_res_features = learning_to_downsample(img, 32, 48, 64)
+    with scope('global_feature_extractor'):
+        lower_res_feature = global_feature_extractor.net(higher_res_features)
+    with scope('feature_fusion'):
+        x = feature_fusion.net(higher_res_features, lower_res_feature)
+    with scope('classifier'):
+        logit = classifier.net(x)
+        logit = fluid.layers.resize_bilinear(logit, size, align_mode=0)
+    if len(cfg.MODEL.MULTI_LOSS_WEIGHT) == 3:
+        with scope('aux_layer_higher'):
+            higher_logit = aux_layer(higher_res_features, num_classes)
+            higher_logit = fluid.layers.resize_bilinear(higher_logit, size, align_mode=0)
+        with scope('aux_layer_lower'):
+            lower_logit = aux_layer(lower_res_feature, num_classes)
+            lower_logit = fluid.layers.resize_bilinear(lower_logit, size, align_mode=0)
+        return logit, higher_logit, lower_logit
+    elif len(cfg.MODEL.MULTI_LOSS_WEIGHT) == 2:
+        with scope('aux_layer_higher'):
+            higher_logit = aux_layer(higher_res_features, num_classes)
+            higher_logit = fluid.layers.resize_bilinear(higher_logit, size, align_mode=0)
+        return logit, higher_logit
+    return logit
\ No newline at end of file
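On `dropout2d` in `fast_scnn.py`: the function builds a per-channel mask by adding `keep_prob` to a uniform sample in [0, 1) and taking the floor, which yields 1 with probability `keep_prob` and 0 otherwise. The mask has shape `[N, C, 1, 1]`, so entire feature maps are dropped together, and the surviving channels are rescaled by `1 / keep_prob`. A NumPy sketch of the same arithmetic follows; `dropout2d_np` is a hypothetical name and the explicit `rng` argument is an addition made for this illustration.

```python
import numpy as np


def dropout2d_np(x, prob, rng, is_train=True):
    """Channel-wise dropout on an (N, C, H, W) array, mirroring dropout2d."""
    if not is_train:
        return x
    n, c = x.shape[:2]
    keep_prob = 1.0 - prob
    # floor(keep_prob + U[0, 1)) is 1 with probability keep_prob, else 0
    mask = np.floor(keep_prob + rng.uniform(0.0, 1.0, size=(n, c, 1, 1)))
    # Rescale kept channels so the expected activation is unchanged
    return x / keep_prob * mask


rng = np.random.default_rng(0)
x = np.ones((4, 8, 3, 3))
y = dropout2d_np(x, 0.5, rng)
print(np.unique(y))
```

Each channel of `y` is either all zeros or all `1 / keep_prob`, and in evaluation mode the input is returned unchanged.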
--- a/pdseg/reader.py
+++ b/pdseg/reader.py
@@ -98,8 +98,8 @@ class SegDataset(object):
        # Re-shuffle file list
        if self.shuffle and cfg.NUM_TRAINERS > 1:
            np.random.RandomState(self.shuffle_seed).shuffle(self.all_lines)
-            num_lines = len(self.all_lines) // self.num_trainers
-            self.lines = self.all_lines[num_lines * self.trainer_id: num_lines * (self.trainer_id + 1)]
+            num_lines = len(self.all_lines) // cfg.NUM_TRAINERS
+            self.lines = self.all_lines[num_lines * cfg.TRAINER_ID: num_lines * (cfg.TRAINER_ID + 1)]
            self.shuffle_seed += 1
        elif self.shuffle:
            np.random.shuffle(self.lines)
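The sharding arithmetic in this `reader.py` hunk can be checked in isolation. The sketch below reproduces only the slicing, with a hypothetical `shard` helper that takes `num_trainers` and `trainer_id` as plain arguments instead of reading them from `cfg`.

```python
def shard(lines, num_trainers, trainer_id):
    """Return the contiguous slice of `lines` assigned to one trainer."""
    num_lines = len(lines) // num_trainers
    return lines[num_lines * trainer_id:num_lines * (trainer_id + 1)]


all_lines = list(range(10))
shards = [shard(all_lines, 3, i) for i in range(3)]
print(shards)
```

Because of the floor division, every trainer gets the same number of lines and any remainder at the end of the shuffled list is left out for that epoch (here, one item out of ten).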

--- a/pdseg/utils/config.py
+++ b/pdseg/utils/config.py
@@ -236,3 +236,19 @@ cfg.FREEZE.MODEL_FILENAME = '__model__'
 cfg.FREEZE.PARAMS_FILENAME = '__params__'
 # Path where the inference model parameters are saved
 cfg.FREEZE.SAVE_DIR = 'freeze_model'
+########################## paddle-slim ######################################
+cfg.SLIM.KNOWLEDGE_DISTILL_IS_TEACHER = False
+cfg.SLIM.KNOWLEDGE_DISTILL = False
+cfg.SLIM.KNOWLEDGE_DISTILL_TEACHER_MODEL_DIR = ""
+cfg.SLIM.NAS_PORT = 23333
+cfg.SLIM.NAS_ADDRESS = ""
+cfg.SLIM.NAS_SEARCH_STEPS = 100
+cfg.SLIM.NAS_START_EVAL_EPOCH = 0
+cfg.SLIM.NAS_IS_SERVER = True
+cfg.SLIM.NAS_SPACE_NAME = ""
+cfg.SLIM.PRUNE_PARAMS = ''
+cfg.SLIM.PRUNE_RATIOS = []
--- a/pretrained_model/download_model.py
+++ b/pretrained_model/download_model.py
@@ -81,6 +81,8 @@ model_urls = {
    "https://paddleseg.bj.bcebos.com/models/pspnet101_cityscapes.tgz",
    "hrnet_w18_bn_cityscapes":
    "https://paddleseg.bj.bcebos.com/models/hrnet_w18_bn_cityscapes.tgz",
+    "fast_scnn_cityscapes":
+    "https://paddleseg.bj.bcebos.com/models/fast_scnn_cityscape.tar",
 }
 if __name__ == "__main__":

--- a/slim/distillation/README.md
+++ b/slim/distillation/README.md
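+>Before running this example, install PaddleSlim and Paddle 1.6 or later
+# PaddleSeg Distillation Tutorial
+Before reading this tutorial, make sure you have gone through the [PaddleSeg usage guide](../../docs/usage.md) and related chapters so that you have a basic understanding of PaddleSeg
+This document describes how to use [PaddleSlim](https://paddlepaddle.github.io/PaddleSlim) to distill the models in the segmentation library.
+Unless otherwise noted, all operations in this tutorial are run from the `PaddleSeg/` directory.
+## Overview
+This example uses the [distillation strategy](https://paddlepaddle.github.io/PaddleSlim/algo/algo/#3) provided by PaddleSlim to run distillation training on models in the segmentation library.
+Before reading this example, we recommend that you first review the following:
+- [PaddleSlim distillation API documentation](https://paddlepaddle.github.io/PaddleSlim/api/single_distiller_api/)
+## Installing PaddleSlim
+Install PaddleSlim by following the steps in the [PaddleSlim documentation](https://paddlepaddle.github.io/PaddleSlim/)
+## Distillation strategy
+For how to use the distillation API, refer to the PaddleSlim distillation API documentation
+This tutorial uses Deeplabv3-xception distilling a Deeplabv3-mobilenet model as its example. First, to get an overall picture of the `student model` and the `teacher model` and to pin down what to distill, we use the following code to inspect the names and shapes of the Variables in each network: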
+```python
+# Inspect the Variables of the student model
+student_vars = []
+for v in fluid.default_main_program().list_vars():
+    try:
+        student_vars.append((v.name, v.shape))
+    except:
+        pass
+print("="*50+"student_model_vars"+"="*50)
+print(student_vars)
+# Inspect the Variables of the teacher model
+teacher_vars = []
+for v in teacher_program.list_vars():
+    try:
+        teacher_vars.append((v.name, v.shape))
+    except:
+        pass
+print("="*50+"teacher_model_vars"+"="*50)
+print(teacher_vars)
+```
+Comparing the two shows that the feature maps fed into the `loss` by the `student model` and the `teacher model` are, respectively:
+```bash
+# student model
+bilinear_interp_0.tmp_0
+# teacher model
+bilinear_interp_2.tmp_0
+```
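+The two have the same shape and each sits at the output of its network, so we add a distillation loss between them with `l2_loss`. Note that the teacher's Variables are automatically given a `name_prefix` during the merge step, so the prefix `"teacher_"` must be added here as well. For the merge step, see the [distillation API documentation](https://paddlepaddle.github.io/PaddleSlim/api/single_distiller_api/#merge)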
+```python
+distill_loss = l2_loss('teacher_bilinear_interp_2.tmp_0', 'bilinear_interp_0.tmp_0')
+```
+Following the same procedure, you can also choose other losses for the distillation strategy. PaddleSlim supports `FSP_loss`, `L2_loss`, `softmax_with_cross_entropy_loss`, and any custom loss.
+## Training
+Write the compression script `train_distill.py` based on [PaddleSeg/pdseg/train.py](../../pdseg/train.py).
+The script defines teacher_model and student_model and uses the output of teacher_model to guide the training of student_model
+### Running the example
+Download the teacher's pretrained model ([deeplabv3p_xception65_bn_cityscapes.tgz](https://paddleseg.bj.bcebos.com/models/xception65_bn_cityscapes.tgz)) and the student's pretrained model ([mobilenet_cityscapes.tgz](https://paddleseg.bj.bcebos.com/models/mobilenet_cityscapes.tgz)),
+then set the pretrained model path in the student config file (./slim/distillation/cityscape.yaml):
+```
+TRAIN:
+    PRETRAINED_MODEL_DIR: your_student_pretrained_model_dir
+```
+Set the pretrained model path in the teacher config file (./slim/distillation/cityscape_teacher.yaml):
+```
+SLIM:
+    KNOWLEDGE_DISTILL_TEACHER_MODEL_DIR: your_teacher_pretrained_model_dir
+```
+Run the following command to start training. An evaluation is run every `cfg.TRAIN.SNAPSHOT_EPOCH` epochs.
+```shell
+CUDA_VISIBLE_DEVICES=0,1 \
+python -m paddle.distributed.launch ./slim/distillation/train_distill.py \
+--log_steps 10 --cfg ./slim/distillation/cityscape.yaml \
+--teacher_cfg ./slim/distillation/cityscape_teacher.yaml \
+--use_gpu \
+--use_mpio \
+--do_eval 
+```
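+Note: to change parameters in the config files, edit the corresponding file directly; overriding them from the command line is not yet supported.
+## Evaluation and prediction
+For evaluation and prediction after training, see the [Quick Start](../../README.md#快速入门) and [Basic Features](../../README.md#基础功能) sections of PaddleSeg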
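To make the distillation term in this README concrete: an L2 distillation loss between the teacher and student output feature maps is, in essence, the mean squared difference of the two tensors. The NumPy sketch below illustrates that arithmetic only. It is an assumption about the general form of such a loss, not PaddleSlim's `l2_loss` implementation, and `l2_distill_loss` is a hypothetical name.

```python
import numpy as np


def l2_distill_loss(teacher_map, student_map):
    """Mean squared difference between two same-shaped feature maps."""
    if teacher_map.shape != student_map.shape:
        raise ValueError("teacher and student feature maps must share a shape")
    return float(np.mean((teacher_map - student_map) ** 2))


# Toy (N, C, H, W) outputs; 19 channels matches NUM_CLASSES for Cityscapes.
teacher = np.zeros((2, 19, 4, 4))
student = np.ones((2, 19, 4, 4))
print(l2_distill_loss(teacher, student))
```

The shape check reflects the requirement stated in the README that the two feature maps have the same shape before a loss is attached to them.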
--- a/slim/distillation/cityscape.yaml
+++ b/slim/distillation/cityscape.yaml
+EVAL_CROP_SIZE: (2049, 1025) # (width, height), for unpadding rangescaling and stepscaling
+TRAIN_CROP_SIZE: (769, 769) # (width, height), for unpadding rangescaling and stepscaling
+AUG:
+    AUG_METHOD: "stepscaling" # one of: unpadding, rangescaling, stepscaling
+    FIX_RESIZE_SIZE: (640, 640) # (width, height), for unpadding
+    INF_RESIZE_VALUE: 500  # for rangescaling
+    MAX_RESIZE_VALUE: 600  # for rangescaling
+    MIN_RESIZE_VALUE: 400  # for rangescaling
+    MAX_SCALE_FACTOR: 2.0  # for stepscaling
+    MIN_SCALE_FACTOR: 0.5  # for stepscaling
+    SCALE_STEP_SIZE: 0.25  # for stepscaling
+    MIRROR: True
+    FLIP: True
+    FLIP_RATIO: 0.2
+    RICH_CROP:
+        ENABLE: False
+        ASPECT_RATIO: 0.33
+        BLUR: True
+        BLUR_RATIO: 0.1
+        MAX_ROTATION: 15
+        MIN_AREA_RATIO: 0.5
+        BRIGHTNESS_JITTER_RATIO: 0.5
+        CONTRAST_JITTER_RATIO: 0.5
+        SATURATION_JITTER_RATIO: 0.5
+BATCH_SIZE: 16
+MEAN: [0.5, 0.5, 0.5]
+STD: [0.5, 0.5, 0.5]
+DATASET:
+    DATA_DIR: "./dataset/cityscapes/"
+    IMAGE_TYPE: "rgb"  # one of: rgb, rgba
+    NUM_CLASSES: 19
+    TEST_FILE_LIST: "dataset/cityscapes/val.list"
+    TRAIN_FILE_LIST: "dataset/cityscapes/train.list"
+    VAL_FILE_LIST: "dataset/cityscapes/val.list"
+    IGNORE_INDEX: 255
+FREEZE:
+    MODEL_FILENAME: "model"
+    PARAMS_FILENAME: "params"
+MODEL:
+    DEFAULT_NORM_TYPE: "bn"
+    MODEL_NAME: "deeplabv3p"
+    DEEPLAB:
+        BACKBONE: "mobilenet"
+        ASPP_WITH_SEP_CONV: True
+        DECODER_USE_SEP_CONV: True
+        ENCODER_WITH_ASPP: False
+        ENABLE_DECODER: False
+TEST:
+    TEST_MODEL: "snapshots/cityscape_v5/final/"
+TRAIN:
+    MODEL_SAVE_DIR: "snapshots/cityscape_mbv2_kd_e100_1/"
+    PRETRAINED_MODEL_DIR: u"pretrained_model/mobilenet_cityscapes"
+    SNAPSHOT_EPOCH: 5
+    SYNC_BATCH_NORM: True
+SOLVER:
+    LR: 0.001
+    LR_POLICY: "poly"
+    OPTIMIZER: "sgd"
+    NUM_EPOCHS: 100
--- a/slim/distillation/cityscape_teacher.yaml
+++ b/slim/distillation/cityscape_teacher.yaml
+EVAL_CROP_SIZE: (2049, 1025) # (width, height), for unpadding rangescaling and stepscaling
+TRAIN_CROP_SIZE: (769, 769) # (width, height), for unpadding rangescaling and stepscaling
+AUG:
+    AUG_METHOD: "stepscaling" # one of: unpadding, rangescaling, stepscaling
+    FIX_RESIZE_SIZE: (640, 640) # (width, height), for unpadding
+    INF_RESIZE_VALUE: 500  # for rangescaling
+    MAX_RESIZE_VALUE: 600  # for rangescaling
+    MIN_RESIZE_VALUE: 400  # for rangescaling
+    MAX_SCALE_FACTOR: 2.0  # for stepscaling
+    MIN_SCALE_FACTOR: 0.5  # for stepscaling
+    SCALE_STEP_SIZE: 0.25  # for stepscaling
+    MIRROR: True
+    FLIP: True
+    FLIP_RATIO: 0.2
+    RICH_CROP:
+        ENABLE: False
+        ASPECT_RATIO: 0.33
+        BLUR: True
+        BLUR_RATIO: 0.1
+        MAX_ROTATION: 15
+        MIN_AREA_RATIO: 0.5
+        BRIGHTNESS_JITTER_RATIO: 0.5
+        CONTRAST_JITTER_RATIO: 0.5
+        SATURATION_JITTER_RATIO: 0.5
+BATCH_SIZE: 16
+MEAN: [0.5, 0.5, 0.5]
+STD: [0.5, 0.5, 0.5]
+DATASET:
+    DATA_DIR: "./dataset/cityscapes/"
+    IMAGE_TYPE: "rgb"  # one of: rgb, rgba
+    NUM_CLASSES: 19
+    TEST_FILE_LIST: "dataset/cityscapes/val.list"
+    TRAIN_FILE_LIST: "dataset/cityscapes/train.list"
+    VAL_FILE_LIST: "dataset/cityscapes/val.list"
+    IGNORE_INDEX: 255
+FREEZE:
+    MODEL_FILENAME: "model"
+    PARAMS_FILENAME: "params"
+MODEL:
+    DEFAULT_NORM_TYPE: "bn"
+    MODEL_NAME: "deeplabv3p"
+    DEEPLAB:
+        BACKBONE: "xception_65"
+        ASPP_WITH_SEP_CONV: True
+        DECODER_USE_SEP_CONV: True
+        ENCODER_WITH_ASPP: True
+        ENABLE_DECODER: True
+TEST:
+    TEST_MODEL: "snapshots/cityscape_v5/final/"
+TRAIN:
+    MODEL_SAVE_DIR: "snapshots/cityscape_v7/"
+    PRETRAINED_MODEL_DIR: u"pretrain/deeplabv3plus_gn_init"
+    SNAPSHOT_EPOCH: 5
+    SYNC_BATCH_NORM: True
+SOLVER:
+    LR: 0.001
+    LR_POLICY: "poly"
+    OPTIMIZER: "sgd"
+    NUM_EPOCHS: 100
+SLIM:
+    KNOWLEDGE_DISTILL_IS_TEACHER: True
+    KNOWLEDGE_DISTILL: True
+    KNOWLEDGE_DISTILL_TEACHER_MODEL_DIR: "pretrained_model/xception65_bn_cityscapes"
--- a/slim/distillation/model_builder.py
+++ b/slim/distillation/model_builder.py
+# coding: utf8
+# copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import struct
+import paddle.fluid as fluid
+import numpy as np
+from paddle.fluid.proto.framework_pb2 import VarType
+import solver
+from utils.config import cfg
+from loss import multi_softmax_with_loss
+from loss import multi_dice_loss
+from loss import multi_bce_loss
+from models.modeling import deeplab, unet, icnet, pspnet, hrnet, fast_scnn
+class ModelPhase(object):
+    """
+    Standard name for model phase in PaddleSeg
+    The following standard keys are defined:
+    * `TRAIN`: training mode.
+    * `EVAL`: testing/evaluation mode.
+    * `PREDICT`: prediction/inference mode.
+    * `VISUAL` : visualization mode
+    """
+    TRAIN = 'train'
+    EVAL = 'eval'
+    PREDICT = 'predict'
+    VISUAL = 'visual'
+    @staticmethod
+    def is_train(phase):
+        return phase == ModelPhase.TRAIN
+    @staticmethod
+    def is_predict(phase):
+        return phase == ModelPhase.PREDICT
+    @staticmethod
+    def is_eval(phase):
+        return phase == ModelPhase.EVAL
+    @staticmethod
+    def is_visual(phase):
+        return phase == ModelPhase.VISUAL
+    @staticmethod
+    def is_valid_phase(phase):
+        """ Check valid phase """
+        if ModelPhase.is_train(phase) or ModelPhase.is_predict(phase) \
+                or ModelPhase.is_eval(phase) or ModelPhase.is_visual(phase):
+            return True
+        return False
+def seg_model(image, class_num):
+    model_name = cfg.MODEL.MODEL_NAME
+    if model_name == 'unet':
+        logits = unet.unet(image, class_num)
+    elif model_name == 'deeplabv3p':
+        logits = deeplab.deeplabv3p(image, class_num)
+    elif model_name == 'icnet':
+        logits = icnet.icnet(image, class_num)
+    elif model_name == 'pspnet':
+        logits = pspnet.pspnet(image, class_num)
+    elif model_name == 'hrnet':
+        logits = hrnet.hrnet(image, class_num)
+    elif model_name == 'fast_scnn':
+        logits = fast_scnn.fast_scnn(image, class_num)
+    else:
+        raise Exception(
+            "unknown model name, only support unet, deeplabv3p, icnet, pspnet, hrnet, fast_scnn"
+        )
+    return logits
+def softmax(logit):
+    logit = fluid.layers.transpose(logit, [0, 2, 3, 1])
+    logit = fluid.layers.softmax(logit)
+    logit = fluid.layers.transpose(logit, [0, 3, 1, 2])
+    return logit
+def sigmoid_to_softmax(logit):
+    """
+    Convert a one-channel sigmoid logit into a two-channel
+    (background, foreground) probability map
+    """
+    logit = fluid.layers.transpose(logit, [0, 2, 3, 1])
+    logit = fluid.layers.sigmoid(logit)
+    logit_back = 1 - logit
+    logit = fluid.layers.concat([logit_back, logit], axis=-1)
+    logit = fluid.layers.transpose(logit, [0, 3, 1, 2])
+    return logit
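The one-channel to two-channel conversion in `sigmoid_to_softmax` can be checked outside of Paddle. The following is a minimal NumPy sketch, not the project's code: the helper names are hypothetical, and it only illustrates that stacking `1 - sigmoid(x)` and `sigmoid(x)` gives the same result as a softmax over the logit pair `(0, x)`.

```python
import numpy as np


def sigmoid_to_two_channel(logit):
    """logit: array of shape (N, 1, H, W); returns (N, 2, H, W) probabilities."""
    fg = 1.0 / (1.0 + np.exp(-logit))  # foreground probability
    bg = 1.0 - fg                      # background probability
    return np.concatenate([bg, fg], axis=1)


def softmax_over_channels(logit):
    """Numerically stable softmax along the channel axis (axis 1)."""
    shifted = logit - logit.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


x = np.random.RandomState(0).randn(2, 1, 3, 4)
two_channel = sigmoid_to_two_channel(x)
# Equivalent view: softmax over the pair of logits (0, x)
reference = softmax_over_channels(np.concatenate([np.zeros_like(x), x], axis=1))
```

Because the two channels sum to one at every pixel, a later `argmax` over the channel axis behaves the same way as it does for a multi-class softmax output.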
+def export_preprocess(image):
+    """Preprocessing pipeline built into the exported model."""
+    image = fluid.layers.transpose(image, [0, 3, 1, 2])
+    origin_shape = fluid.layers.shape(image)[-2:]
+    # Resize according to the configured AUG_METHOD
+    if cfg.AUG.AUG_METHOD == 'unpadding':
+        h_fix = cfg.AUG.FIX_RESIZE_SIZE[1]
+        w_fix = cfg.AUG.FIX_RESIZE_SIZE[0]
+        image = fluid.layers.resize_bilinear(
+            image, out_shape=[h_fix, w_fix], align_corners=False, align_mode=0)
+    elif cfg.AUG.AUG_METHOD == 'rangescaling':
+        size = cfg.AUG.INF_RESIZE_VALUE
+        value = fluid.layers.reduce_max(origin_shape)
+        scale = float(size) / value.astype('float32')
+        image = fluid.layers.resize_bilinear(
+            image, scale=scale, align_corners=False, align_mode=0)
+    # Record the image shape after resizing
+    valid_shape = fluid.layers.shape(image)[-2:]
+    # Pad to EVAL_CROP_SIZE
+    width = cfg.EVAL_CROP_SIZE[0]
+    height = cfg.EVAL_CROP_SIZE[1]
+    pad_target = fluid.layers.assign(
+        np.array([height, width]).astype('float32'))
+    up = fluid.layers.assign(np.array([0]).astype('float32'))
+    down = pad_target[0] - valid_shape[0]
+    left = up
+    right = pad_target[1] - valid_shape[1]
+    paddings = fluid.layers.concat([up, down, left, right])
+    paddings = fluid.layers.cast(paddings, 'int32')
+    image = fluid.layers.pad2d(image, paddings=paddings, pad_value=127.5)
+    # normalize
+    mean = np.array(cfg.MEAN).reshape(1, len(cfg.MEAN), 1, 1)
+    mean = fluid.layers.assign(mean.astype('float32'))
+    std = np.array(cfg.STD).reshape(1, len(cfg.STD), 1, 1)
+    std = fluid.layers.assign(std.astype('float32'))
+    image = (image / 255 - mean) / std
+    # Reshape so that later layers can read the feature-map shape via image.shape
+    image = fluid.layers.reshape(
+        image, shape=[-1, cfg.DATASET.DATA_DIM, height, width])
+    return image, valid_shape, origin_shape
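The padding and normalization steps of `export_preprocess` can be sketched with NumPy on a single image. This is an illustrative sketch under assumptions: `pad_and_normalize` is a hypothetical helper, the input is a `(C, H, W)` float array in `[0, 255]`, and the image is assumed to be no larger than the crop size.

```python
import numpy as np


def pad_and_normalize(image, crop_size, mean, std, pad_value=127.5):
    """image: float array (C, H, W) in [0, 255]; crop_size: (width, height)."""
    width, height = crop_size
    _, h, w = image.shape
    # Pad only at the bottom and right, as export_preprocess does
    padded = np.pad(
        image,
        ((0, 0), (0, height - h), (0, width - w)),
        mode="constant",
        constant_values=pad_value)
    mean = np.asarray(mean, dtype="float32").reshape(-1, 1, 1)
    std = np.asarray(std, dtype="float32").reshape(-1, 1, 1)
    normalized = (padded / 255.0 - mean) / std
    return normalized.astype("float32"), (h, w)


image = np.full((3, 2, 3), 255.0, dtype="float32")
out, valid_shape = pad_and_normalize(
    image, crop_size=(4, 5), mean=[0.5] * 3, std=[0.5] * 3)
```

With a mean and standard deviation of 0.5, the pad value 127.5 maps to 0 after normalization, so the padded border is neutral. The returned `valid_shape` plays the same role as in the exported graph: it marks the region to slice back out of the prediction.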
+def build_model(main_prog=None, start_prog=None, phase=ModelPhase.TRAIN, **kwargs):
+    if not ModelPhase.is_valid_phase(phase):
+        raise ValueError("ModelPhase {} is not valid!".format(phase))
+    if ModelPhase.is_train(phase):
+        width = cfg.TRAIN_CROP_SIZE[0]
+        height = cfg.TRAIN_CROP_SIZE[1]
+    else:
+        width = cfg.EVAL_CROP_SIZE[0]
+        height = cfg.EVAL_CROP_SIZE[1]
+    image_shape = [cfg.DATASET.DATA_DIM, height, width]
+    grt_shape = [1, height, width]
+    class_num = cfg.DATASET.NUM_CLASSES
+    #with fluid.program_guard(main_prog, start_prog):
+    #    with fluid.unique_name.guard():
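+    # When exporting the model, build image normalization into the graph to
+    # simplify image handling at deployment time; the deployed predictor only
+    # needs to add a batch dimension to the input image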
+    if cfg.SLIM.KNOWLEDGE_DISTILL_IS_TEACHER:
+        image = main_prog.global_block()._clone_variable(kwargs['image'],
+                                                         force_persistable=False)
+        label = main_prog.global_block()._clone_variable(kwargs['label'],
+                                                         force_persistable=False)
+        mask = main_prog.global_block()._clone_variable(kwargs['mask'],
+                                                        force_persistable=False)
+    else:
+        if ModelPhase.is_predict(phase):
+            origin_image = fluid.layers.data(
+                name='image',
+                shape=[-1, -1, -1, cfg.DATASET.DATA_DIM],
+                dtype='float32',
+                append_batch_size=False)
+            image, valid_shape, origin_shape = export_preprocess(
+                origin_image)
+        else:
+            image = fluid.layers.data(
+                name='image', shape=image_shape, dtype='float32')
+        label = fluid.layers.data(
+            name='label', shape=grt_shape, dtype='int32')
+        mask = fluid.layers.data(
+            name='mask', shape=grt_shape, dtype='int32')
+    # Use PyReader for training and evaluation
+    if ModelPhase.is_train(phase) or ModelPhase.is_eval(phase):
+        py_reader = None
+        if not cfg.SLIM.KNOWLEDGE_DISTILL_IS_TEACHER:
+            py_reader = fluid.io.PyReader(
+                feed_list=[image, label, mask],
+                capacity=cfg.DATALOADER.BUF_SIZE,
+                iterable=False,
+                use_double_buffer=True)
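+    loss_type = cfg.SOLVER.LOSS
+    if not isinstance(loss_type, list):
+        # Wrap a single loss name in a list; list("softmax_loss") would split it into characters
+        loss_type = list(loss_type) if isinstance(loss_type, (tuple, set)) else [loss_type]
+    # dice_loss and bce_loss only apply to binary segmentation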
+    if class_num > 2 and (("dice_loss" in loss_type) or
+                          ("bce_loss" in loss_type)):
+        raise Exception(
+            "dice loss and bce loss are only applicable to binary classification"
+        )
+    # For binary segmentation with dice_loss or bce_loss, the final logit has a single output channel
+    if ("dice_loss" in loss_type) or ("bce_loss" in loss_type):
+        class_num = 1
+        if "softmax_loss" in loss_type:
+            raise Exception(
+                "softmax loss can not combine with dice loss or bce loss"
+            )
+    logits = seg_model(image, class_num)
+    # Compute the loss for each selected loss function
+    if ModelPhase.is_train(phase) or ModelPhase.is_eval(phase):
+        loss_valid = False
+        avg_loss_list = []
+        valid_loss = []
+        if "softmax_loss" in loss_type:
+            weight = cfg.SOLVER.CROSS_ENTROPY_WEIGHT
+            avg_loss_list.append(
+                multi_softmax_with_loss(logits, label, mask, class_num, weight))
+            loss_valid = True
+            valid_loss.append("softmax_loss")
+        if "dice_loss" in loss_type:
+            avg_loss_list.append(multi_dice_loss(logits, label, mask))
+            loss_valid = True
+            valid_loss.append("dice_loss")
+        if "bce_loss" in loss_type:
+            avg_loss_list.append(multi_bce_loss(logits, label, mask))
+            loss_valid = True
+            valid_loss.append("bce_loss")
+        if not loss_valid:
+            raise Exception(
+                "SOLVER.LOSS: {} is set incorrectly. It should include at least "
+                "one of (softmax_loss, bce_loss, dice_loss), "
+                "for example: ['softmax_loss'], ['dice_loss'], ['bce_loss', 'dice_loss']"
+                .format(cfg.SOLVER.LOSS))
+        invalid_loss = [x for x in loss_type if x not in valid_loss]
+        if len(invalid_loss) > 0:
+            print(
+                "Warning: the loss {} you set is invalid. It will not be included in the computed loss."
+                .format(invalid_loss))
+        avg_loss = 0
+        for i in range(0, len(avg_loss_list)):
+            avg_loss += avg_loss_list[i]
+    #get pred result in original size
+    if isinstance(logits, tuple):
+        logit = logits[0]
+    else:
+        logit = logits
+    if logit.shape[2:] != label.shape[2:]:
+        logit = fluid.layers.resize_bilinear(logit, label.shape[2:])
+    # return image input and logit output for inference graph prune
+    if ModelPhase.is_predict(phase):
+        # In binary segmentation with dice_loss or bce_loss the logit has one channel; convert it to two channels
+        if class_num == 1:
+            logit = sigmoid_to_softmax(logit)
+        else:
+            logit = softmax(logit)
+        # Keep only the valid (unpadded) region
+        logit = fluid.layers.slice(
+            logit, axes=[2, 3], starts=[0, 0], ends=valid_shape)
+        logit = fluid.layers.resize_bilinear(
+            logit,
+            out_shape=origin_shape,
+            align_corners=False,
+            align_mode=0)
+        logit = fluid.layers.argmax(logit, axis=1)
+        return origin_image, logit
+    if class_num == 1:
+        out = sigmoid_to_softmax(logit)
+        out = fluid.layers.transpose(out, [0, 2, 3, 1])
+    else:
+        out = fluid.layers.transpose(logit, [0, 2, 3, 1])
+    pred = fluid.layers.argmax(out, axis=3)
+    pred = fluid.layers.unsqueeze(pred, axes=[3])
+    if ModelPhase.is_visual(phase):
+        if class_num == 1:
+            logit = sigmoid_to_softmax(logit)
+        else:
+            logit = softmax(logit)
+        return pred, logit
+    if ModelPhase.is_eval(phase):
+        return py_reader, avg_loss, pred, label, mask
+    if ModelPhase.is_train(phase):
+        decayed_lr = None
+        if not cfg.SLIM.KNOWLEDGE_DISTILL:
+            optimizer = solver.Solver(main_prog, start_prog)
+            decayed_lr = optimizer.optimise(avg_loss)
+        # optimizer = solver.Solver(main_prog, start_prog)
+        # decayed_lr = optimizer.optimise(avg_loss)
+        return py_reader, avg_loss, decayed_lr, pred, label, mask, image
+def to_int(string, dest="I"):
+    return struct.unpack(dest, string)[0]
+def parse_shape_from_file(filename):
+    with open(filename, "rb") as file:
+        version = file.read(4)
+        lod_level = to_int(file.read(8), dest="Q")
+        for i in range(lod_level):
+            _size = to_int(file.read(8), dest="Q")
+            _ = file.read(_size)
+        version = file.read(4)
+        tensor_desc_size = to_int(file.read(4))
+        tensor_desc = VarType.TensorDesc()
+        tensor_desc.ParseFromString(file.read(tensor_desc_size))
+    return tuple(tensor_desc.dims)
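`parse_shape_from_file` walks a binary header before it reaches the serialized `TensorDesc`. The sketch below reproduces only that header walk with the standard `struct` module on a synthetic byte string. The byte layout is inferred from the reads in the function above rather than from a format specification, and `read_header` is a hypothetical name.

```python
import io
import struct


def read_header(stream):
    """Read the fixed-size fields that precede the serialized TensorDesc."""
    stream.read(4)                                   # LoD version, unused
    lod_level = struct.unpack("Q", stream.read(8))[0]
    for _ in range(lod_level):
        size = struct.unpack("Q", stream.read(8))[0]
        stream.read(size)                            # skip this LoD level
    stream.read(4)                                   # tensor version, unused
    desc_size = struct.unpack("I", stream.read(4))[0]
    return lod_level, desc_size


# Synthetic header: one LoD level of 16 bytes, then a 10-byte descriptor size
fake = (struct.pack("I", 0) + struct.pack("Q", 1) +
        struct.pack("Q", 16) + b"\x00" * 16 +
        struct.pack("I", 0) + struct.pack("I", 10))
lod_level, desc_size = read_header(io.BytesIO(fake))
```

In the real function the next `desc_size` bytes are handed to `VarType.TensorDesc().ParseFromString`, which yields the parameter's dimensions without loading the tensor data.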
--- a/slim/distillation/train_distill.py
+++ b/slim/distillation/train_distill.py
+# coding: utf8
+# copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import os
+import sys
+LOCAL_PATH = os.path.dirname(os.path.abspath(__file__))
+SEG_PATH = os.path.join(LOCAL_PATH, "../../", "pdseg")
+sys.path.append(SEG_PATH)
+import argparse
+import pprint
+import random
+import shutil
+import functools
+import paddle
+import numpy as np
+import paddle.fluid as fluid
+from utils.config import cfg
+from utils.timer import Timer, calculate_eta
+from metrics import ConfusionMatrix
+from reader import SegDataset
+from model_builder import build_model
+from model_builder import ModelPhase
+from model_builder import parse_shape_from_file
+from eval import evaluate
+from vis import visualize
+from utils import dist_utils
+import solver
+from paddleslim.dist.single_distiller import merge, l2_loss
+def parse_args():
+    parser = argparse.ArgumentParser(description='PaddleSeg training')
+    parser.add_argument(
+        '--cfg',
+        dest='cfg_file',
+        help='Config file for training (and optionally testing)',
+        default=None,
+        type=str)
+    parser.add_argument(
+        '--teacher_cfg',
+        dest='teacher_cfg_file',
+        help='Config file for the teacher model',
+        default=None,
+        type=str)
+    parser.add_argument(
+        '--use_gpu',
+        dest='use_gpu',
+        help='Use gpu or cpu',
+        action='store_true',
+        default=False)
+    parser.add_argument(
+        '--use_mpio',
+        dest='use_mpio',
+        help='Use multiprocess I/O or not',
+        action='store_true',
+        default=False)
+    parser.add_argument(
+        '--log_steps',
+        dest='log_steps',
+        help='Display logging information at every log_steps',
+        default=10,
+        type=int)
+    parser.add_argument(
+        '--debug',
+        dest='debug',
+        help='debug mode, display detail information of training',
+        action='store_true')
+    parser.add_argument(
+        '--use_tb',
+        dest='use_tb',
+        help='whether to record the data during training to Tensorboard',
+        action='store_true')
+    parser.add_argument(
+        '--tb_log_dir',
+        dest='tb_log_dir',
+        help='Tensorboard logging directory',
+        default=None,
+        type=str)
+    parser.add_argument(
+        '--do_eval',
+        dest='do_eval',
+        help='Evaluate the model on every new checkpoint',
+        action='store_true')
+    parser.add_argument(
+        'opts',
+        help='See utils/config.py for all options',
+        default=None,
+        nargs=argparse.REMAINDER)
+    parser.add_argument(
+        '--enable_ce',
+        dest='enable_ce',
+        help='If set True, enable continuous evaluation job. '
+        'This flag is only used for internal test.',
+        action='store_true')
+    return parser.parse_args()
+def save_vars(executor, dirname, program=None, vars=None):
+    """
+    Temporary workaround for variable-saving compatibility on Windows.
+    Will be fixed in PaddlePaddle v1.5.2
+    """
+    save_program = fluid.Program()
+    save_block = save_program.global_block()
+    for each_var in vars:
+        # NOTE: skip variables whose type is RAW
+        if each_var.type == fluid.core.VarDesc.VarType.RAW:
+            continue
+        new_var = save_block.create_var(
+            name=each_var.name,
+            shape=each_var.shape,
+            dtype=each_var.dtype,
+            type=each_var.type,
+            lod_level=each_var.lod_level,
+            persistable=True)
+        file_path = os.path.join(dirname, new_var.name)
+        file_path = os.path.normpath(file_path)
+        save_block.append_op(
+            type='save',
+            inputs={'X': [new_var]},
+            outputs={},
+            attrs={'file_path': file_path})
+    executor.run(save_program)
+def save_checkpoint(exe, program, ckpt_name):
+    """
+    Save checkpoint for evaluation or resume training
+    """
+    ckpt_dir = os.path.join(cfg.TRAIN.MODEL_SAVE_DIR, str(ckpt_name))
+    print("Save model checkpoint to {}".format(ckpt_dir))
+    if not os.path.isdir(ckpt_dir):
+        os.makedirs(ckpt_dir)
+    save_vars(
+        exe,
+        ckpt_dir,
+        program,
+        vars=list(filter(fluid.io.is_persistable, program.list_vars())))
+    return ckpt_dir
+def load_checkpoint(exe, program):
+    """
+    Load checkpoint from the resume model directory to resume training
+    """
+    print('Resume model training from:', cfg.TRAIN.RESUME_MODEL_DIR)
+    if not os.path.exists(cfg.TRAIN.RESUME_MODEL_DIR):
+        raise ValueError("TRAIN.RESUME_MODEL_DIR {} does not exist!".format(
+            cfg.TRAIN.RESUME_MODEL_DIR))
+    fluid.io.load_persistables(
+        exe, cfg.TRAIN.RESUME_MODEL_DIR, main_program=program)
+    model_path = cfg.TRAIN.RESUME_MODEL_DIR
+    # Check whether the path ends with a path separator
+    if model_path[-1] == os.sep:
+        model_path = model_path[0:-1]
+    epoch_name = os.path.basename(model_path)
+    # If resume model is final model
+    if epoch_name == 'final':
+        begin_epoch = cfg.SOLVER.NUM_EPOCHS
+    # If the resume model path ends with a number, restore the epoch status
+    elif epoch_name.isdigit():
+        epoch = int(epoch_name)
+        begin_epoch = epoch + 1
+    else:
+        raise ValueError("Resume model path is not valid!")
+    print("Model checkpoint loaded successfully!")
+    return begin_epoch
+def update_best_model(ckpt_dir):
+    best_model_dir = os.path.join(cfg.TRAIN.MODEL_SAVE_DIR, 'best_model')
+    if os.path.exists(best_model_dir):
+        shutil.rmtree(best_model_dir)
+    shutil.copytree(ckpt_dir, best_model_dir)
+def print_info(*msg):
+    if cfg.TRAINER_ID == 0:
+        print(*msg)
+def train(cfg):
+    # startup_prog = fluid.Program()
+    # train_prog = fluid.Program()
+    drop_last = True
+    dataset = SegDataset(
+        file_list=cfg.DATASET.TRAIN_FILE_LIST,
+        mode=ModelPhase.TRAIN,
+        shuffle=True,
+        data_dir=cfg.DATASET.DATA_DIR)
+    def data_generator():
+        if args.use_mpio:
+            data_gen = dataset.multiprocess_generator(
+                num_processes=cfg.DATALOADER.NUM_WORKERS,
+                max_queue_size=cfg.DATALOADER.BUF_SIZE)
+        else:
+            data_gen = dataset.generator()
+        batch_data = []
+        for b in data_gen:
+            batch_data.append(b)
+            if len(batch_data) == (cfg.BATCH_SIZE // cfg.NUM_TRAINERS):
+                for item in batch_data:
+                    yield item[0], item[1], item[2]
+                batch_data = []
+        # If the sync batch norm strategy is used, drop the last batch when the
+        # number of samples in batch_data is less than cfg.BATCH_SIZE to avoid NCCL hang issues
+        if not cfg.TRAIN.SYNC_BATCH_NORM:
+            for item in batch_data:
+                yield item[0], item[1], item[2]
+    # Get device environment
+    # places = fluid.cuda_places() if args.use_gpu else fluid.cpu_places()
+    # place = places[0]
+    gpu_id = int(os.environ.get('FLAGS_selected_gpus', 0))
+    place = fluid.CUDAPlace(gpu_id) if args.use_gpu else fluid.CPUPlace()
+    places = fluid.cuda_places() if args.use_gpu else fluid.cpu_places()
+    # Get number of GPU
+    dev_count = cfg.NUM_TRAINERS if cfg.NUM_TRAINERS > 1 else len(places)
+    print_info("#Device count: {}".format(dev_count))
+    # Make sure BATCH_SIZE is divisible by the number of GPU cards
+    assert cfg.BATCH_SIZE % dev_count == 0, (
+        'BATCH_SIZE:{} not divisible by number of GPUs:{}'.format(
+            cfg.BATCH_SIZE, dev_count))
+    # In multi-GPU training mode, batch data is allocated evenly to each GPU
+    batch_size_per_dev = cfg.BATCH_SIZE // dev_count
+    print_info("batch_size_per_dev: {}".format(batch_size_per_dev))
+    py_reader, loss, lr, pred, grts, masks, image = build_model(phase=ModelPhase.TRAIN)
+    py_reader.decorate_sample_generator(
+        data_generator, batch_size=batch_size_per_dev, drop_last=drop_last)
+    exe = fluid.Executor(place)
+    cfg.update_from_file(args.teacher_cfg_file)
+    # teacher_arch = teacher_cfg.architecture
+    teacher_program = fluid.Program()
+    teacher_startup_program = fluid.Program()
+    with fluid.program_guard(teacher_program, teacher_startup_program):
+        with fluid.unique_name.guard():
+            _, teacher_loss, _, _, _, _, _ = build_model(
+                teacher_program, teacher_startup_program, phase=ModelPhase.TRAIN, image=image,
+                label=grts, mask=masks)
+    exe.run(teacher_startup_program)
+    teacher_program = teacher_program.clone(for_test=True)
+    ckpt_dir = cfg.SLIM.KNOWLEDGE_DISTILL_TEACHER_MODEL_DIR
+    assert ckpt_dir is not None
+    print('load teacher model:', ckpt_dir)
+    fluid.io.load_params(exe, ckpt_dir, main_program=teacher_program)
+    # cfg = load_config(FLAGS.config)
+    cfg.update_from_file(args.cfg_file)
+    data_name_map = {
+        'image': 'image',
+        'label': 'label',
+        'mask': 'mask',
+    }
+    merge(teacher_program, fluid.default_main_program(), data_name_map, place)
+    distill_pairs = [['teacher_bilinear_interp_2.tmp_0', 'bilinear_interp_0.tmp_0']]
+    def distill(pairs, weight):
+        """
+        Add a weighted L2 distillation loss on the first pair of feature maps
+        (teacher output, student output) listed in `pairs`
+        """
+        loss = l2_loss(pairs[0][0], pairs[0][1])
+        weighted_loss = loss * weight
+        return weighted_loss
+    distill_loss = distill(distill_pairs, 0.1)
+    cfg.update_from_file(args.cfg_file)
+    optimizer = solver.Solver(None, None)
+    all_loss = loss + distill_loss
+    lr = optimizer.optimise(all_loss)
+    exe.run(fluid.default_startup_program())
+    exec_strategy = fluid.ExecutionStrategy()
+    if args.use_gpu:
+        exec_strategy.num_threads = fluid.core.get_cuda_device_count()
+    # Clear temporary variables every 100 iterations
+    exec_strategy.num_iteration_per_drop_scope = 100
+    build_strategy = fluid.BuildStrategy()
+    build_strategy.fuse_all_reduce_ops = False
+    build_strategy.fuse_all_optimizer_ops = False
+    build_strategy.fuse_elewise_add_act_ops = True
+    if cfg.NUM_TRAINERS > 1 and args.use_gpu:
+        dist_utils.prepare_for_multi_process(exe, build_strategy, fluid.default_main_program())
+        exec_strategy.num_threads = 1
+    if cfg.TRAIN.SYNC_BATCH_NORM and args.use_gpu:
+        if dev_count > 1:
+            # Apply sync batch norm strategy
+            print_info("Sync BatchNorm strategy is effective.")
+            build_strategy.sync_batch_norm = True
+        else:
+            print_info(
+                "Sync BatchNorm strategy will not be effective if GPU device"
+                " count <= 1")
+    compiled_train_prog = fluid.CompiledProgram(fluid.default_main_program()).with_data_parallel(
+        loss_name=all_loss.name,
+        exec_strategy=exec_strategy,
+        build_strategy=build_strategy)
+    # Resume training
+    begin_epoch = cfg.SOLVER.BEGIN_EPOCH
+    if cfg.TRAIN.RESUME_MODEL_DIR:
+        begin_epoch = load_checkpoint(exe, fluid.default_main_program())
+    # Load pretrained model
+    elif os.path.exists(cfg.TRAIN.PRETRAINED_MODEL_DIR):
+        print_info('Pretrained model dir: ', cfg.TRAIN.PRETRAINED_MODEL_DIR)
+        load_vars = []
+        load_fail_vars = []
+        def var_shape_matched(var, shape):
+            """
+            Check whether the persistable variable's shape matches the current network
+            """
+            var_exist = os.path.exists(
+                os.path.join(cfg.TRAIN.PRETRAINED_MODEL_DIR, var.name))
+            if var_exist:
+                var_shape = parse_shape_from_file(
+                    os.path.join(cfg.TRAIN.PRETRAINED_MODEL_DIR, var.name))
+                return var_shape == shape
+            return False
+        for x in fluid.default_main_program().list_vars():
+            if isinstance(x, fluid.framework.Parameter):
+                shape = tuple(fluid.global_scope().find_var(
+                    x.name).get_tensor().shape())
+                if var_shape_matched(x, shape):
+                    load_vars.append(x)
+                else:
+                    load_fail_vars.append(x)
+        fluid.io.load_vars(
+            exe, dirname=cfg.TRAIN.PRETRAINED_MODEL_DIR, vars=load_vars)
+        for var in load_vars:
+            print_info("Parameter[{}] loaded successfully!".format(var.name))
+        for var in load_fail_vars:
+            print_info(
+                "Parameter[{}] doesn't exist or its shape does not match the current network, skip"
+                " loading it.".format(var.name))
+        print_info("{}/{} pretrained parameters loaded successfully!".format(
+            len(load_vars),
+            len(load_vars) + len(load_fail_vars)))
+    else:
+        print_info(
+            'Pretrained model dir {} does not exist, training from scratch...'.
+            format(cfg.TRAIN.PRETRAINED_MODEL_DIR))
+    #fetch_list = [avg_loss.name, lr.name]
+    fetch_list = [loss.name, 'teacher_' + teacher_loss.name, distill_loss.name, lr.name]
+    if args.debug:
+        # Fetch more variable info and use streaming confusion matrix to
+        # calculate IoU results if in debug mode
+        np.set_printoptions(
+            precision=4, suppress=True, linewidth=160, floatmode="fixed")
+        fetch_list.extend([pred.name, grts.name, masks.name])
+        cm = ConfusionMatrix(cfg.DATASET.NUM_CLASSES, streaming=True)
+    if args.use_tb:
+        if not args.tb_log_dir:
+            print_info("Please specify the log directory by --tb_log_dir.")
+            exit(1)
+        from tb_paddle import SummaryWriter
+        log_writer = SummaryWriter(args.tb_log_dir)
+    # trainer_id = int(os.getenv("PADDLE_TRAINER_ID", 0))
+    # num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1))
+    global_step = 0
+    all_step = cfg.DATASET.TRAIN_TOTAL_IMAGES // cfg.BATCH_SIZE
+    if cfg.DATASET.TRAIN_TOTAL_IMAGES % cfg.BATCH_SIZE and drop_last != True:
+        all_step += 1
+    all_step *= (cfg.SOLVER.NUM_EPOCHS - begin_epoch + 1)
+    avg_loss = 0.0
+    avg_t_loss = 0.0
+    avg_d_loss = 0.0
+    best_mIoU = 0.0
+    timer = Timer()
+    timer.start()
+    if begin_epoch > cfg.SOLVER.NUM_EPOCHS:
+        raise ValueError(
+            ("begin epoch[{}] is larger than cfg.SOLVER.NUM_EPOCHS[{}]").format(
+                begin_epoch, cfg.SOLVER.NUM_EPOCHS))
+    if args.use_mpio:
+        print_info("Use multiprocess reader")
+    else:
+        print_info("Use multi-thread reader")
+    for epoch in range(begin_epoch, cfg.SOLVER.NUM_EPOCHS + 1):
+        py_reader.start()
+        while True:
+            try:
+                if args.debug:
+                    # Print category IoU and accuracy to check whether the
+                    # training process matches expectations.
+                    # fetch_list holds 7 entries in debug mode, so unpack all of them
+                    loss, t_loss, d_loss, lr, pred, grts, masks = exe.run(
+                        program=compiled_train_prog,
+                        fetch_list=fetch_list,
+                        return_numpy=True)
+                    cm.calculate(pred, grts, masks)
+                    avg_loss += np.mean(np.array(loss))
+                    global_step += 1
+                    if global_step % args.log_steps == 0:
+                        speed = args.log_steps / timer.elapsed_time()
+                        avg_loss /= args.log_steps
+                        category_acc, mean_acc = cm.accuracy()
+                        category_iou, mean_iou = cm.mean_iou()
+                        print_info((
+                            "epoch={} step={} lr={:.5f} loss={:.4f} acc={:.5f} mIoU={:.5f} step/sec={:.3f} | ETA {}"
+                        ).format(epoch, global_step, lr[0], avg_loss, mean_acc,
+                                 mean_iou, speed,
+                                 calculate_eta(all_step - global_step, speed)))
+                        print_info("Category IoU: ", category_iou)
+                        print_info("Category Acc: ", category_acc)
+                        if args.use_tb:
+                            log_writer.add_scalar('Train/mean_iou', mean_iou,
+                                                  global_step)
+                            log_writer.add_scalar('Train/mean_acc', mean_acc,
+                                                  global_step)
+                            log_writer.add_scalar('Train/loss', avg_loss,
+                                                  global_step)
+                            log_writer.add_scalar('Train/lr', lr[0],
+                                                  global_step)
+                            log_writer.add_scalar('Train/step/sec', speed,
+                                                  global_step)
+                        sys.stdout.flush()
+                        avg_loss = 0.0
+                        cm.zero_matrix()
+                        timer.restart()
+                else:
+                    # If not in debug mode, avoid unnecessary logging and computation
+                    loss, t_loss, d_loss, lr = exe.run(
+                        program=compiled_train_prog,
+                        fetch_list=fetch_list,
+                        return_numpy=True)
+                    avg_loss += np.mean(np.array(loss))
+                    avg_t_loss += np.mean(np.array(t_loss))
+                    avg_d_loss += np.mean(np.array(d_loss))
+                    global_step += 1
+                    if global_step % args.log_steps == 0 and cfg.TRAINER_ID == 0:
+                        avg_loss /= args.log_steps
+                        avg_t_loss /= args.log_steps
+                        avg_d_loss /= args.log_steps
+                        speed = args.log_steps / timer.elapsed_time()
+                        print((
+                            "epoch={} step={} lr={:.5f} loss={:.4f} teacher loss={:.4f} distill loss={:.4f} step/sec={:.3f} | ETA {}"
+                        ).format(epoch, global_step, lr[0], avg_loss, avg_t_loss, avg_d_loss, speed,
+                                 calculate_eta(all_step - global_step, speed)))
+                        if args.use_tb:
+                            log_writer.add_scalar('Train/loss', avg_loss,
+                                                  global_step)
+                            log_writer.add_scalar('Train/lr', lr[0],
+                                                  global_step)
+                            log_writer.add_scalar('Train/speed', speed,
+                                                  global_step)
+                        sys.stdout.flush()
+                        avg_loss = 0.0
+                        avg_t_loss = 0.0
+                        avg_d_loss = 0.0
+                        timer.restart()
+            except fluid.core.EOFException:
+                py_reader.reset()
+                break
+            except Exception as e:
+                print(e)
+        if (epoch % cfg.TRAIN.SNAPSHOT_EPOCH == 0
+                or epoch == cfg.SOLVER.NUM_EPOCHS) and cfg.TRAINER_ID == 0:
+            ckpt_dir = save_checkpoint(exe, fluid.default_main_program(), epoch)
+            if args.do_eval:
+                print("Evaluation start")
+                _, mean_iou, _, mean_acc = evaluate(
+                    cfg=cfg,
+                    ckpt_dir=ckpt_dir,
+                    use_gpu=args.use_gpu,
+                    use_mpio=args.use_mpio)
+                if args.use_tb:
+                    log_writer.add_scalar('Evaluate/mean_iou', mean_iou,
+                                          global_step)
+                    log_writer.add_scalar('Evaluate/mean_acc', mean_acc,
+                                          global_step)
+                if mean_iou > best_mIoU:
+                    best_mIoU = mean_iou
+                    update_best_model(ckpt_dir)
+                    print_info("Save best model {} to {}, mIoU = {:.4f}".format(
+                        ckpt_dir,
+                        os.path.join(cfg.TRAIN.MODEL_SAVE_DIR, 'best_model'),
+                        mean_iou))
+            # Use Tensorboard to visualize results
+            if args.use_tb and cfg.DATASET.VIS_FILE_LIST is not None:
+                visualize(
+                    cfg=cfg,
+                    use_gpu=args.use_gpu,
+                    vis_file_list=cfg.DATASET.VIS_FILE_LIST,
+                    vis_dir="visual",
+                    ckpt_dir=ckpt_dir,
+                    log_writer=log_writer)
+        if cfg.TRAINER_ID == 0:
+            ckpt_dir = save_checkpoint(exe, fluid.default_main_program(), epoch)
+    # save final model
+    if cfg.TRAINER_ID == 0:
+        save_checkpoint(exe, fluid.default_main_program(), 'final')
+def main(args):
+    if args.cfg_file is not None:
+        cfg.update_from_file(args.cfg_file)
+    if args.opts:
+        cfg.update_from_list(args.opts)
+    if args.enable_ce:
+        random.seed(0)
+        np.random.seed(0)
+    cfg.TRAINER_ID = int(os.getenv("PADDLE_TRAINER_ID", 0))
+    cfg.NUM_TRAINERS = int(os.environ.get('PADDLE_TRAINERS_NUM', 1))
+    cfg.check_and_infer()
+    print_info(pprint.pformat(cfg))
+    train(cfg)
+if __name__ == '__main__':
+    args = parse_args()
+    if not fluid.core.is_compiled_with_cuda() and args.use_gpu:
+        print(
+            "You cannot set use_gpu = True in the model because you are using paddlepaddle-cpu."
+        )
+        print(
+            "Please: 1. Install paddlepaddle-gpu to run your models on GPU or 2. Set use_gpu=False to run models on CPU."
+        )
+        sys.exit(1)
+    main(args)
--- a/slim/nas/README.md
+++ b/slim/nas/README.md
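+> Install Paddle 1.6 or later before running this example.
+# PaddleSeg Neural Architecture Search (NAS) Example
+Before reading this tutorial, make sure you have read the [PaddleSeg usage guide](../../docs/usage.md) and related chapters so that you are familiar with PaddleSeg.
+This document describes how to use [PaddleSlim](https://paddlepaddle.github.io/PaddleSlim) to run architecture search on the models in the segmentation library.
+Unless otherwise noted, all commands in this tutorial are run from the `PaddleSeg/` directory.
+## Overview
+We use the DeepLab + MobileNetV2 model as the NAS example. The example relies on [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim)
+to run the search experiment. For technical details, see [Neural architecture search strategy](https://github.com/PaddlePaddle/PaddleSlim/blob/4670a79343c191b61a78e416826d122eea52a7ab/docs/zh_cn/tutorials/image_classification_nas_quick_start.ipynb).
+## Define the search space
+The experiment uses SANAS to search over the channel counts and convolution kernel sizes of the network.
+We therefore define the following search space:
+- Head channels `head_num`: the range of channel counts for the MobileNetV2 head module;
+- inverse_res_block1-6 `filter_num1-6`: the range of channel counts for each inverse_res_block module;
+- inverse_res_block `repeat`: the number of units in each MobileNetV2 inverse_res_block module;
+- inverse_res_block `multiply`: the range of the expansion_factor in each MobileNetV2 inverse_res_block module;
+- Kernel size `k_size`: whether the MobileNetV2 convolutions use a 3x3 or a 5x5 kernel.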
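+Given these ranges, the search-space tokens have 25 positions (1 for the head plus 4 for each of the 6 blocks), and they vary within ([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [7, 5, 8, 6, 2, 5, 8, 6, 2, 5, 8, 6, 2, 5, 10, 6, 2, 5, 10, 6, 2, 5, 12, 6, 2]). The upper bounds are the number of choices at each position, so the largest valid index is one below each bound.  
+The initial tokens, as defined by `init_tokens` in `mobilenetv2_search_space.py`, are: [4, 4, 5, 1, 0, 4, 4, 2, 0, 4, 4, 3, 0, 4, 5, 2, 0, 4, 7, 2, 0, 4, 9, 0, 0].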
+## Start the search
+First install PaddleSlim; see the [installation guide](https://paddlepaddle.github.io/PaddleSlim/#_2).
+Then configure the PaddleSeg config. Only the NAS-related settings are shown below:
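+```yaml
+SLIM:
+    NAS_PORT: 23333 # port
+    NAS_ADDRESS: "" # IP address; leave empty when running as the server, set it to the server's IP when running as a client
+    NAS_SEARCH_STEPS: 100 # number of architectures to search
+    NAS_START_EVAL_EPOCH: -1 # epoch at which model evaluation starts
+    NAS_IS_SERVER: True # whether this process is the server
+    NAS_SPACE_NAME: "MobileNetV2SpaceSeg" # search space
+```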
+## Training and evaluation
+Run the following command to train and evaluate at the same time:
+```shell
+CUDA_VISIBLE_DEVICES=0 python -u ./slim/nas/train_nas.py --log_steps 10 --cfg configs/deeplabv3p_mobilenetv2_cityscapes.yaml --use_gpu --use_mpio \
+SLIM.NAS_PORT 23333 \
+SLIM.NAS_ADDRESS "" \
+SLIM.NAS_SEARCH_STEPS 2 \
+SLIM.NAS_START_EVAL_EPOCH -1 \
+SLIM.NAS_IS_SERVER True \
+SLIM.NAS_SPACE_NAME "MobileNetV2SpaceSeg"
+```
+## FAQ
+- Runtime error: `socket.error: [Errno 98] Address already in use`.
+Solution: the current port is already in use; change the `SLIM.NAS_PORT` port.
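Reviewer note on the search-space section of `slim/nas/README.md`: the token layout is easier to check with a small decoder. The sketch below is not part of the patch. It copies the choice tables from `MobileNetV2SpaceSeg.__init__` in this diff, and `decode_tokens` is a hypothetical helper name introduced only for illustration. It assumes the layout used by `token2arch`: one head token followed by six groups of `[multiply, filter_num, repeat, k_size]` indices.

```python
# Choice tables copied from MobileNetV2SpaceSeg.__init__ in this diff.
HEAD_NUM = [3, 4, 8, 12, 16, 24, 32]
FILTER_NUMS = [
    [3, 4, 8, 12, 16, 24, 32, 48],
    [8, 12, 16, 24, 32, 48, 64, 80],
    [16, 24, 32, 48, 64, 80, 96, 128],
    [24, 32, 48, 64, 80, 96, 128, 144, 160, 192],
    [32, 48, 64, 80, 96, 128, 144, 160, 192, 224],
    [64, 80, 96, 128, 144, 160, 192, 224, 256, 320, 384, 512],
]
K_SIZE = [3, 5]
MULTIPLY = [1, 2, 3, 4, 6]
REPEAT = [1, 2, 3, 4, 5, 6]


def decode_tokens(tokens):
    """Hypothetical helper: map 25 token indices to readable settings."""
    if len(tokens) != 1 + 4 * len(FILTER_NUMS):
        raise ValueError("expected 25 tokens")
    head = HEAD_NUM[tokens[0]]
    blocks = []
    for b, filters in enumerate(FILTER_NUMS):
        # Each block owns four consecutive tokens after the head token.
        m, f, r, k = tokens[1 + 4 * b:5 + 4 * b]
        blocks.append({
            "expansion": MULTIPLY[m],
            "channels": filters[f],
            "repeat": REPEAT[r],
            "kernel": K_SIZE[k],
        })
    return head, blocks


# Initial tokens as returned by init_tokens() in this diff.
INIT_TOKENS = [4,
               4, 5, 1, 0,
               4, 4, 2, 0,
               4, 4, 3, 0,
               4, 5, 2, 0,
               4, 7, 2, 0,
               4, 9, 0, 0]

head, blocks = decode_tokens(INIT_TOKENS)
print("head channels:", head)
for block in blocks:
    print(block)
```

Decoding the initial tokens this way gives the stock MobileNetV2 block settings listed in the comments of `init_tokens`, which is a quick way to confirm that a hand-edited token list is well formed.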
--- a/slim/nas/deeplab.py
+++ b/slim/nas/deeplab.py
+# coding: utf8
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import contextlib
+import paddle
+import paddle.fluid as fluid
+from utils.config import cfg
+from models.libs.model_libs import scope, name_scope
+from models.libs.model_libs import bn, bn_relu, relu
+from models.libs.model_libs import conv
+from models.libs.model_libs import separate_conv
+from models.backbone.mobilenet_v2 import MobileNetV2 as mobilenet_backbone
+from models.backbone.xception import Xception as xception_backbone
+def encoder(input):
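+    # Encoder configuration: ASPP architecture with pooling + 1x1 conv + three atrous convolutions at different rates in parallel, followed by a 1x1 conv after concat
+    # ASPP_WITH_SEP_CONV: defaults to True and uses depthwise separable convolutions; otherwise regular convolutions are used
+    # OUTPUT_STRIDE: downsampling factor, 8 or 16; determines the aspp_ratios values
+    # aspp_ratios: atrous (dilation) rates of the ASPP convolutions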
+    if cfg.MODEL.DEEPLAB.OUTPUT_STRIDE == 16:
+        aspp_ratios = [6, 12, 18]
+    elif cfg.MODEL.DEEPLAB.OUTPUT_STRIDE == 8:
+        aspp_ratios = [12, 24, 36]
+    else:
+        raise Exception("deeplab only supports output stride 8 or 16")
+    param_attr = fluid.ParamAttr(
+        name=name_scope + 'weights',
+        regularizer=None,
+        initializer=fluid.initializer.TruncatedNormal(loc=0.0, scale=0.06))
+    with scope('encoder'):
+        channel = 256
+        with scope("image_pool"):
+            image_avg = fluid.layers.reduce_mean(
+                input, [2, 3], keep_dim=True)
+            image_avg = bn_relu(
+                conv(
+                    image_avg,
+                    channel,
+                    1,
+                    1,
+                    groups=1,
+                    padding=0,
+                    param_attr=param_attr))
+            image_avg = fluid.layers.resize_bilinear(image_avg, input.shape[2:])
+        with scope("aspp0"):
+            aspp0 = bn_relu(
+                conv(
+                    input,
+                    channel,
+                    1,
+                    1,
+                    groups=1,
+                    padding=0,
+                    param_attr=param_attr))
+        with scope("aspp1"):
+            if cfg.MODEL.DEEPLAB.ASPP_WITH_SEP_CONV:
+                aspp1 = separate_conv(
+                    input, channel, 1, 3, dilation=aspp_ratios[0], act=relu)
+            else:
+                aspp1 = bn_relu(
+                    conv(
+                        input,
+                        channel,
+                        stride=1,
+                        filter_size=3,
+                        dilation=aspp_ratios[0],
+                        padding=aspp_ratios[0],
+                        param_attr=param_attr))
+        with scope("aspp2"):
+            if cfg.MODEL.DEEPLAB.ASPP_WITH_SEP_CONV:
+                aspp2 = separate_conv(
+                    input, channel, 1, 3, dilation=aspp_ratios[1], act=relu)
+            else:
+                aspp2 = bn_relu(
+                    conv(
+                        input,
+                        channel,
+                        stride=1,
+                        filter_size=3,
+                        dilation=aspp_ratios[1],
+                        padding=aspp_ratios[1],
+                        param_attr=param_attr))
+        with scope("aspp3"):
+            if cfg.MODEL.DEEPLAB.ASPP_WITH_SEP_CONV:
+                aspp3 = separate_conv(
+                    input, channel, 1, 3, dilation=aspp_ratios[2], act=relu)
+            else:
+                aspp3 = bn_relu(
+                    conv(
+                        input,
+                        channel,
+                        stride=1,
+                        filter_size=3,
+                        dilation=aspp_ratios[2],
+                        padding=aspp_ratios[2],
+                        param_attr=param_attr))
+        with scope("concat"):
+            data = fluid.layers.concat([image_avg, aspp0, aspp1, aspp2, aspp3],
+                                       axis=1)
+            data = bn_relu(
+                conv(
+                    data,
+                    channel,
+                    1,
+                    1,
+                    groups=1,
+                    padding=0,
+                    param_attr=param_attr))
+            data = fluid.layers.dropout(data, 0.9)
+        return data
+def decoder(encode_data, decode_shortcut):
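+    # Decoder configuration
+    # encode_data: encoder output
+    # decode_shortcut: branch taken from the backbone; encode_data is resized to its spatial size and the two are concatenated
+    # DECODER_USE_SEP_CONV: defaults to True, in which case two separable convolutions follow the concat; otherwise regular convolutions are used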
+    param_attr = fluid.ParamAttr(
+        name=name_scope + 'weights',
+        regularizer=None,
+        initializer=fluid.initializer.TruncatedNormal(loc=0.0, scale=0.06))
+    with scope('decoder'):
+        with scope('concat'):
+            decode_shortcut = bn_relu(
+                conv(
+                    decode_shortcut,
+                    48,
+                    1,
+                    1,
+                    groups=1,
+                    padding=0,
+                    param_attr=param_attr))
+            encode_data = fluid.layers.resize_bilinear(
+                encode_data, decode_shortcut.shape[2:])
+            encode_data = fluid.layers.concat([encode_data, decode_shortcut],
+                                              axis=1)
+        if cfg.MODEL.DEEPLAB.DECODER_USE_SEP_CONV:
+            with scope("separable_conv1"):
+                encode_data = separate_conv(
+                    encode_data, 256, 1, 3, dilation=1, act=relu)
+            with scope("separable_conv2"):
+                encode_data = separate_conv(
+                    encode_data, 256, 1, 3, dilation=1, act=relu)
+        else:
+            with scope("decoder_conv1"):
+                encode_data = bn_relu(
+                    conv(
+                        encode_data,
+                        256,
+                        stride=1,
+                        filter_size=3,
+                        dilation=1,
+                        padding=1,
+                        param_attr=param_attr))
+            with scope("decoder_conv2"):
+                encode_data = bn_relu(
+                    conv(
+                        encode_data,
+                        256,
+                        stride=1,
+                        filter_size=3,
+                        dilation=1,
+                        padding=1,
+                        param_attr=param_attr))
+        return encode_data
+def nas_backbone(input, arch):
+    # scale = cfg.MODEL.DEEPLAB.DEPTH_MULTIPLIER
+    # output_stride = cfg.MODEL.DEEPLAB.OUTPUT_STRIDE
+    # model = mobilenet_backbone(scale=scale, output_stride=output_stride)
+    end_points = 8
+    decode_point = 3
+    data, decode_shortcuts = arch(
+        input, end_points=end_points, return_block=decode_point, output_stride=16)
+    decode_shortcut = decode_shortcuts[decode_point]
+    return data, decode_shortcut
+def deeplabv3p_nas(img, num_classes, arch=None):
+    data, decode_shortcut = nas_backbone(img, arch)
+    # Encoder and decoder settings
+    cfg.MODEL.DEFAULT_EPSILON = 1e-5
+    if cfg.MODEL.DEEPLAB.ENCODER_WITH_ASPP:
+        data = encoder(data)
+    if cfg.MODEL.DEEPLAB.ENABLE_DECODER:
+        data = decoder(data, decode_shortcut)
+    # Set the output of the final conv layer according to the number of classes, then resize to the original image size
+    param_attr = fluid.ParamAttr(
+        name=name_scope + 'weights',
+        regularizer=fluid.regularizer.L2DecayRegularizer(
+            regularization_coeff=0.0),
+        initializer=fluid.initializer.TruncatedNormal(loc=0.0, scale=0.01))
+    with scope('logit'):
+        logit = conv(
+            data,
+            num_classes,
+            1,
+            stride=1,
+            padding=0,
+            bias_attr=True,
+            param_attr=param_attr)
+        logit = fluid.layers.resize_bilinear(logit, img.shape[2:])
+    return logit
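Reviewer note on `encoder` in `slim/nas/deeplab.py`: in the non-separable ASPP branches, `padding` is set equal to `dilation`. With a 3x3 kernel and stride 1 this keeps the spatial size unchanged, which is what allows the five branches to be concatenated on the channel axis. The sketch below checks this with the standard convolution output-size formula. `conv_out_size` is a hypothetical helper, and the feature-map size of 65 is an arbitrary value chosen for illustration.

```python
def conv_out_size(size, kernel, stride=1, padding=0, dilation=1):
    """Output length of a convolution along one spatial axis."""
    # A dilated kernel covers dilation * (kernel - 1) + 1 input positions.
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


feature_size = 65  # arbitrary example size, not taken from the patch

# ASPP rates used in encoder(): [6, 12, 18] for stride 16, [12, 24, 36] for stride 8.
for rate in (6, 12, 18, 24, 36):
    same = conv_out_size(feature_size, kernel=3, padding=rate, dilation=rate)
    shrunk = conv_out_size(feature_size, kernel=3, padding=0, dilation=rate)
    print(rate, same, shrunk)
```

With `padding == dilation` every rate returns the input size, while without padding the output shrinks by `2 * rate`, so the branches would no longer line up.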
--- a/slim/nas/eval_nas.py
+++ b/slim/nas/eval_nas.py
+# coding: utf8
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import os
+# GPU memory garbage collection optimization flags
+os.environ['FLAGS_eager_delete_tensor_gb'] = "0.0"
+import sys
+LOCAL_PATH = os.path.dirname(os.path.abspath(__file__))
+SEG_PATH = os.path.join(LOCAL_PATH, "../../", "pdseg")
+sys.path.append(SEG_PATH)
+import time
+import argparse
+import functools
+import pprint
+import cv2
+import numpy as np
+import paddle
+import paddle.fluid as fluid
+from utils.config import cfg
+from utils.timer import Timer, calculate_eta
+from model_builder import build_model
+from model_builder import ModelPhase
+from reader import SegDataset
+from metrics import ConfusionMatrix
+from mobilenetv2_search_space import MobileNetV2SpaceSeg
+def parse_args():
+    parser = argparse.ArgumentParser(description='PaddleSeg model evaluation')
+    parser.add_argument(
+        '--cfg',
+        dest='cfg_file',
+        help='Config file for training (and optionally testing)',
+        default=None,
+        type=str)
+    parser.add_argument(
+        '--use_gpu',
+        dest='use_gpu',
+        help='Use gpu or cpu',
+        action='store_true',
+        default=False)
+    parser.add_argument(
+        '--use_mpio',
+        dest='use_mpio',
+        help='Use multiprocess IO or not',
+        action='store_true',
+        default=False)
+    parser.add_argument(
+        'opts',
+        help='See utils/config.py for all options',
+        default=None,
+        nargs=argparse.REMAINDER)
+    if len(sys.argv) == 1:
+        parser.print_help()
+        sys.exit(1)
+    return parser.parse_args()
+def evaluate(cfg, ckpt_dir=None, use_gpu=False, use_mpio=False, **kwargs):
+    np.set_printoptions(precision=5, suppress=True)
+    startup_prog = fluid.Program()
+    test_prog = fluid.Program()
+    dataset = SegDataset(
+        file_list=cfg.DATASET.VAL_FILE_LIST,
+        mode=ModelPhase.EVAL,
+        data_dir=cfg.DATASET.DATA_DIR)
+    def data_generator():
+        # TODO: check whether the batch reader is compatible with Windows
+        if use_mpio:
+            data_gen = dataset.multiprocess_generator(
+                num_processes=cfg.DATALOADER.NUM_WORKERS,
+                max_queue_size=cfg.DATALOADER.BUF_SIZE)
+        else:
+            data_gen = dataset.generator()
+        for b in data_gen:
+            yield b[0], b[1], b[2]
+    py_reader, avg_loss, pred, grts, masks = build_model(
+        test_prog, startup_prog, phase=ModelPhase.EVAL, arch=kwargs['arch'])
+    py_reader.decorate_sample_generator(
+        data_generator, drop_last=False, batch_size=cfg.BATCH_SIZE)
+    # Get device environment
+    places = fluid.cuda_places() if use_gpu else fluid.cpu_places()
+    place = places[0]
+    dev_count = len(places)
+    print("#Device count: {}".format(dev_count))
+    exe = fluid.Executor(place)
+    exe.run(startup_prog)
+    test_prog = test_prog.clone(for_test=True)
+    ckpt_dir = cfg.TEST.TEST_MODEL if not ckpt_dir else ckpt_dir
+    if not os.path.exists(ckpt_dir):
+        raise ValueError('The TEST.TEST_MODEL {} is not found'.format(ckpt_dir))
+    if ckpt_dir is not None:
+        print('load test model:', ckpt_dir)
+        fluid.io.load_params(exe, ckpt_dir, main_program=test_prog)
+    # Use streaming confusion matrix to calculate mean_iou
+    np.set_printoptions(
+        precision=4, suppress=True, linewidth=160, floatmode="fixed")
+    conf_mat = ConfusionMatrix(cfg.DATASET.NUM_CLASSES, streaming=True)
+    fetch_list = [avg_loss.name, pred.name, grts.name, masks.name]
+    num_images = 0
+    step = 0
+    all_step = cfg.DATASET.TEST_TOTAL_IMAGES // cfg.BATCH_SIZE + 1
+    timer = Timer()
+    timer.start()
+    py_reader.start()
+    while True:
+        try:
+            step += 1
+            loss, pred, grts, masks = exe.run(
+                test_prog, fetch_list=fetch_list, return_numpy=True)
+            loss = np.mean(np.array(loss))
+            num_images += pred.shape[0]
+            conf_mat.calculate(pred, grts, masks)
+            _, iou = conf_mat.mean_iou()
+            _, acc = conf_mat.accuracy()
+            speed = 1.0 / timer.elapsed_time()
+            print(
+                "[EVAL]step={} loss={:.5f} acc={:.4f} IoU={:.4f} step/sec={:.2f} | ETA {}"
+                .format(step, loss, acc, iou, speed,
+                        calculate_eta(all_step - step, speed)))
+            timer.restart()
+            sys.stdout.flush()
+        except fluid.core.EOFException:
+            break
+    category_iou, avg_iou = conf_mat.mean_iou()
+    category_acc, avg_acc = conf_mat.accuracy()
+    print("[EVAL]#image={} acc={:.4f} IoU={:.4f}".format(
+        num_images, avg_acc, avg_iou))
+    print("[EVAL]Category IoU:", category_iou)
+    print("[EVAL]Category Acc:", category_acc)
+    print("[EVAL]Kappa:{:.4f}".format(conf_mat.kappa()))
+    return category_iou, avg_iou, category_acc, avg_acc
+def main():
+    args = parse_args()
+    if args.cfg_file is not None:
+        cfg.update_from_file(args.cfg_file)
+    if args.opts:
+        cfg.update_from_list(args.opts)
+    cfg.check_and_infer()
+    print(pprint.pformat(cfg))
+    evaluate(cfg, **args.__dict__)
+if __name__ == '__main__':
+    main()
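Reviewer note on `evaluate` in `slim/nas/eval_nas.py`: the loop feeds each batch into a streaming confusion matrix and reads the running mean IoU after every step. The sketch below illustrates that idea in plain Python. It is not the `metrics.ConfusionMatrix` class used by the patch; `StreamingConfusion` and its methods are hypothetical, pixels are given as flat lists, and counting a class with no pixels as IoU 0 is an assumption made here.

```python
class StreamingConfusion:
    """Hypothetical streaming confusion matrix over flat pixel lists."""

    def __init__(self, num_classes):
        self.num_classes = num_classes
        # mat[g][p] counts pixels with ground truth g predicted as p.
        self.mat = [[0] * num_classes for _ in range(num_classes)]

    def update(self, preds, labels, masks):
        # Accumulate one batch; pixels with mask 0 are ignored.
        for p, g, m in zip(preds, labels, masks):
            if m:
                self.mat[g][p] += 1

    def mean_iou(self):
        ious = []
        for c in range(self.num_classes):
            tp = self.mat[c][c]
            fn = sum(self.mat[c]) - tp
            fp = sum(row[c] for row in self.mat) - tp
            denom = tp + fp + fn
            # Assumption: a class that never appears contributes IoU 0.
            ious.append(tp / denom if denom else 0.0)
        return ious, sum(ious) / self.num_classes


conf = StreamingConfusion(num_classes=2)
conf.update(preds=[0, 0, 1, 1], labels=[0, 1, 1, 1], masks=[1, 1, 1, 1])
conf.update(preds=[1, 0], labels=[1, 0], masks=[1, 0])  # last pixel ignored
category_iou, avg_iou = conf.mean_iou()
print(category_iou, avg_iou)
```

Because the matrix only accumulates counts, the final metric does not depend on how the images are split into batches, which is why the evaluation loop can report a running IoU and still return the exact value at the end.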
--- a/slim/nas/mobilenetv2_search_space.py
+++ b/slim/nas/mobilenetv2_search_space.py
+# Copyright (c) 2019  PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import numpy as np
+import paddle.fluid as fluid
+from paddle.fluid.param_attr import ParamAttr
+from paddleslim.nas.search_space.search_space_base import SearchSpaceBase
+from paddleslim.nas.search_space.base_layer import conv_bn_layer
+from paddleslim.nas.search_space.search_space_registry import SEARCHSPACE
+from paddleslim.nas.search_space.utils import check_points
+__all__ = ["MobileNetV2SpaceSeg"]
+@SEARCHSPACE.register
+class MobileNetV2SpaceSeg(SearchSpaceBase):
+    def __init__(self, input_size, output_size, block_num, block_mask=None):
+        super(MobileNetV2SpaceSeg, self).__init__(input_size, output_size,
+                                               block_num, block_mask)
+        # self.head_num means the first convolution channel
+        self.head_num = np.array([3, 4, 8, 12, 16, 24, 32])  #7
+        # self.filter_num1 ~ self.filter_num6 are the channel counts of the following convolutions
+        self.filter_num1 = np.array([3, 4, 8, 12, 16, 24, 32, 48])  #8
+        self.filter_num2 = np.array([8, 12, 16, 24, 32, 48, 64, 80])  #8
+        self.filter_num3 = np.array([16, 24, 32, 48, 64, 80, 96, 128])  #8
+        self.filter_num4 = np.array(
+            [24, 32, 48, 64, 80, 96, 128, 144, 160, 192])  #10
+        self.filter_num5 = np.array(
+            [32, 48, 64, 80, 96, 128, 144, 160, 192, 224])  #10
+        self.filter_num6 = np.array(
+            [64, 80, 96, 128, 144, 160, 192, 224, 256, 320, 384, 512])  #12
+        # self.k_size means kernel size
+        self.k_size = np.array([3, 5])  #2
+        # self.multiply means expansion_factor of each _inverted_residual_unit
+        self.multiply = np.array([1, 2, 3, 4, 6])  #5
+        # self.repeat is the number of _inverted_residual_unit repeats in each _invresi_blocks
+        self.repeat = np.array([1, 2, 3, 4, 5, 6])  #6
+    def init_tokens(self):
+        """
+        The initial tokens.
+        The first one is the index of the first layer's channel count in self.head_num;
+        each following line holds the indices of [expansion_factor, filter_num, repeat_num, kernel_size].
+        """
+        # original MobileNetV2
+        # yapf: disable
+        init_token_base =  [4,          # 1, 16, 1
+                4, 5, 1, 0, # 6, 24, 2
+                4, 4, 2, 0, # 6, 32, 3
+                4, 4, 3, 0, # 6, 64, 4
+                4, 5, 2, 0, # 6, 96, 3
+                4, 7, 2, 0, # 6, 160, 3
+                4, 9, 0, 0] # 6, 320, 1
+        # yapf: enable
+        return init_token_base
+    def range_table(self):
+        """
+        Get the range table of the current search space, which constrains the range of each token.
+        """
+        # head_num + 6 * [multiply(expansion_factor), filter_num, repeat, kernel_size]
+        # yapf: disable
+        range_table_base =  [len(self.head_num),
+                len(self.multiply), len(self.filter_num1), len(self.repeat), len(self.k_size),
+                len(self.multiply), len(self.filter_num2), len(self.repeat), len(self.k_size),
+                len(self.multiply), len(self.filter_num3), len(self.repeat), len(self.k_size),
+                len(self.multiply), len(self.filter_num4), len(self.repeat), len(self.k_size),
+                len(self.multiply), len(self.filter_num5), len(self.repeat), len(self.k_size),
+                len(self.multiply), len(self.filter_num6), len(self.repeat), len(self.k_size)]
+        # yapf: enable
+        return range_table_base
+    def token2arch(self, tokens=None):
+        """
+        return net_arch function
+        """
+        if tokens is None:
+            tokens = self.init_tokens()
+        self.bottleneck_params_list = []
+        self.bottleneck_params_list.append(
+            (1, self.head_num[tokens[0]], 1, 1, 3))
+        self.bottleneck_params_list.append(
+            (self.multiply[tokens[1]], self.filter_num1[tokens[2]],
+             self.repeat[tokens[3]], 2, self.k_size[tokens[4]]))
+        self.bottleneck_params_list.append(
+            (self.multiply[tokens[5]], self.filter_num2[tokens[6]],
+             self.repeat[tokens[7]], 2, self.k_size[tokens[8]]))
+        self.bottleneck_params_list.append(
+            (self.multiply[tokens[9]], self.filter_num3[tokens[10]],
+             self.repeat[tokens[11]], 2, self.k_size[tokens[12]]))
+        self.bottleneck_params_list.append(
+            (self.multiply[tokens[13]], self.filter_num4[tokens[14]],
+             self.repeat[tokens[15]], 1, self.k_size[tokens[16]]))
+        self.bottleneck_params_list.append(
+            (self.multiply[tokens[17]], self.filter_num5[tokens[18]],
+             self.repeat[tokens[19]], 2, self.k_size[tokens[20]]))
+        self.bottleneck_params_list.append(
+            (self.multiply[tokens[21]], self.filter_num6[tokens[22]],
+             self.repeat[tokens[23]], 1, self.k_size[tokens[24]]))
+        def _modify_bottle_params(output_stride=None):
+            if output_stride is not None and output_stride % 2 != 0:
+                raise Exception("output stride must be an even number")
+            if output_stride is None:
+                return
+            else:
+                stride = 2
+                for i, layer_setting in enumerate(self.bottleneck_params_list):
+                    t, c, n, s, ks = layer_setting
+                    stride = stride * s
+                    if stride > output_stride:
+                        s = 1
+                    self.bottleneck_params_list[i] = (t, c, n, s, ks)
+        def net_arch(input,
+                     scale=1.0,
+                     return_block=None,
+                     end_points=None,
+                     output_stride=None):
+            self.scale = scale
+            _modify_bottle_params(output_stride)
+            decode_ends = dict()
+            def check_points(count, points):
+                if points is None:
+                    return False
+                else:
+                    if isinstance(points, list):
+                        return (True if count in points else False)
+                    else:
+                        return (True if count == points else False)
+            #conv1
+            # all padding in conv2d is 'SAME', so the actual padding is computed automatically.
+            input = conv_bn_layer(
+                input,
+                num_filters=int(32 * self.scale),
+                filter_size=3,
+                stride=2,
+                padding='SAME',
+                act='relu6',
+                name='mobilenetv2_conv1')
+            layer_count = 1
+            depthwise_output = None
+            # bottleneck sequences
+            in_c = int(32 * self.scale)
+            for i, layer_setting in enumerate(self.bottleneck_params_list):
+                t, c, n, s, k = layer_setting
+                layer_count += 1
+                ### return_block and end_points refer to block indices
+                if check_points((layer_count - 1), return_block):
+                    decode_ends[layer_count - 1] = depthwise_output
+                if check_points((layer_count - 1), end_points):
+                    return input, decode_ends
+                input, depthwise_output = self._invresi_blocks(
+                    input=input,
+                    in_c=in_c,
+                    t=t,
+                    c=int(c * self.scale),
+                    n=n,
+                    s=s,
+                    k=int(k),
+                    name='mobilenetv2_conv' + str(i))
+                in_c = int(c * self.scale)
+            ### return_block and end_points refer to block indices
+            if check_points(layer_count, return_block):
+                decode_ends[layer_count] = depthwise_output
+            if check_points(layer_count, end_points):
+                return input, decode_ends
+            # last conv
+            input = conv_bn_layer(
+                input=input,
+                num_filters=int(1280 * self.scale)
+                if self.scale > 1.0 else 1280,
+                filter_size=1,
+                stride=1,
+                padding='SAME',
+                act='relu6',
+                name='mobilenetv2_conv' + str(i + 1))
+            input = fluid.layers.pool2d(
+                input=input,
+                pool_type='avg',
+                global_pooling=True,
+                name='mobilenetv2_last_pool')
+            return input
+        return net_arch
+    def _shortcut(self, input, data_residual):
+        """Build shortcut layer.
+        Args:
+            input(Variable): input.
+            data_residual(Variable): residual layer.
+        Returns:
+            Variable, layer output.
+        """
+        return fluid.layers.elementwise_add(input, data_residual)
+    def _inverted_residual_unit(self,
+                                input,
+                                num_in_filter,
+                                num_filters,
+                                ifshortcut,
+                                stride,
+                                filter_size,
+                                expansion_factor,
+                                reduction_ratio=4,
+                                name=None):
+        """Build inverted residual unit.
+        Args:
+            input(Variable), input.
+            num_in_filter(int), number of in filters.
+            num_filters(int), number of filters.
+            ifshortcut(bool), whether using shortcut.
+            stride(int), stride.
+            filter_size(int), filter size.
+            expansion_factor(float), expansion factor.
+            reduction_ratio(int), reduction ratio (not used in this function).
+            name(str), name.
+        Returns:
+            Variable, layers output.
+        """
+        num_expfilter = int(round(num_in_filter * expansion_factor))
+        channel_expand = conv_bn_layer(
+            input=input,
+            num_filters=num_expfilter,
+            filter_size=1,
+            stride=1,
+            padding='SAME',
+            num_groups=1,
+            act='relu6',
+            name=name + '_expand')
+        bottleneck_conv = conv_bn_layer(
+            input=channel_expand,
+            num_filters=num_expfilter,
+            filter_size=filter_size,
+            stride=stride,
+            padding='SAME',
+            num_groups=num_expfilter,
+            act='relu6',
+            name=name + '_dwise',
+            use_cudnn=False)
+        depthwise_output = bottleneck_conv
+        linear_out = conv_bn_layer(
+            input=bottleneck_conv,
+            num_filters=num_filters,
+            filter_size=1,
+            stride=1,
+            padding='SAME',
+            num_groups=1,
+            act=None,
+            name=name + '_linear')
+        out = linear_out
+        if ifshortcut:
+            out = self._shortcut(input=input, data_residual=out)
+        return out, depthwise_output
+    def _invresi_blocks(self, input, in_c, t, c, n, s, k, name=None):
+        """Build inverted residual blocks.
+        Args:
+            input: Variable, input.
+            in_c: int, number of in filters.
+            t: float, expansion factor.
+            c: int, number of filters.
+            n: int, number of layers.
+            s: int, stride.
+            k: int, filter size.
+            name: str, name.
+        Returns:
+            Variable, layers output.
+        """
+        first_block, depthwise_output = self._inverted_residual_unit(
+            input=input,
+            num_in_filter=in_c,
+            num_filters=c,
+            ifshortcut=False,
+            stride=s,
+            filter_size=k,
+            expansion_factor=t,
+            name=name + '_1')
+        last_residual_block = first_block
+        last_c = c
+        for i in range(1, n):
+            last_residual_block, depthwise_output = self._inverted_residual_unit(
+                input=last_residual_block,
+                num_in_filter=last_c,
+                num_filters=c,
+                ifshortcut=True,
+                stride=1,
+                filter_size=k,
+                expansion_factor=t,
+                name=name + '_' + str(i + 1))
+        return last_residual_block, depthwise_output
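Reviewer note on `_modify_bottle_params` in `slim/nas/mobilenetv2_search_space.py`: the function walks the block list, multiplies a running stride (starting at 2 for the stem convolution) by each block's stride, and forces a block's stride to 1 once the running product would exceed `output_stride`. The sketch below restates that logic on a plain list of strides. `clamp_strides` is a hypothetical name used only here, and unlike the method in the patch it returns a new list instead of mutating `self.bottleneck_params_list`.

```python
def clamp_strides(strides, output_stride, initial_stride=2):
    """Hypothetical restatement of the stride clamping in _modify_bottle_params."""
    if output_stride is None:
        return list(strides)
    if output_stride % 2 != 0:
        raise ValueError("output stride must be an even number")
    total = initial_stride  # the stem conv already downsamples by 2
    result = []
    for s in strides:
        # As in the patch, the running product uses the original stride
        # even when the block itself is clamped to stride 1.
        total *= s
        result.append(1 if total > output_stride else s)
    return result


# Per-block strides hard-coded in token2arch() in this diff.
block_strides = [1, 2, 2, 2, 1, 2, 1]
print(clamp_strides(block_strides, 16))
print(clamp_strides(block_strides, 8))
```

For `output_stride=16`, the value passed by `nas_backbone`, only the last stride-2 block is clamped, so the backbone output stays at 1/16 of the input resolution.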
--- a/slim/nas/model_builder.py
+++ b/slim/nas/model_builder.py
+# coding: utf8
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import struct
+import paddle.fluid as fluid
+import numpy as np
+from paddle.fluid.proto.framework_pb2 import VarType
+import solver
+from utils.config import cfg
+from loss import multi_softmax_with_loss
+from loss import multi_dice_loss
+from loss import multi_bce_loss
+import deeplab
+class ModelPhase(object):
+    """
+    Standard name for model phase in PaddleSeg
+    The following standard keys are defined:
+    * `TRAIN`: training mode.
+    * `EVAL`: testing/evaluation mode.
+    * `PREDICT`: prediction/inference mode.
+    * `VISUAL` : visualization mode
+    """
+    TRAIN = 'train'
+    EVAL = 'eval'
+    PREDICT = 'predict'
+    VISUAL = 'visual'
+    @staticmethod
+    def is_train(phase):
+        return phase == ModelPhase.TRAIN
+    @staticmethod
+    def is_predict(phase):
+        return phase == ModelPhase.PREDICT
+    @staticmethod
+    def is_eval(phase):
+        return phase == ModelPhase.EVAL
+    @staticmethod
+    def is_visual(phase):
+        return phase == ModelPhase.VISUAL
+    @staticmethod
+    def is_valid_phase(phase):
+        """ Check valid phase """
+        if ModelPhase.is_train(phase) or ModelPhase.is_predict(phase) \
+                or ModelPhase.is_eval(phase) or ModelPhase.is_visual(phase):
+            return True
+        return False
+def seg_model(image, class_num, arch):
+    model_name = cfg.MODEL.MODEL_NAME
+    if model_name == 'deeplabv3p':
+        logits = deeplab.deeplabv3p_nas(image, class_num, arch)
+    else:
+        raise Exception(
+            "unknown model name, only deeplabv3p is supported"
+        )
+    return logits
+def softmax(logit):
+    logit = fluid.layers.transpose(logit, [0, 2, 3, 1])
+    logit = fluid.layers.softmax(logit)
+    logit = fluid.layers.transpose(logit, [0, 3, 1, 2])
+    return logit
+def sigmoid_to_softmax(logit):
+    """
+    Convert a one-channel sigmoid output into a two-channel (background, foreground) map
+    """
+    logit = fluid.layers.transpose(logit, [0, 2, 3, 1])
+    logit = fluid.layers.sigmoid(logit)
+    logit_back = 1 - logit
+    logit = fluid.layers.concat([logit_back, logit], axis=-1)
+    logit = fluid.layers.transpose(logit, [0, 3, 1, 2])
+    return logit
+def export_preprocess(image):
+    """Preprocessing pipeline built into the exported model."""
+    image = fluid.layers.transpose(image, [0, 3, 1, 2])
+    origin_shape = fluid.layers.shape(image)[-2:]
+    # Resize according to the configured AUG_METHOD
+    if cfg.AUG.AUG_METHOD == 'unpadding':
+        h_fix = cfg.AUG.FIX_RESIZE_SIZE[1]
+        w_fix = cfg.AUG.FIX_RESIZE_SIZE[0]
+        image = fluid.layers.resize_bilinear(
+            image, out_shape=[h_fix, w_fix], align_corners=False, align_mode=0)
+    elif cfg.AUG.AUG_METHOD == 'rangescaling':
+        size = cfg.AUG.INF_RESIZE_VALUE
+        value = fluid.layers.reduce_max(origin_shape)
+        scale = float(size) / value.astype('float32')
+        image = fluid.layers.resize_bilinear(
+            image, scale=scale, align_corners=False, align_mode=0)
+    # Record the image shape after resizing
+    valid_shape = fluid.layers.shape(image)[-2:]
+    # Pad to EVAL_CROP_SIZE
+    width = cfg.EVAL_CROP_SIZE[0]
+    height = cfg.EVAL_CROP_SIZE[1]
+    pad_target = fluid.layers.assign(
+        np.array([height, width]).astype('float32'))
+    up = fluid.layers.assign(np.array([0]).astype('float32'))
+    down = pad_target[0] - valid_shape[0]
+    left = up
+    right = pad_target[1] - valid_shape[1]
+    paddings = fluid.layers.concat([up, down, left, right])
+    paddings = fluid.layers.cast(paddings, 'int32')
+    image = fluid.layers.pad2d(image, paddings=paddings, pad_value=127.5)
+    # normalize
+    mean = np.array(cfg.MEAN).reshape(1, len(cfg.MEAN), 1, 1)
+    mean = fluid.layers.assign(mean.astype('float32'))
+    std = np.array(cfg.STD).reshape(1, len(cfg.STD), 1, 1)
+    std = fluid.layers.assign(std.astype('float32'))
+    image = (image / 255 - mean) / std
+    # Reshape so that later layers can read the feature-map shape via image.shape
+    image = fluid.layers.reshape(
+        image, shape=[-1, cfg.DATASET.DATA_DIM, height, width])
+    return image, valid_shape, origin_shape
+def build_model(main_prog, start_prog, phase=ModelPhase.TRAIN, arch=None):
+    if not ModelPhase.is_valid_phase(phase):
+        raise ValueError("ModelPhase {} is not valid!".format(phase))
+    if ModelPhase.is_train(phase):
+        width = cfg.TRAIN_CROP_SIZE[0]
+        height = cfg.TRAIN_CROP_SIZE[1]
+    else:
+        width = cfg.EVAL_CROP_SIZE[0]
+        height = cfg.EVAL_CROP_SIZE[1]
+    image_shape = [cfg.DATASET.DATA_DIM, height, width]
+    grt_shape = [1, height, width]
+    class_num = cfg.DATASET.NUM_CLASSES
+    with fluid.program_guard(main_prog, start_prog):
+        with fluid.unique_name.guard():
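+            # When exporting the model, build image normalization into the graph to simplify preprocessing at deployment
+            # At deployment, only a batch_size dimension needs to be added to the input image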
+            if ModelPhase.is_predict(phase):
+                origin_image = fluid.layers.data(
+                    name='image',
+                    shape=[-1, -1, -1, cfg.DATASET.DATA_DIM],
+                    dtype='float32',
+                    append_batch_size=False)
+                image, valid_shape, origin_shape = export_preprocess(
+                    origin_image)
+            else:
+                image = fluid.layers.data(
+                    name='image', shape=image_shape, dtype='float32')
+            label = fluid.layers.data(
+                name='label', shape=grt_shape, dtype='int32')
+            mask = fluid.layers.data(
+                name='mask', shape=grt_shape, dtype='int32')
+            # use PyReader when doing training and evaluation
+            if ModelPhase.is_train(phase) or ModelPhase.is_eval(phase):
+                py_reader = fluid.io.PyReader(
+                    feed_list=[image, label, mask],
+                    capacity=cfg.DATALOADER.BUF_SIZE,
+                    iterable=False,
+                    use_double_buffer=True)
+            loss_type = cfg.SOLVER.LOSS
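+            if not isinstance(loss_type, list):
+                # Wrap a single loss name; list("dice_loss") would split it into characters
+                loss_type = [loss_type] if isinstance(loss_type, str) else list(loss_type)
+            # dice_loss and bce_loss apply only to two-class segmentation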
+            if class_num > 2 and (("dice_loss" in loss_type) or
+                                  ("bce_loss" in loss_type)):
+                raise Exception(
+                    "dice loss and bce loss are only applicable to binary classification"
+                )
+            # For two-class segmentation with dice_loss or bce_loss, the final logit has a single output channel
+            if ("dice_loss" in loss_type) or ("bce_loss" in loss_type):
+                class_num = 1
+                if "softmax_loss" in loss_type:
+                    raise Exception(
+                        "softmax loss cannot be combined with dice loss or bce loss"
+                    )
+            logits = seg_model(image, class_num, arch)
+            # Compute the losses selected in SOLVER.LOSS
+            if ModelPhase.is_train(phase) or ModelPhase.is_eval(phase):
+                loss_valid = False
+                avg_loss_list = []
+                valid_loss = []
+                if "softmax_loss" in loss_type:
+                    weight = cfg.SOLVER.CROSS_ENTROPY_WEIGHT
+                    avg_loss_list.append(
+                        multi_softmax_with_loss(logits, label, mask, class_num, weight))
+                    loss_valid = True
+                    valid_loss.append("softmax_loss")
+                if "dice_loss" in loss_type:
+                    avg_loss_list.append(multi_dice_loss(logits, label, mask))
+                    loss_valid = True
+                    valid_loss.append("dice_loss")
+                if "bce_loss" in loss_type:
+                    avg_loss_list.append(multi_bce_loss(logits, label, mask))
+                    loss_valid = True
+                    valid_loss.append("bce_loss")
+                if not loss_valid:
+                    raise Exception(
+                        "SOLVER.LOSS: {} is set incorrectly. It should "
+                        "include at least one of (softmax_loss, bce_loss, dice_loss),"
+                        " for example: ['softmax_loss'], ['dice_loss'], ['bce_loss', 'dice_loss']"
+                        .format(cfg.SOLVER.LOSS))
+                invalid_loss = [x for x in loss_type if x not in valid_loss]
+                if len(invalid_loss) > 0:
+                    print(
+                        "Warning: the loss {} you set is invalid and will not be included in the computed loss."
+                        .format(invalid_loss))
+                avg_loss = 0
+                for i in range(0, len(avg_loss_list)):
+                    avg_loss += avg_loss_list[i]
+            # Get the prediction result at the original size
+            if isinstance(logits, tuple):
+                logit = logits[0]
+            else:
+                logit = logits
+            if logit.shape[2:] != label.shape[2:]:
+                logit = fluid.layers.resize_bilinear(logit, label.shape[2:])
+            # return image input and logit output for inference graph prune
+            if ModelPhase.is_predict(phase):
+                # In two-class segmentation with dice_loss or bce_loss the logit has one channel; convert it to two channels
+                if class_num == 1:
+                    logit = sigmoid_to_softmax(logit)
+                else:
+                    logit = softmax(logit)
+                # Keep only the valid (unpadded) region
+                logit = fluid.layers.slice(
+                    logit, axes=[2, 3], starts=[0, 0], ends=valid_shape)
+                logit = fluid.layers.resize_bilinear(
+                    logit,
+                    out_shape=origin_shape,
+                    align_corners=False,
+                    align_mode=0)
+                logit = fluid.layers.argmax(logit, axis=1)
+                return origin_image, logit
+            if class_num == 1:
+                out = sigmoid_to_softmax(logit)
+                out = fluid.layers.transpose(out, [0, 2, 3, 1])
+            else:
+                out = fluid.layers.transpose(logit, [0, 2, 3, 1])
+            pred = fluid.layers.argmax(out, axis=3)
+            pred = fluid.layers.unsqueeze(pred, axes=[3])
+            if ModelPhase.is_visual(phase):
+                if class_num == 1:
+                    logit = sigmoid_to_softmax(logit)
+                else:
+                    logit = softmax(logit)
+                return pred, logit
+            if ModelPhase.is_eval(phase):
+                return py_reader, avg_loss, pred, label, mask
+            if ModelPhase.is_train(phase):
+                optimizer = solver.Solver(main_prog, start_prog)
+                decayed_lr = optimizer.optimise(avg_loss)
+                return py_reader, avg_loss, decayed_lr, pred, label, mask
+def to_int(string, dest="I"):
+    return struct.unpack(dest, string)[0]
+def parse_shape_from_file(filename):
+    with open(filename, "rb") as file:
+        version = file.read(4)
+        lod_level = to_int(file.read(8), dest="Q")
+        for i in range(lod_level):
+            _size = to_int(file.read(8), dest="Q")
+            _ = file.read(_size)
+        version = file.read(4)
+        tensor_desc_size = to_int(file.read(4))
+        tensor_desc = VarType.TensorDesc()
+        tensor_desc.ParseFromString(file.read(tensor_desc_size))
+    return tuple(tensor_desc.dims)
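The header walk in `parse_shape_from_file` can be tried without Paddle installed. The following is only a minimal sketch under stated assumptions: `read_tensor_desc_bytes` is a hypothetical helper, the byte string is a hand-built stand-in rather than a real parameter file, the format strings use an explicit little-endian prefix (the diff uses native `"I"` and `"Q"`), and the protobuf decoding step is left out, so the function returns the raw descriptor bytes instead of a shape.

```python
import io
import struct


def read_tensor_desc_bytes(stream):
    """Skip the version and LoD fields, then return the raw descriptor bytes."""
    stream.read(4)  # first 4-byte version field
    (lod_level,) = struct.unpack("<Q", stream.read(8))
    for _ in range(lod_level):
        (level_size,) = struct.unpack("<Q", stream.read(8))
        stream.read(level_size)  # skip the data of one LoD level
    stream.read(4)  # second 4-byte version field
    (desc_size,) = struct.unpack("<I", stream.read(4))
    return stream.read(desc_size)


# Hand-built stand-in for a saved variable: one LoD level of 16 bytes,
# a fake descriptor payload, then trailing bytes standing in for tensor data.
payload = b"fake-tensor-desc"
blob = (
    struct.pack("<I", 0)
    + struct.pack("<Q", 1)
    + struct.pack("<Q", 16) + b"\x00" * 16
    + struct.pack("<I", 0)
    + struct.pack("<I", len(payload))
    + payload
    + b"tensor data"
)
desc_bytes = read_tensor_desc_bytes(io.BytesIO(blob))
```

In the diff, the bytes returned at this point are what gets passed to `VarType.TensorDesc().ParseFromString` to recover the dims.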
--- a/slim/nas/train_nas.py
+++ b/slim/nas/train_nas.py
+# coding: utf8
+# copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import os
+# GPU memory garbage collection optimization flags
+os.environ['FLAGS_eager_delete_tensor_gb'] = "0.0"
+import sys
+LOCAL_PATH = os.path.dirname(os.path.abspath(__file__))
+SEG_PATH = os.path.join(LOCAL_PATH, "../../", "pdseg")
+sys.path.append(SEG_PATH)
+import argparse
+import pprint
+import random
+import shutil
+import functools
+import paddle
+import numpy as np
+import paddle.fluid as fluid
+from utils.config import cfg
+from utils.timer import Timer, calculate_eta
+from metrics import ConfusionMatrix
+from reader import SegDataset
+from model_builder import build_model
+from model_builder import ModelPhase
+from model_builder import parse_shape_from_file
+from eval_nas import evaluate
+from vis import visualize
+from utils import dist_utils
+from mobilenetv2_search_space import MobileNetV2SpaceSeg
+from paddleslim.nas.search_space.search_space_factory import SearchSpaceFactory
+from paddleslim.analysis import flops
+from paddleslim.nas.sa_nas import SANAS
+from paddleslim.nas import search_space
+def parse_args():
+    parser = argparse.ArgumentParser(description='PaddleSeg training')
+    parser.add_argument(
+        '--cfg',
+        dest='cfg_file',
+        help='Config file for training (and optionally testing)',
+        default=None,
+        type=str)
+    parser.add_argument(
+        '--use_gpu',
+        dest='use_gpu',
+        help='Use gpu or cpu',
+        action='store_true',
+        default=False)
+    parser.add_argument(
+        '--use_mpio',
+        dest='use_mpio',
+        help='Use multiprocess I/O or not',
+        action='store_true',
+        default=False)
+    parser.add_argument(
+        '--log_steps',
+        dest='log_steps',
+        help='Display logging information at every log_steps',
+        default=10,
+        type=int)
+    parser.add_argument(
+        '--debug',
+        dest='debug',
+        help='debug mode, display detail information of training',
+        action='store_true')
+    parser.add_argument(
+        '--use_tb',
+        dest='use_tb',
+        help='whether to record the data during training to Tensorboard',
+        action='store_true')
+    parser.add_argument(
+        '--tb_log_dir',
+        dest='tb_log_dir',
+        help='Tensorboard logging directory',
+        default=None,
+        type=str)
+    parser.add_argument(
+        '--do_eval',
+        dest='do_eval',
+        help='Evaluate the model on every new checkpoint',
+        action='store_true')
+    parser.add_argument(
+        'opts',
+        help='See utils/config.py for all options',
+        default=None,
+        nargs=argparse.REMAINDER)
+    parser.add_argument(
+        '--enable_ce',
+        dest='enable_ce',
+        help='If set True, enable continuous evaluation job. '
+        'This flag is only used for internal test.',
+        action='store_true')
+    return parser.parse_args()
+def save_vars(executor, dirname, program=None, vars=None):
+    """
+    Temporary workaround for saving variables on Windows.
+    Will fix in PaddlePaddle v1.5.2
+    """
+    save_program = fluid.Program()
+    save_block = save_program.global_block()
+    for each_var in vars:
+        # NOTE: don't save the variable which type is RAW
+        if each_var.type == fluid.core.VarDesc.VarType.RAW:
+            continue
+        new_var = save_block.create_var(
+            name=each_var.name,
+            shape=each_var.shape,
+            dtype=each_var.dtype,
+            type=each_var.type,
+            lod_level=each_var.lod_level,
+            persistable=True)
+        file_path = os.path.join(dirname, new_var.name)
+        file_path = os.path.normpath(file_path)
+        save_block.append_op(
+            type='save',
+            inputs={'X': [new_var]},
+            outputs={},
+            attrs={'file_path': file_path})
+    executor.run(save_program)
+def save_checkpoint(exe, program, ckpt_name):
+    """
+    Save checkpoint for evaluation or resume training
+    """
+    ckpt_dir = os.path.join(cfg.TRAIN.MODEL_SAVE_DIR, str(ckpt_name))
+    print("Save model checkpoint to {}".format(ckpt_dir))
+    if not os.path.isdir(ckpt_dir):
+        os.makedirs(ckpt_dir)
+    save_vars(
+        exe,
+        ckpt_dir,
+        program,
+        vars=list(filter(fluid.io.is_persistable, program.list_vars())))
+    return ckpt_dir
+def load_checkpoint(exe, program):
+    """
+    Load a checkpoint from the resume model directory to continue training
+    """
+    print('Resume model training from:', cfg.TRAIN.RESUME_MODEL_DIR)
+    if not os.path.exists(cfg.TRAIN.RESUME_MODEL_DIR):
+        raise ValueError("TRAIN.RESUME_MODEL_DIR {} does not exist!".format(
+            cfg.TRAIN.RESUME_MODEL_DIR))
+    fluid.io.load_persistables(
+        exe, cfg.TRAIN.RESUME_MODEL_DIR, main_program=program)
+    model_path = cfg.TRAIN.RESUME_MODEL_DIR
+    # Check whether the path ends with a path separator
+    if model_path[-1] == os.sep:
+        model_path = model_path[0:-1]
+    epoch_name = os.path.basename(model_path)
+    # If resume model is final model
+    if epoch_name == 'final':
+        begin_epoch = cfg.SOLVER.NUM_EPOCHS
+    # If the resume model directory name is a number, restore the epoch status
+    elif epoch_name.isdigit():
+        epoch = int(epoch_name)
+        begin_epoch = epoch + 1
+    else:
+        raise ValueError("Resume model path is not valid!")
+    print("Model checkpoint loaded successfully!")
+    return begin_epoch
+def update_best_model(ckpt_dir):
+    best_model_dir = os.path.join(cfg.TRAIN.MODEL_SAVE_DIR, 'best_model')
+    if os.path.exists(best_model_dir):
+        shutil.rmtree(best_model_dir)
+    shutil.copytree(ckpt_dir, best_model_dir)
+def print_info(*msg):
+    if cfg.TRAINER_ID == 0:
+        print(*msg)
+def train(cfg):
+    startup_prog = fluid.Program()
+    train_prog = fluid.Program()
+    if args.enable_ce:
+        startup_prog.random_seed = 1000
+        train_prog.random_seed = 1000
+    drop_last = True
+    dataset = SegDataset(
+        file_list=cfg.DATASET.TRAIN_FILE_LIST,
+        mode=ModelPhase.TRAIN,
+        shuffle=True,
+        data_dir=cfg.DATASET.DATA_DIR)
+    def data_generator():
+        if args.use_mpio:
+            data_gen = dataset.multiprocess_generator(
+                num_processes=cfg.DATALOADER.NUM_WORKERS,
+                max_queue_size=cfg.DATALOADER.BUF_SIZE)
+        else:
+            data_gen = dataset.generator()
+        batch_data = []
+        for b in data_gen:
+            batch_data.append(b)
+            if len(batch_data) == (cfg.BATCH_SIZE // cfg.NUM_TRAINERS):
+                for item in batch_data:
+                    yield item[0], item[1], item[2]
+                batch_data = []
+        # If use sync batch norm strategy, drop last batch if number of samples
+        # in batch_data is less than cfg.BATCH_SIZE to avoid NCCL hang issues
+        if not cfg.TRAIN.SYNC_BATCH_NORM:
+            for item in batch_data:
+                yield item[0], item[1], item[2]
+    # Get device environment
+    # places = fluid.cuda_places() if args.use_gpu else fluid.cpu_places()
+    # place = places[0]
+    gpu_id = int(os.environ.get('FLAGS_selected_gpus', 0))
+    place = fluid.CUDAPlace(gpu_id) if args.use_gpu else fluid.CPUPlace()
+    places = fluid.cuda_places() if args.use_gpu else fluid.cpu_places()
+    # Get number of GPU
+    dev_count = cfg.NUM_TRAINERS if cfg.NUM_TRAINERS > 1 else len(places)
+    print_info("#Device count: {}".format(dev_count))
+    # Make sure BATCH_SIZE is divisible by the number of GPU cards
+    assert cfg.BATCH_SIZE % dev_count == 0, (
+        'BATCH_SIZE:{} not divisible by number of GPUs:{}'.format(
+            cfg.BATCH_SIZE, dev_count))
+    # In multi-GPU training mode, batch data is allocated to each GPU evenly
+    batch_size_per_dev = cfg.BATCH_SIZE // dev_count
+    print_info("batch_size_per_dev: {}".format(batch_size_per_dev))
+    config_info = {'input_size': 769, 'output_size': 1, 'block_num': 7}
+    config = ([(cfg.SLIM.NAS_SPACE_NAME, config_info)])
+    factory = SearchSpaceFactory()
+    space = factory.get_search_space(config)
+    port = cfg.SLIM.NAS_PORT
+    server_address = (cfg.SLIM.NAS_ADDRESS, port)
+    sa_nas = SANAS(config, server_addr=server_address, search_steps=cfg.SLIM.NAS_SEARCH_STEPS,
+                   is_server=cfg.SLIM.NAS_IS_SERVER)
+    for step in range(cfg.SLIM.NAS_SEARCH_STEPS):
+        arch = sa_nas.next_archs()[0]
+        start_prog = fluid.Program()
+        train_prog = fluid.Program()
+        py_reader, avg_loss, lr, pred, grts, masks = build_model(
+            train_prog, start_prog, arch=arch, phase=ModelPhase.TRAIN)
+        cur_flops = flops(train_prog)
+        print('current step:', step, 'flops:', cur_flops)
+        py_reader.decorate_sample_generator(
+            data_generator, batch_size=batch_size_per_dev, drop_last=drop_last)
+        exe = fluid.Executor(place)
+        exe.run(start_prog)
+        exec_strategy = fluid.ExecutionStrategy()
+        # Clear temporary variables every 100 iteration
+        if args.use_gpu:
+            exec_strategy.num_threads = fluid.core.get_cuda_device_count()
+        exec_strategy.num_iteration_per_drop_scope = 100
+        build_strategy = fluid.BuildStrategy()
+        if cfg.NUM_TRAINERS > 1 and args.use_gpu:
+            dist_utils.prepare_for_multi_process(exe, build_strategy, train_prog)
+            exec_strategy.num_threads = 1
+        if cfg.TRAIN.SYNC_BATCH_NORM and args.use_gpu:
+            if dev_count > 1:
+                # Apply sync batch norm strategy
+                print_info("Sync BatchNorm strategy is effective.")
+                build_strategy.sync_batch_norm = True
+            else:
+                print_info(
+                    "Sync BatchNorm strategy will not be effective if GPU device"
+                    " count <= 1")
+        compiled_train_prog = fluid.CompiledProgram(train_prog).with_data_parallel(
+            loss_name=avg_loss.name,
+            exec_strategy=exec_strategy,
+            build_strategy=build_strategy)
+        # Resume training
+        begin_epoch = cfg.SOLVER.BEGIN_EPOCH
+        if cfg.TRAIN.RESUME_MODEL_DIR:
+            begin_epoch = load_checkpoint(exe, train_prog)
+        # Load pretrained model
+        elif os.path.exists(cfg.TRAIN.PRETRAINED_MODEL_DIR):
+            print_info('Pretrained model dir: ', cfg.TRAIN.PRETRAINED_MODEL_DIR)
+            load_vars = []
+            load_fail_vars = []
+            def var_shape_matched(var, shape):
+                """
+                Check whether the persistable variable's shape matches the current network
+                """
+                var_exist = os.path.exists(
+                    os.path.join(cfg.TRAIN.PRETRAINED_MODEL_DIR, var.name))
+                if var_exist:
+                    var_shape = parse_shape_from_file(
+                        os.path.join(cfg.TRAIN.PRETRAINED_MODEL_DIR, var.name))
+                    return var_shape == shape
+                return False
+            for x in train_prog.list_vars():
+                if isinstance(x, fluid.framework.Parameter):
+                    shape = tuple(fluid.global_scope().find_var(
+                        x.name).get_tensor().shape())
+                    if var_shape_matched(x, shape):
+                        load_vars.append(x)
+                    else:
+                        load_fail_vars.append(x)
+            fluid.io.load_vars(
+                exe, dirname=cfg.TRAIN.PRETRAINED_MODEL_DIR, vars=load_vars)
+            for var in load_vars:
+                print_info("Parameter[{}] loaded successfully!".format(var.name))
+            for var in load_fail_vars:
+                print_info(
+                    "Parameter[{}] doesn't exist or its shape does not match the current network, skipping"
+                    " it.".format(var.name))
+            print_info("{}/{} pretrained parameters loaded successfully!".format(
+                len(load_vars),
+                len(load_vars) + len(load_fail_vars)))
+        else:
+            print_info(
+                'Pretrained model dir {} does not exist, training from scratch...'.
+                    format(cfg.TRAIN.PRETRAINED_MODEL_DIR))
+        fetch_list = [avg_loss.name, lr.name]
+        global_step = 0
+        all_step = cfg.DATASET.TRAIN_TOTAL_IMAGES // cfg.BATCH_SIZE
+        if cfg.DATASET.TRAIN_TOTAL_IMAGES % cfg.BATCH_SIZE and drop_last != True:
+            all_step += 1
+        all_step *= (cfg.SOLVER.NUM_EPOCHS - begin_epoch + 1)
+        avg_loss = 0.0
+        timer = Timer()
+        timer.start()
+        if begin_epoch > cfg.SOLVER.NUM_EPOCHS:
+            raise ValueError(
+                ("begin epoch[{}] is larger than cfg.SOLVER.NUM_EPOCHS[{}]").format(
+                    begin_epoch, cfg.SOLVER.NUM_EPOCHS))
+        if args.use_mpio:
+            print_info("Use multiprocess reader")
+        else:
+            print_info("Use multi-thread reader")
+        best_miou = 0.0
+        for epoch in range(begin_epoch, cfg.SOLVER.NUM_EPOCHS + 1):
+            py_reader.start()
+            while True:
+                try:
+                    loss, lr = exe.run(
+                        program=compiled_train_prog,
+                        fetch_list=fetch_list,
+                        return_numpy=True)
+                    avg_loss += np.mean(np.array(loss))
+                    global_step += 1
+                    if global_step % args.log_steps == 0 and cfg.TRAINER_ID == 0:
+                        avg_loss /= args.log_steps
+                        speed = args.log_steps / timer.elapsed_time()
+                        print((
+                                  "epoch={} step={} lr={:.5f} loss={:.4f} step/sec={:.3f} | ETA {}"
+                              ).format(epoch, global_step, lr[0], avg_loss, speed,
+                                       calculate_eta(all_step - global_step, speed)))
+                        sys.stdout.flush()
+                        avg_loss = 0.0
+                        timer.restart()
+                except fluid.core.EOFException:
+                    py_reader.reset()
+                    break
+                except Exception as e:
+                    print(e)
+            if epoch > cfg.SLIM.NAS_START_EVAL_EPOCH:
+                ckpt_dir = save_checkpoint(exe, train_prog, '{}_tmp'.format(port))
+                _, mean_iou, _, mean_acc = evaluate(
+                    cfg=cfg,
+                    arch=arch,
+                    ckpt_dir=ckpt_dir,
+                    use_gpu=args.use_gpu,
+                    use_mpio=args.use_mpio)
+                if best_miou < mean_iou:
+                    print('search step {}, epoch {} best iou {}'.format(step, epoch, mean_iou))
+                    best_miou = mean_iou
+        sa_nas.reward(float(best_miou))
+def main(args):
+    if args.cfg_file is not None:
+        cfg.update_from_file(args.cfg_file)
+    if args.opts:
+        cfg.update_from_list(args.opts)
+    if args.enable_ce:
+        random.seed(0)
+        np.random.seed(0)
+    cfg.TRAINER_ID = int(os.getenv("PADDLE_TRAINER_ID", 0))
+    cfg.NUM_TRAINERS = int(os.environ.get('PADDLE_TRAINERS_NUM', 1))
+    cfg.check_and_infer()
+    print_info(pprint.pformat(cfg))
+    train(cfg)
+if __name__ == '__main__':
+    args = parse_args()
+    if not fluid.core.is_compiled_with_cuda() and args.use_gpu:
+        print(
+            "You can not set use_gpu = True in the model because you are using paddlepaddle-cpu."
+        )
+        print(
+            "Please: 1. Install paddlepaddle-gpu to run your models on GPU or 2. Set use_gpu=False to run models on CPU."
+        )
+        sys.exit(1)
+    main(args)
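The control flow of `train()` in `train_nas.py` boils down to: ask the controller for an architecture, train it, evaluate only after a warm-up number of epochs, and report the best mIoU back as the reward. The sketch below mirrors only that loop shape. `ToyController`, `train_one_epoch`, and `evaluate_miou` are hypothetical stand-ins invented for illustration; they are not the PaddleSlim `SANAS` API, and the scores are made up.

```python
import random


class ToyController:
    """Hypothetical stand-in for the search controller (not PaddleSlim's SANAS)."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._current = None
        self.history = []

    def next_archs(self):
        # Propose one candidate: a list of four small integer tokens.
        self._current = [self._rng.randint(0, 3) for _ in range(4)]
        return [self._current]

    def reward(self, score):
        # Record the score obtained by the most recently proposed candidate.
        self.history.append((self._current, score))


def train_one_epoch(arch, epoch):
    """Placeholder for one epoch of training; does nothing here."""


def evaluate_miou(arch, epoch):
    """Placeholder evaluation returning a made-up score in (0, 1]."""
    return min(1.0, 0.1 * epoch + 0.01 * sum(arch))


def search(controller, search_steps, num_epochs, start_eval_epoch):
    for _ in range(search_steps):
        arch = controller.next_archs()[0]
        best_miou = 0.0
        for epoch in range(1, num_epochs + 1):
            train_one_epoch(arch, epoch)
            # Evaluation is skipped during the warm-up epochs.
            if epoch > start_eval_epoch:
                best_miou = max(best_miou, evaluate_miou(arch, epoch))
        controller.reward(float(best_miou))
    return controller.history


history = search(ToyController(), search_steps=3, num_epochs=4, start_eval_epoch=1)
```

One consequence of this shape is that a candidate whose training never gets past the warm-up epochs reports a reward of 0.0, which is also what the diff does when `NUM_EPOCHS` is not larger than `SLIM.NAS_START_EVAL_EPOCH`.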
--- a/slim/prune/README.md
+++ b/slim/prune/README.md
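+# PaddleSeg Pruning Tutorial
+Before reading this tutorial, please make sure you have gone through the [PaddleSeg usage guide](../../docs/usage.md) and related chapters so that you have a basic understanding of PaddleSeg.
+This document describes how to use the convolution channel pruning interface of [PaddleSlim](https://paddlepaddle.github.io/PaddleSlim) to prune the channels of convolution layers in models from the segmentation library.
+In the segmentation library, pruning can be run directly with the `PaddleSeg/slim/prune/train_prune.py` script, which calls PaddleSlim's [paddleslim.prune.Pruner](https://paddlepaddle.github.io/PaddleSlim/api/prune_api/#Pruner) interface.
+Unless otherwise noted, all operations in this tutorial are run from the `PaddleSeg/` directory.
+## 1. Prepare the data and pretrained model
+Run the following command to download the Cityscapes dataset: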
+```
+python dataset/download_cityscapes.py
+```
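+Refer to the [pretrained model list](../../docs/model_zoo.md) to obtain the pretrained model you need.
+## 2. Determine the parameters to prune
+We reduce the number of channels in a convolution layer by pruning its parameters, so before pruning we need to determine the names of the convolution parameters to prune.
+Use the following code to list all parameters of the current model: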
+```python
+# List all Parameters of the model
+for x in train_prog.list_vars():
+    if isinstance(x, fluid.framework.Parameter):
+        print(x.name, x.shape)
+```
+By inspecting the parameter names and shapes, pick out all convolution layer parameters and decide which of them to prune.
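One way to do this filtering is by rank, since a 2-D convolution weight has a 4-D shape (output channels, input channels, kernel height, kernel width). The following is a minimal sketch, assuming the names and shapes have already been printed and collected into a list. The shapes, the `BatchNorm/gamma` entry, the `classifier/biases` entry, and the helper `conv_weight_names` are hypothetical and only illustrate the idea.

```python
# (name, shape) pairs as they might be printed by the loop above.
# The shapes and the non-convolution entries are made-up examples.
named_shapes = [
    ("learning_to_downsample/weights", (32, 3, 3, 3)),
    ("learning_to_downsample/BatchNorm/gamma", (32,)),
    ("learning_to_downsample/dsconv1/pointwise/weights", (48, 32, 1, 1)),
    ("classifier/biases", (19,)),
]


def conv_weight_names(pairs):
    """Keep only parameters whose shape has four dimensions."""
    return [name for name, shape in pairs if len(shape) == 4]


candidates = conv_weight_names(named_shapes)
```

The rank check only narrows the list; which of the remaining convolution weights to prune is still a manual choice.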
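+## 3. Launch the pruning task
+When launching a pruning task with `train_prune.py`, use the `SLIM.PRUNE_PARAMS` option to specify the list of parameter names to prune, separated by commas, and the `SLIM.PRUNE_RATIOS` option to specify the ratio pruned from each parameter.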
+```shell
+CUDA_VISIBLE_DEVICES=0 \
+python -u ./slim/prune/train_prune.py --log_steps 10 --cfg configs/cityscape_fast_scnn.yaml --use_gpu --use_mpio \
+SLIM.PRUNE_PARAMS 'learning_to_downsample/weights,learning_to_downsample/dsconv1/pointwise/weights,learning_to_downsample/dsconv2/pointwise/weights' \
+SLIM.PRUNE_RATIOS '[0.1,0.1,0.1]'
+```
+Here we select three parameters and prune each of them by a ratio of 0.1.
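Because the names and the ratios are passed as two separate options, it is easy to supply lists of different lengths. The sketch below shows one way to pair and validate the two values before launching a job. It is not taken from `train_prune.py`; the helper `pair_params_with_ratios` and its checks are assumptions made for illustration.

```python
import ast


def pair_params_with_ratios(params_opt, ratios_opt):
    """Pair each parameter name with its pruning ratio and validate both lists."""
    names = [name.strip() for name in params_opt.split(",") if name.strip()]
    ratios = [float(r) for r in ast.literal_eval(ratios_opt)]
    if len(names) != len(ratios):
        raise ValueError(
            "got {} names but {} ratios".format(len(names), len(ratios)))
    for ratio in ratios:
        if not 0.0 <= ratio < 1.0:
            raise ValueError("ratio {} is outside [0, 1)".format(ratio))
    return dict(zip(names, ratios))


# The same option values as in the command above.
plan = pair_params_with_ratios(
    "learning_to_downsample/weights,"
    "learning_to_downsample/dsconv1/pointwise/weights,"
    "learning_to_downsample/dsconv2/pointwise/weights",
    "[0.1,0.1,0.1]",
)
```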
+## 4. Evaluation
+```shell
+CUDA_VISIBLE_DEVICES=0 \
+python -u ./slim/prune/eval_prune.py --cfg configs/cityscape_fast_scnn.yaml --use_gpu --use_mpio \
+TEST.TEST_MODEL your_trained_model
+```
+## 5. Models
+| Model | Dataset | Download | Pruning method | FLOPs | mIoU on val |
+|---|---|---|---|---|---|
+| Fast-SCNN/bn | Cityscapes |[fast_scnn_cityscapes.tar](https://paddleseg.bj.bcebos.com/models/fast_scnn_cityscape.tar) | None | 7.21G | 0.6964 |
+| Fast-SCNN/bn | Cityscapes |[fast_scnn_cityscapes-uniform-51.tar](https://paddleseg.bj.bcebos.com/models/fast_scnn_cityscape-uniform-51.tar) | uniform | 3.54G | 0.6990 |
--- a/slim/prune/eval_prune.py
+++ b/slim/prune/eval_prune.py
+# coding: utf8
+# copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import os
+# GPU memory garbage collection optimization flags
+os.environ['FLAGS_eager_delete_tensor_gb'] = "0.0"
+import sys
+LOCAL_PATH = os.path.dirname(os.path.abspath(__file__))
+SEG_PATH = os.path.join(LOCAL_PATH, "../../", "pdseg")
+sys.path.append(SEG_PATH)
+import time
+import argparse
+import functools
+import pprint
+import cv2
+import numpy as np
+import paddle
+import paddle.fluid as fluid
+from utils.config import cfg
+from utils.timer import Timer, calculate_eta
+from models.model_builder import build_model
+from models.model_builder import ModelPhase
+from reader import SegDataset
+from metrics import ConfusionMatrix
+from paddleslim.prune.io import load_model
+def parse_args():
+    parser = argparse.ArgumentParser(description='PaddleSeg model evaluation')
+    parser.add_argument(
+        '--cfg',
+        dest='cfg_file',
+        help='Config file for training (and optionally testing)',
+        default=None,
+        type=str)
+    parser.add_argument(
+        '--use_gpu',
+        dest='use_gpu',
+        help='Use gpu or cpu',
+        action='store_true',
+        default=False)
+    parser.add_argument(
+        '--use_mpio',
+        dest='use_mpio',
+        help='Use multiprocess IO or not',
+        action='store_true',
+        default=False)
+    parser.add_argument(
+        'opts',
+        help='See utils/config.py for all options',
+        default=None,
+        nargs=argparse.REMAINDER)
+    if len(sys.argv) == 1:
+        parser.print_help()
+        sys.exit(1)
+    return parser.parse_args()
+def evaluate(cfg, ckpt_dir=None, use_gpu=False, use_mpio=False, **kwargs):
+    np.set_printoptions(precision=5, suppress=True)
+    startup_prog = fluid.Program()
+    test_prog = fluid.Program()
+    dataset = SegDataset(
+        file_list=cfg.DATASET.VAL_FILE_LIST,
+        mode=ModelPhase.EVAL,
+        data_dir=cfg.DATASET.DATA_DIR)
+    def data_generator():
+        # TODO: check whether the batch reader is compatible with Windows
+        if use_mpio:
+            data_gen = dataset.multiprocess_generator(
+                num_processes=cfg.DATALOADER.NUM_WORKERS,
+                max_queue_size=cfg.DATALOADER.BUF_SIZE)
+        else:
+            data_gen = dataset.generator()
+        for b in data_gen:
+            yield b[0], b[1], b[2]
+    py_reader, avg_loss, pred, grts, masks = build_model(
+        test_prog, startup_prog, phase=ModelPhase.EVAL)
+    py_reader.decorate_sample_generator(
+        data_generator, drop_last=False, batch_size=cfg.BATCH_SIZE)
+    # Get device environment
+    places = fluid.cuda_places() if use_gpu else fluid.cpu_places()
+    place = places[0]
+    dev_count = len(places)
+    print("#Device count: {}".format(dev_count))
+    exe = fluid.Executor(place)
+    exe.run(startup_prog)
+    test_prog = test_prog.clone(for_test=True)
+    ckpt_dir = cfg.TEST.TEST_MODEL if not ckpt_dir else ckpt_dir
+    # Validate the path before use; os.path.exists(None) would raise a TypeError
+    if not ckpt_dir or not os.path.exists(ckpt_dir):
+        raise ValueError('The TEST.TEST_MODEL {} is not found'.format(ckpt_dir))
+    print('load test model:', ckpt_dir)
+    load_model(exe, test_prog, ckpt_dir)
+    # Use streaming confusion matrix to calculate mean_iou
+    np.set_printoptions(
+        precision=4, suppress=True, linewidth=160, floatmode="fixed")
+    conf_mat = ConfusionMatrix(cfg.DATASET.NUM_CLASSES, streaming=True)
+    fetch_list = [avg_loss.name, pred.name, grts.name, masks.name]
+    num_images = 0
+    step = 0
+    all_step = cfg.DATASET.TEST_TOTAL_IMAGES // cfg.BATCH_SIZE + 1
+    timer = Timer()
+    timer.start()
+    py_reader.start()
+    while True:
+        try:
+            step += 1
+            loss, pred, grts, masks = exe.run(
+                test_prog, fetch_list=fetch_list, return_numpy=True)
+            loss = np.mean(np.array(loss))
+            num_images += pred.shape[0]
+            conf_mat.calculate(pred, grts, masks)
+            _, iou = conf_mat.mean_iou()
+            _, acc = conf_mat.accuracy()
+            speed = 1.0 / timer.elapsed_time()
+            print(
+                "[EVAL]step={} loss={:.5f} acc={:.4f} IoU={:.4f} step/sec={:.2f} | ETA {}"
+                .format(step, loss, acc, iou, speed,
+                        calculate_eta(all_step - step, speed)))
+            timer.restart()
+            sys.stdout.flush()
+        except fluid.core.EOFException:
+            break
+    category_iou, avg_iou = conf_mat.mean_iou()
+    category_acc, avg_acc = conf_mat.accuracy()
+    print("[EVAL]#image={} acc={:.4f} IoU={:.4f}".format(
+        num_images, avg_acc, avg_iou))
+    print("[EVAL]Category IoU:", category_iou)
+    print("[EVAL]Category Acc:", category_acc)
+    print("[EVAL]Kappa:{:.4f}".format(conf_mat.kappa()))
+    return category_iou, avg_iou, category_acc, avg_acc
+def main():
+    args = parse_args()
+    if args.cfg_file is not None:
+        cfg.update_from_file(args.cfg_file)
+    if args.opts:
+        cfg.update_from_list(args.opts)
+    cfg.check_and_infer()
+    print(pprint.pformat(cfg))
+    evaluate(cfg, **args.__dict__)
+if __name__ == '__main__':
+    main()
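The evaluation loop in `eval_prune.py` feeds every batch into one streaming confusion matrix and derives accuracy and mean IoU from the running totals. The following is a minimal pure-Python sketch of that idea. It is not the `metrics.ConfusionMatrix` class used by the script; the function names and the toy pixel lists are assumptions made for illustration.

```python
def update_confusion(matrix, preds, labels, masks):
    """Accumulate one batch (rows: ground-truth class, columns: predicted class)."""
    for p, g, m in zip(preds, labels, masks):
        if m:  # a mask value of 0 marks an ignored pixel
            matrix[g][p] += 1


def mean_iou(matrix):
    """Return (per-class IoU list, mean IoU) from the accumulated matrix."""
    n = len(matrix)
    ious = []
    for c in range(n):
        tp = matrix[c][c]
        labelled = sum(matrix[c])                        # pixels whose label is c
        predicted = sum(matrix[r][c] for r in range(n))  # pixels predicted as c
        union = labelled + predicted - tp
        ious.append(tp / union if union else 0.0)
    return ious, sum(ious) / n


def accuracy(matrix):
    """Fraction of non-ignored pixels that were classified correctly."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[c][c] for c in range(len(matrix)))
    return correct / total if total else 0.0


num_classes = 2
matrix = [[0] * num_classes for _ in range(num_classes)]
# Two hypothetical batches of flattened pixels.
update_confusion(matrix, preds=[0, 1, 1, 0], labels=[0, 1, 0, 0], masks=[1, 1, 1, 1])
update_confusion(matrix, preds=[1, 1, 0], labels=[1, 1, 1], masks=[1, 1, 0])
ious, miou = mean_iou(matrix)
print(matrix, ious, miou, accuracy(matrix))
```

Because the matrix is only ever added to, metrics computed after each batch are running values over everything seen so far, which is presumably why the script can print `acc` and `IoU` at every step and reuse the same object for the final summary.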
--- a/slim/prune/train_prune.py
+++ b/slim/prune/train_prune.py
+# coding: utf8
+# copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import os
+# GPU memory garbage collection optimization flags
+os.environ['FLAGS_eager_delete_tensor_gb'] = "0.0"
+import sys
+LOCAL_PATH = os.path.dirname(os.path.abspath(__file__))
+SEG_PATH = os.path.join(LOCAL_PATH, "../../", "pdseg")
+sys.path.append(SEG_PATH)
+import argparse
+import pprint
+import shutil
+import functools
+import paddle
+import numpy as np
+import paddle.fluid as fluid
+from utils.config import cfg
+from utils.timer import Timer, calculate_eta
+from metrics import ConfusionMatrix
+from reader import SegDataset
+from models.model_builder import build_model
+from models.model_builder import ModelPhase
+from models.model_builder import parse_shape_from_file
+from eval_prune import evaluate
+from vis import visualize
+from utils import dist_utils
+from paddleslim.prune import Pruner
+from paddleslim.prune.io import *
+from paddleslim.analysis import flops
+def parse_args():
+    parser = argparse.ArgumentParser(description='PaddleSeg training')
+    parser.add_argument(
+        '--cfg',
+        dest='cfg_file',
+        help='Config file for training (and optionally testing)',
+        default=None,
+        type=str)
+    parser.add_argument(
+        '--use_gpu',
+        dest='use_gpu',
+        help='Use gpu or cpu',
+        action='store_true',
+        default=False)
+    parser.add_argument(
+        '--use_mpio',
+        dest='use_mpio',
+        help='Use multiprocess I/O or not',
+        action='store_true',
+        default=False)
+    parser.add_argument(
+        '--log_steps',
+        dest='log_steps',
+        help='Display logging information at every log_steps',
+        default=10,
+        type=int)
+    parser.add_argument(
+        '--debug',
+        dest='debug',
+        help='Debug mode: display detailed training information',
+        action='store_true')
+    parser.add_argument(
+        '--use_tb',
+        dest='use_tb',
+        help='whether to record the data during training to Tensorboard',
+        action='store_true')
+    parser.add_argument(
+        '--tb_log_dir',
+        dest='tb_log_dir',
+        help='Tensorboard logging directory',
+        default=None,
+        type=str)
+    parser.add_argument(
+        '--do_eval',
+        dest='do_eval',
+        help='Evaluate the model on every new checkpoint',
+        action='store_true')
+    parser.add_argument(
+        'opts',
+        help='See utils/config.py for all options',
+        default=None,
+        nargs=argparse.REMAINDER)
+    return parser.parse_args()
+def save_vars(executor, dirname, program=None, vars=None):
+    """
+    Temporary workaround for saving variables on Windows.
+    Will be fixed in PaddlePaddle v1.5.2
+    """
+    save_program = fluid.Program()
+    save_block = save_program.global_block()
+    for each_var in vars:
+        # NOTE: skip variables whose type is RAW
+        if each_var.type == fluid.core.VarDesc.VarType.RAW:
+            continue
+        new_var = save_block.create_var(
+            name=each_var.name,
+            shape=each_var.shape,
+            dtype=each_var.dtype,
+            type=each_var.type,
+            lod_level=each_var.lod_level,
+            persistable=True)
+        file_path = os.path.join(dirname, new_var.name)
+        file_path = os.path.normpath(file_path)
+        save_block.append_op(
+            type='save',
+            inputs={'X': [new_var]},
+            outputs={},
+            attrs={'file_path': file_path})
+    executor.run(save_program)
+def save_prune_checkpoint(exe, program, ckpt_name):
+    """
+    Save checkpoint for evaluation or resume training
+    """
+    ckpt_dir = os.path.join(cfg.TRAIN.MODEL_SAVE_DIR, str(ckpt_name))
+    print("Save model checkpoint to {}".format(ckpt_dir))
+    if not os.path.isdir(ckpt_dir):
+        os.makedirs(ckpt_dir)
+    save_model(exe, program, ckpt_dir)
+    return ckpt_dir
+def load_checkpoint(exe, program):
+    """
+    Load a checkpoint from the resume model directory to resume training
+    """
+    print('Resume model training from:', cfg.TRAIN.RESUME_MODEL_DIR)
+    if not os.path.exists(cfg.TRAIN.RESUME_MODEL_DIR):
+        raise ValueError("TRAIN.RESUME_MODEL_DIR {} does not exist!".format(
+            cfg.TRAIN.RESUME_MODEL_DIR))
+    fluid.io.load_persistables(
+        exe, cfg.TRAIN.RESUME_MODEL_DIR, main_program=program)
+    model_path = cfg.TRAIN.RESUME_MODEL_DIR
+    # Check whether the path ends with a path separator
+    if model_path[-1] == os.sep:
+        model_path = model_path[0:-1]
+    epoch_name = os.path.basename(model_path)
+    # If the resumed model is the final model
+    if epoch_name == 'final':
+        begin_epoch = cfg.SOLVER.NUM_EPOCHS
+    # If the resume path ends with a number, restore the epoch from it
+    elif epoch_name.isdigit():
+        epoch = int(epoch_name)
+        begin_epoch = epoch + 1
+    else:
+        raise ValueError("Resume model path is not valid!")
+    print("Model checkpoint loaded successfully!")
+    return begin_epoch
+def print_info(*msg):
+    if cfg.TRAINER_ID == 0:
+        print(*msg)
+def train(cfg):
+    startup_prog = fluid.Program()
+    train_prog = fluid.Program()
+    drop_last = True
+    dataset = SegDataset(
+        file_list=cfg.DATASET.TRAIN_FILE_LIST,
+        mode=ModelPhase.TRAIN,
+        shuffle=True,
+        data_dir=cfg.DATASET.DATA_DIR)
+    def data_generator():
+        if args.use_mpio:
+            data_gen = dataset.multiprocess_generator(
+                num_processes=cfg.DATALOADER.NUM_WORKERS,
+                max_queue_size=cfg.DATALOADER.BUF_SIZE)
+        else:
+            data_gen = dataset.generator()
+        batch_data = []
+        for b in data_gen:
+            batch_data.append(b)
+            if len(batch_data) == (cfg.BATCH_SIZE // cfg.NUM_TRAINERS):
+                for item in batch_data:
+                    yield item[0], item[1], item[2]
+                batch_data = []
+        # With the sync batch norm strategy, drop the last batch if it holds fewer
+        # than cfg.BATCH_SIZE // cfg.NUM_TRAINERS samples to avoid NCCL hang issues
+        if not cfg.TRAIN.SYNC_BATCH_NORM:
+            for item in batch_data:
+                yield item[0], item[1], item[2]
+    # Get device environment
+    # places = fluid.cuda_places() if args.use_gpu else fluid.cpu_places()
+    # place = places[0]
+    gpu_id = int(os.environ.get('FLAGS_selected_gpus', 0))
+    place = fluid.CUDAPlace(gpu_id) if args.use_gpu else fluid.CPUPlace()
+    places = fluid.cuda_places() if args.use_gpu else fluid.cpu_places()
+    # Get the number of devices
+    dev_count = cfg.NUM_TRAINERS if cfg.NUM_TRAINERS > 1 else len(places)
+    print_info("#Device count: {}".format(dev_count))
+    # Make sure BATCH_SIZE is divisible by the number of GPU cards
+    assert cfg.BATCH_SIZE % dev_count == 0, (
+        'BATCH_SIZE:{} not divisible by number of GPUs:{}'.format(
+            cfg.BATCH_SIZE, dev_count))
+    # In multi-GPU training mode, batch data is allocated to each GPU evenly
+    batch_size_per_dev = cfg.BATCH_SIZE // dev_count
+    print_info("batch_size_per_dev: {}".format(batch_size_per_dev))
+    py_reader, avg_loss, lr, pred, grts, masks = build_model(
+        train_prog, startup_prog, phase=ModelPhase.TRAIN)
+    py_reader.decorate_sample_generator(
+        data_generator, batch_size=batch_size_per_dev, drop_last=drop_last)
+    exe = fluid.Executor(place)
+    exe.run(startup_prog)
+    exec_strategy = fluid.ExecutionStrategy()
+    if args.use_gpu:
+        exec_strategy.num_threads = fluid.core.get_cuda_device_count()
+    # Clear temporary variables every 100 iterations
+    exec_strategy.num_iteration_per_drop_scope = 100
+    build_strategy = fluid.BuildStrategy()
+    if cfg.NUM_TRAINERS > 1 and args.use_gpu:
+        dist_utils.prepare_for_multi_process(exe, build_strategy, train_prog)
+        exec_strategy.num_threads = 1
+    if cfg.TRAIN.SYNC_BATCH_NORM and args.use_gpu:
+        if dev_count > 1:
+            # Apply sync batch norm strategy
+            print_info("Sync BatchNorm strategy is effective.")
+            build_strategy.sync_batch_norm = True
+        else:
+            print_info("Sync BatchNorm strategy will not be effective if GPU device"
+                  " count <= 1")
+    pruned_params = cfg.SLIM.PRUNE_PARAMS.strip().split(',')
+    pruned_ratios = cfg.SLIM.PRUNE_RATIOS
+    if isinstance(pruned_ratios, float):
+        pruned_ratios = [pruned_ratios] * len(pruned_params)
+    elif isinstance(pruned_ratios, (list, tuple)):
+        pruned_ratios = list(pruned_ratios)
+    else:
+        raise ValueError('expected SLIM.PRUNE_RATIOS to be a float, list, or tuple, '
+                         'but received {}'.format(type(pruned_ratios)))
+    # Resume training
+    begin_epoch = cfg.SOLVER.BEGIN_EPOCH
+    if cfg.TRAIN.RESUME_MODEL_DIR:
+        begin_epoch = load_checkpoint(exe, train_prog)
+    # Load pretrained model
+    elif os.path.exists(cfg.TRAIN.PRETRAINED_MODEL_DIR):
+        print_info('Pretrained model dir: ', cfg.TRAIN.PRETRAINED_MODEL_DIR)
+        load_vars = []
+        load_fail_vars = []
+        def var_shape_matched(var, shape):
+            """
+            Check whether the persistable variable's shape matches the current network
+            """
+            var_exist = os.path.exists(
+                os.path.join(cfg.TRAIN.PRETRAINED_MODEL_DIR, var.name))
+            if var_exist:
+                var_shape = parse_shape_from_file(
+                    os.path.join(cfg.TRAIN.PRETRAINED_MODEL_DIR, var.name))
+                return var_shape == shape
+            return False
+        for x in train_prog.list_vars():
+            if isinstance(x, fluid.framework.Parameter):
+                shape = tuple(fluid.global_scope().find_var(
+                    x.name).get_tensor().shape())
+                if var_shape_matched(x, shape):
+                    load_vars.append(x)
+                else:
+                    load_fail_vars.append(x)
+        fluid.io.load_vars(
+            exe, dirname=cfg.TRAIN.PRETRAINED_MODEL_DIR, vars=load_vars)
+        for var in load_vars:
+            print_info("Parameter[{}] loaded successfully!".format(var.name))
+        for var in load_fail_vars:
+            print_info("Parameter[{}] doesn't exist or its shape does not match the current network, "
+                  "skipping it.".format(var.name))
+        print_info("{}/{} pretrained parameters loaded successfully!".format(
+            len(load_vars),
+            len(load_vars) + len(load_fail_vars)))
+    else:
+        print_info('Pretrained model dir {} does not exist, training from scratch...'.
+              format(cfg.TRAIN.PRETRAINED_MODEL_DIR))
+    fetch_list = [avg_loss.name, lr.name]
+    if args.debug:
+        # Fetch more variable info and use streaming confusion matrix to
+        # calculate IoU results if in debug mode
+        np.set_printoptions(
+            precision=4, suppress=True, linewidth=160, floatmode="fixed")
+        fetch_list.extend([pred.name, grts.name, masks.name])
+        cm = ConfusionMatrix(cfg.DATASET.NUM_CLASSES, streaming=True)
+    if args.use_tb:
+        if not args.tb_log_dir:
+            print_info("Please specify the log directory by --tb_log_dir.")
+            sys.exit(1)
+        from tb_paddle import SummaryWriter
+        log_writer = SummaryWriter(args.tb_log_dir)
+    pruner = Pruner()
+    train_prog = pruner.prune(
+        train_prog,
+        fluid.global_scope(),
+        params=pruned_params,
+        ratios=pruned_ratios,
+        place=place,
+        only_graph=False)[0]
+    compiled_train_prog = fluid.CompiledProgram(train_prog).with_data_parallel(
+        loss_name=avg_loss.name,
+        exec_strategy=exec_strategy,
+        build_strategy=build_strategy)
+    global_step = 0
+    all_step = cfg.DATASET.TRAIN_TOTAL_IMAGES // cfg.BATCH_SIZE
+    if cfg.DATASET.TRAIN_TOTAL_IMAGES % cfg.BATCH_SIZE and not drop_last:
+        all_step += 1
+    all_step *= (cfg.SOLVER.NUM_EPOCHS - begin_epoch + 1)
+    avg_loss = 0.0
+    timer = Timer()
+    timer.start()
+    if begin_epoch > cfg.SOLVER.NUM_EPOCHS:
+        raise ValueError(
+            ("begin epoch[{}] is larger than cfg.SOLVER.NUM_EPOCHS[{}]").format(
+                begin_epoch, cfg.SOLVER.NUM_EPOCHS))
+    if args.use_mpio:
+        print_info("Use multiprocess reader")
+    else:
+        print_info("Use multi-thread reader")
+    for epoch in range(begin_epoch, cfg.SOLVER.NUM_EPOCHS + 1):
+        py_reader.start()
+        while True:
+            try:
+                if args.debug:
+                    # Print category IoU and accuracy to check whether
+                    # training is progressing as expected
+                    loss, lr, pred, grts, masks = exe.run(
+                        program=compiled_train_prog,
+                        fetch_list=fetch_list,
+                        return_numpy=True)
+                    cm.calculate(pred, grts, masks)
+                    avg_loss += np.mean(np.array(loss))
+                    global_step += 1
+                    if global_step % args.log_steps == 0:
+                        speed = args.log_steps / timer.elapsed_time()
+                        avg_loss /= args.log_steps
+                        category_acc, mean_acc = cm.accuracy()
+                        category_iou, mean_iou = cm.mean_iou()
+                        print_info((
+                            "epoch={} step={} lr={:.5f} loss={:.4f} acc={:.5f} mIoU={:.5f} step/sec={:.3f} | ETA {}"
+                        ).format(epoch, global_step, lr[0], avg_loss, mean_acc,
+                                 mean_iou, speed,
+                                 calculate_eta(all_step - global_step, speed)))
+                        print_info("Category IoU: ", category_iou)
+                        print_info("Category Acc: ", category_acc)
+                        if args.use_tb:
+                            log_writer.add_scalar('Train/mean_iou', mean_iou,
+                                                  global_step)
+                            log_writer.add_scalar('Train/mean_acc', mean_acc,
+                                                  global_step)
+                            log_writer.add_scalar('Train/loss', avg_loss,
+                                                  global_step)
+                            log_writer.add_scalar('Train/lr', lr[0],
+                                                  global_step)
+                            log_writer.add_scalar('Train/step/sec', speed,
+                                                  global_step)
+                        sys.stdout.flush()
+                        avg_loss = 0.0
+                        cm.zero_matrix()
+                        timer.restart()
+                else:
+                    # If not in debug mode, avoid unnecessary logging and computation
+                    loss, lr = exe.run(
+                        program=compiled_train_prog,
+                        fetch_list=fetch_list,
+                        return_numpy=True)
+                    avg_loss += np.mean(np.array(loss))
+                    global_step += 1
+                    if global_step % args.log_steps == 0 and cfg.TRAINER_ID == 0:
+                        avg_loss /= args.log_steps
+                        speed = args.log_steps / timer.elapsed_time()
+                        print((
+                            "epoch={} step={} lr={:.5f} loss={:.4f} step/sec={:.3f} | ETA {}"
+                        ).format(epoch, global_step, lr[0], avg_loss, speed,
+                                 calculate_eta(all_step - global_step, speed)))
+                        if args.use_tb:
+                            log_writer.add_scalar('Train/loss', avg_loss,
+                                                  global_step)
+                            log_writer.add_scalar('Train/lr', lr[0],
+                                                  global_step)
+                            log_writer.add_scalar('Train/speed', speed,
+                                                  global_step)
+                        sys.stdout.flush()
+                        avg_loss = 0.0
+                        timer.restart()
+            except fluid.core.EOFException:
+                py_reader.reset()
+                break
+            except Exception as e:
+                print(e)
+        if epoch % cfg.TRAIN.SNAPSHOT_EPOCH == 0 and cfg.TRAINER_ID == 0:
+            ckpt_dir = save_prune_checkpoint(exe, train_prog, epoch)
+            if args.do_eval:
+                print("Evaluation start")
+                _, mean_iou, _, mean_acc = evaluate(
+                    cfg=cfg,
+                    ckpt_dir=ckpt_dir,
+                    use_gpu=args.use_gpu,
+                    use_mpio=args.use_mpio)
+                if args.use_tb:
+                    log_writer.add_scalar('Evaluate/mean_iou', mean_iou,
+                                          global_step)
+                    log_writer.add_scalar('Evaluate/mean_acc', mean_acc,
+                                          global_step)
+            # Use Tensorboard to visualize results
+            if args.use_tb and cfg.DATASET.VIS_FILE_LIST is not None:
+                visualize(
+                    cfg=cfg,
+                    use_gpu=args.use_gpu,
+                    vis_file_list=cfg.DATASET.VIS_FILE_LIST,
+                    vis_dir="visual",
+                    ckpt_dir=ckpt_dir,
+                    log_writer=log_writer)
+    # save final model
+    if cfg.TRAINER_ID == 0:
+        save_prune_checkpoint(exe, train_prog, 'final')
+def main(args):
+    if args.cfg_file is not None:
+        cfg.update_from_file(args.cfg_file)
+    if args.opts is not None:
+        cfg.update_from_list(args.opts)
+    cfg.TRAINER_ID = int(os.getenv("PADDLE_TRAINER_ID", 0))
+    cfg.NUM_TRAINERS = int(os.environ.get('PADDLE_TRAINERS_NUM', 1))
+    cfg.check_and_infer()
+    print_info(pprint.pformat(cfg))
+    train(cfg)
+if __name__ == '__main__':
+    args = parse_args()
+    if not fluid.core.is_compiled_with_cuda() and args.use_gpu:
+        print(
+            "You can not set use_gpu = True in the model because you are using paddlepaddle-cpu."
+        )
+        print(
+            "Please: 1. Install paddlepaddle-gpu to run your models on GPU or 2. Set use_gpu=False to run models on CPU."
+        )
+        sys.exit(1)
+    main(args)
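`train_prune.py` reads the parameters to prune from `SLIM.PRUNE_PARAMS` (a comma-separated string) and the ratios from `SLIM.PRUNE_RATIOS` (a single float or a list), then expands them before calling `Pruner.prune`. The sketch below isolates that normalization step in plain Python. The helper name `normalize_prune_config`, the parameter names in the example, and the final length check are assumptions added for illustration; the script itself does not perform the length check.

```python
def normalize_prune_config(params, ratios):
    """Return (param_names, ratios) with exactly one ratio per parameter name."""
    names = [p.strip() for p in params.strip().split(',') if p.strip()]
    if isinstance(ratios, float):
        # A single float applies the same ratio to every parameter.
        ratios = [ratios] * len(names)
    elif isinstance(ratios, (list, tuple)):
        ratios = list(ratios)
    else:
        raise ValueError('expected a float, list, or tuple, '
                         'but received {}'.format(type(ratios)))
    # Extra safety check (an assumption, not present in the training script).
    if len(ratios) != len(names):
        raise ValueError('got {} ratios for {} parameters'.format(
            len(ratios), len(names)))
    return names, ratios


# Hypothetical parameter names, used only to exercise the helper.
names, ratios = normalize_prune_config('conv1_weights, conv2_weights', 0.5)
print(names, ratios)
```

Validating the two lists together before pruning is one way to surface a mismatched config early instead of inside the pruning call.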
--- a/slim/quantization/README.md
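+++ b/slim/quantization/README.md
+> Before running this example, install Paddle 1.6 or later and PaddleSlim.
+# Segmentation Model Quantization Example
+## Overview
+This example uses the [quantization API](https://paddlepaddle.github.io/PaddleSlim/api/quantization_api/) provided by PaddleSlim to compress a segmentation model.
+Before reading this example, we recommend familiarizing yourself with the following:
+- [Standard training workflow for segmentation models](../../docs/usage.md)
+- [PaddleSlim documentation](https://paddlepaddle.github.io/PaddleSlim/)
+## Installing PaddleSlim
+Install PaddleSlim by following the steps in the [PaddleSlim documentation](https://paddlepaddle.github.io/PaddleSlim/).
+## Training
+### Dataset
+Download the dataset as described in the segmentation library tutorial and place it in the expected location.
+### Downloading the Trained Segmentation Model
+Run the following commands from the root directory of the segmentation library: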
+```bash
+mkdir pretrain
+cd pretrain
+wget https://paddleseg.bj.bcebos.com/models/mobilenet_cityscapes.tgz
+tar xf mobilenet_cityscapes.tgz
+```
+### Defining the Quantization Config
+config = {
+        'weight_quantize_type': 'channel_wise_abs_max',
+        'activation_quantize_type': 'moving_average_abs_max',
+        'quantize_op_types': ['depthwise_conv2d', 'mul', 'conv2d'],
+        'not_quant_pattern': ['last_conv']
+    }
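+For the meaning of each option and how to set it, see the [PaddleSlim quantization API](https://paddlepaddle.github.io/PaddleSlim/api/quantization_api/).
+### Inserting Quantization and Dequantization OPs
+Use the [PaddleSlim quant_aware API](https://paddlepaddle.github.io/PaddleSlim/api/quantization_api/#quant_aware) to insert quantization and dequantization OPs into the Program.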
+```
+compiled_train_prog = quant_aware(train_prog, place, config, for_test=False)
+```
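+### Disabling Some Training Strategies
+Quantization modifies the Program, so training strategies that also modify the Program must be turned off. In particular, ``sync_batch_norm`` causes errors when combined with multi-GPU quantization training and must be disabled.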
+```
+build_strategy.fuse_all_reduce_ops = False
+build_strategy.sync_batch_norm = False
+```
+### Starting Training
+Step 1: select the GPU
+```
+export CUDA_VISIBLE_DEVICES=0
+```
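+Step 2: add the ``pdseg`` directory to the Python module search path
+Run the following command from the root directory of the segmentation library: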
+```
+export PYTHONPATH=$PYTHONPATH:./pdseg
+```
+Step 3: start training
+Run the following command from the root directory of the segmentation library to train.
+```
+python -u ./slim/quantization/train_quant.py --log_steps 10 --not_quant_pattern last_conv --cfg configs/deeplabv3p_mobilenetv2_cityscapes.yaml --use_gpu --use_mpio --do_eval \
+TRAIN.PRETRAINED_MODEL_DIR "./pretrain/mobilenet_cityscapes/" \
+TRAIN.MODEL_SAVE_DIR "./snapshots/mobilenetv2_quant" \
+MODEL.DEEPLAB.ENCODER_WITH_ASPP False \
+MODEL.DEEPLAB.ENABLE_DECODER False \
+TRAIN.SYNC_BATCH_NORM False \
+SOLVER.LR 0.0001 \
+TRAIN.SNAPSHOT_EPOCH 1 \
+SOLVER.NUM_EPOCHS 30 \
+BATCH_SIZE 16
+```
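+### Model Structure During Training
+The [PaddleSlim quantization API](https://paddlepaddle.github.io/PaddleSlim/api/quantization_api/) documentation describes two interfaces, ``paddleslim.quant.quant_aware`` and ``paddleslim.quant.convert``.
+``paddleslim.quant.quant_aware`` inserts a consecutive quantization op and dequantization op before each input of operators such as conv2d, depthwise_conv2d, and mul, and changes some inputs of the corresponding backward operators. An example is shown below:
+<p align="center">
+<img src="./images/TransformPass.png" height=400 width=520 hspace='10'/> <br />
+<strong>Figure 1: Result after applying paddleslim.quant.quant_aware</strong>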
+</p>
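+### Evaluating While Training
+The test accuracy reported by the script during training is measured on the network structure shown in Figure 1.
+## Evaluation
+### Final Model for Evaluation
+``paddleslim.quant.convert`` mainly changes the order of the quantization and dequantization ops in the Program, rearranging the order shown in Figure 1 into the layout shown in Figure 2. In addition, ``paddleslim.quant.convert`` converts the parameters of operators such as `conv2d`, `depthwise_conv2d`, and `mul` into values within the quantized int8_t range (the data type remains float32), as illustrated in Figure 2:
+<p align="center">
+<img src="./images/FreezePass.png" height=400 width=420 hspace='10'/> <br />
+<strong>Figure 2: Result after paddleslim.quant.convert</strong>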
+</p>
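+The final quantized model is therefore obtained only after calling ``paddleslim.quant.convert``. This model can be loaded for inference with PaddleLite; see the tutorial [How Paddle-Lite loads and runs quantized models](https://github.com/PaddlePaddle/Paddle-Lite/wiki/model_quantization).
+### Evaluation Script
+Use the script [slim/quantization/eval_quant.py](./eval_quant.py) for evaluation.
+- Define the config. Use the same quantization config as the training script so that the model matches the one used in quantization training.
+- Use ``paddleslim.quant.quant_aware`` to insert the quantization and dequantization ops.
+- Use ``paddleslim.quant.convert`` to change the op order and obtain the final quantized model for evaluation.
+Evaluation command:
+Run the following from the root directory of the segmentation library: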
+```
+python -u ./slim/quantization/eval_quant.py  --cfg configs/deeplabv3p_mobilenetv2_cityscapes.yaml  --use_gpu --not_quant_pattern last_conv  --use_mpio --convert \
+TEST.TEST_MODEL "./snapshots/mobilenetv2_quant/best_model" \
+MODEL.DEEPLAB.ENCODER_WITH_ASPP False \
+MODEL.DEEPLAB.ENABLE_DECODER False \
+TRAIN.SYNC_BATCH_NORM False \
+BATCH_SIZE 16
+```
+## Quantization Results
+## FAQ
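The README describes quantization-aware training as inserting a quantize op followed by a dequantize op, and says that after conversion the weights hold values in the int8 range while still being stored as float32. The sketch below illustrates that behaviour for `channel_wise_abs_max` weight quantization in plain Python. It is a simplified model under stated assumptions, not PaddleSlim's implementation: the function names, the example weights, and the exact rounding and clipping rules are all illustrative.

```python
def channel_wise_abs_max(weights):
    """One scale per output channel: the largest absolute weight in that channel."""
    return [max(abs(w) for w in channel) for channel in weights]


def quantize(values, scale, bits=8):
    """Map floats to integer levels in [-qmax, qmax], kept as Python floats."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    if scale == 0:
        return [0.0 for _ in values]
    levels = []
    for v in values:
        level = round(v / scale * qmax)
        level = max(-qmax, min(qmax, level))  # clip anything outside the scale
        levels.append(float(level))
    return levels


def dequantize(levels, scale, bits=8):
    """Map integer levels back to approximate float values."""
    qmax = 2 ** (bits - 1) - 1
    return [q * scale / qmax for q in levels]


# Hypothetical weights: two output channels with three values each.
weights = [[0.4, -1.0, 0.3], [0.02, 0.06, -0.1]]
scales = channel_wise_abs_max(weights)
levels = [quantize(ch, s) for ch, s in zip(weights, scales)]
restored = [dequantize(q, s) for q, s in zip(levels, scales)]
print(levels)
print(restored)
```

During training the network would see something like `restored` (the quantize and dequantize pair applied back to back), so it learns to tolerate the rounding error. After conversion the stored weights would look like `levels`: whole numbers in the int8 range that are still floating-point values. Using one scale per channel keeps the rounding error small for channels whose weights are much smaller than the largest weight in the layer.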
--- a/slim/quantization/eval_quant.py
+++ b/slim/quantization/eval_quant.py
+# coding: utf8
+# copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import os
+import sys
+import time
+import argparse
+import functools
+import pprint
+import cv2
+import numpy as np
+import paddle
+import paddle.fluid as fluid
+from utils.config import cfg
+from utils.timer import Timer, calculate_eta
+from models.model_builder import build_model
+from models.model_builder import ModelPhase
+from reader import SegDataset
+from metrics import ConfusionMatrix
+from paddleslim.quant import quant_aware, convert
+def parse_args():
+    parser = argparse.ArgumentParser(description='PaddleSeg model evaluation')
+    parser.add_argument(
+        '--cfg',
+        dest='cfg_file',
+        help='Config file for training (and optionally testing)',
+        default=None,
+        type=str)
+    parser.add_argument(
+        '--use_gpu',
+        dest='use_gpu',
+        help='Use gpu or cpu',
+        action='store_true',
+        default=False)
+    parser.add_argument(
+        '--use_mpio',
+        dest='use_mpio',
+        help='Use multiprocess IO or not',
+        action='store_true',
+        default=False)
+    parser.add_argument(
+        'opts',
+        help='See utils/config.py for all options',
+        default=None,
+        nargs=argparse.REMAINDER)
+    parser.add_argument(
+        '--convert',
+        dest='convert',
+        help='Convert or not',
+        action='store_true',
+        default=False)
+    parser.add_argument(
+        "--not_quant_pattern",
+        nargs='+',
+        type=str,
+        help=
+        "Layers whose name_scope contains a string in not_quant_pattern will not be quantized"
+    )
+    if len(sys.argv) == 1:
+        parser.print_help()
+        sys.exit(1)
+    return parser.parse_args()
+def evaluate(cfg, ckpt_dir=None, use_gpu=False, use_mpio=False, **kwargs):
+    np.set_printoptions(precision=5, suppress=True)
+    startup_prog = fluid.Program()
+    test_prog = fluid.Program()
+    dataset = SegDataset(
+        file_list=cfg.DATASET.VAL_FILE_LIST,
+        mode=ModelPhase.EVAL,
+        data_dir=cfg.DATASET.DATA_DIR)
+    def data_generator():
+        # TODO: check whether the batch reader is compatible with Windows
+        if use_mpio:
+            data_gen = dataset.multiprocess_generator(
+                num_processes=cfg.DATALOADER.NUM_WORKERS,
+                max_queue_size=cfg.DATALOADER.BUF_SIZE)
+        else:
+            data_gen = dataset.generator()
+        for b in data_gen:
+            yield b[0], b[1], b[2]
+    py_reader, avg_loss, pred, grts, masks = build_model(
+        test_prog, startup_prog, phase=ModelPhase.EVAL)
+    py_reader.decorate_sample_generator(
+        data_generator, drop_last=False, batch_size=cfg.BATCH_SIZE)
+    # Get device environment
+    places = fluid.cuda_places() if use_gpu else fluid.cpu_places()
+    place = places[0]
+    dev_count = len(places)
+    print("#Device count: {}".format(dev_count))
+    exe = fluid.Executor(place)
+    exe.run(startup_prog)
+    test_prog = test_prog.clone(for_test=True)
+    not_quant_pattern_list = []
+    if kwargs['not_quant_pattern'] is not None:
+        not_quant_pattern_list = kwargs['not_quant_pattern']
+    config = {
+        'weight_quantize_type': 'channel_wise_abs_max',
+        'activation_quantize_type': 'moving_average_abs_max',
+        'quantize_op_types': ['depthwise_conv2d', 'mul', 'conv2d'],
+        'not_quant_pattern': not_quant_pattern_list
+    }
+    test_prog = quant_aware(test_prog, place, config, for_test=True)
+    ckpt_dir = cfg.TEST.TEST_MODEL if not ckpt_dir else ckpt_dir
+    # Validate the path before use; os.path.exists(None) would raise a TypeError
+    if not ckpt_dir or not os.path.exists(ckpt_dir):
+        raise ValueError('The TEST.TEST_MODEL {} is not found'.format(ckpt_dir))
+    print('load test model:', ckpt_dir)
+    fluid.io.load_persistables(exe, ckpt_dir, main_program=test_prog)
+    if kwargs['convert']:
+        test_prog = convert(test_prog, place, config)
+    # Use streaming confusion matrix to calculate mean_iou
+    np.set_printoptions(
+        precision=4, suppress=True, linewidth=160, floatmode="fixed")
+    conf_mat = ConfusionMatrix(cfg.DATASET.NUM_CLASSES, streaming=True)
+    fetch_list = [avg_loss.name, pred.name, grts.name, masks.name]
+    num_images = 0
+    step = 0
+    all_step = cfg.DATASET.TEST_TOTAL_IMAGES // cfg.BATCH_SIZE + 1
+    timer = Timer()
+    timer.start()
+    py_reader.start()
+    while True:
+        try:
+            step += 1
+            loss, pred, grts, masks = exe.run(
+                test_prog, fetch_list=fetch_list, return_numpy=True)
+            loss = np.mean(np.array(loss))
+            num_images += pred.shape[0]
+            conf_mat.calculate(pred, grts, masks)
+            _, iou = conf_mat.mean_iou()
+            _, acc = conf_mat.accuracy()
+            speed = 1.0 / timer.elapsed_time()
+            print(
+                "[EVAL]step={} loss={:.5f} acc={:.4f} IoU={:.4f} step/sec={:.2f} | ETA {}"
+                .format(step, loss, acc, iou, speed,
+                        calculate_eta(all_step - step, speed)))
+            timer.restart()
+            sys.stdout.flush()
+        except fluid.core.EOFException:
+            break
+    category_iou, avg_iou = conf_mat.mean_iou()
+    category_acc, avg_acc = conf_mat.accuracy()
+    print("[EVAL]#image={} acc={:.4f} IoU={:.4f}".format(
+        num_images, avg_acc, avg_iou))
+    print("[EVAL]Category IoU:", category_iou)
+    print("[EVAL]Category Acc:", category_acc)
+    print("[EVAL]Kappa:{:.4f}".format(conf_mat.kappa()))
+    return category_iou, avg_iou, category_acc, avg_acc
+def main():
+    args = parse_args()
+    if args.cfg_file is not None:
+        cfg.update_from_file(args.cfg_file)
+    if args.opts:
+        cfg.update_from_list(args.opts)
+    cfg.check_and_infer()
+    print(pprint.pformat(cfg))
+    evaluate(cfg, **args.__dict__)
+if __name__ == '__main__':
+    main()
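The evaluation loop above accumulates predictions batch by batch and reads the running mIoU from a streaming confusion matrix. The sketch below illustrates that idea in plain Python. `StreamingConfusionMatrix` is a hypothetical stand-in written for this note, not PaddleSeg's `ConfusionMatrix`, and it assumes that a non-zero mask marks a valid pixel and that classes which never appear count as IoU 0.

```python
class StreamingConfusionMatrix:
    """Illustrative stand-in; not the ConfusionMatrix class from metrics.py."""

    def __init__(self, num_classes):
        self.num_classes = num_classes
        # matrix[label][pred] counts pixels of class `label` predicted as `pred`
        self.matrix = [[0] * num_classes for _ in range(num_classes)]

    def update(self, preds, labels, masks):
        # Accumulate one batch; pixels with mask == 0 are ignored (assumption).
        for p, g, m in zip(preds, labels, masks):
            if m:
                self.matrix[g][p] += 1

    def mean_iou(self):
        ious = []
        for c in range(self.num_classes):
            tp = self.matrix[c][c]
            labelled = sum(self.matrix[c])                   # row sum
            predicted = sum(row[c] for row in self.matrix)   # column sum
            union = labelled + predicted - tp
            ious.append(tp / union if union else 0.0)
        return ious, sum(ious) / self.num_classes


cm = StreamingConfusionMatrix(num_classes=2)
# Two batches of flattened pixels; the last pixel of batch 2 is masked out.
cm.update(preds=[0, 1, 1, 1], labels=[0, 1, 0, 1], masks=[1, 1, 1, 1])
cm.update(preds=[0, 0], labels=[0, 1], masks=[1, 0])
category_iou, avg_iou = cm.mean_iou()
```

Because only the count matrix is kept between batches, memory use is independent of the number of evaluated images.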
--- a/slim/quantization/images/ConvertToInt8Pass.png
+++ b/slim/quantization/images/ConvertToInt8Pass.png
--- a/slim/quantization/images/FreezePass.png
+++ b/slim/quantization/images/FreezePass.png
--- a/slim/quantization/images/TransformForMobilePass.png
+++ b/slim/quantization/images/TransformForMobilePass.png
--- a/slim/quantization/images/TransformPass.png
+++ b/slim/quantization/images/TransformPass.png
--- a/slim/quantization/train_quant.py
+++ b/slim/quantization/train_quant.py
+# coding: utf8
+# copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import os
+import sys
+import argparse
+import pprint
+import random
+import shutil
+import functools
+import paddle
+import numpy as np
+import paddle.fluid as fluid
+from utils.config import cfg
+from utils.timer import Timer, calculate_eta
+from metrics import ConfusionMatrix
+from reader import SegDataset
+from models.model_builder import build_model
+from models.model_builder import ModelPhase
+from models.model_builder import parse_shape_from_file
+from eval_quant import evaluate
+from vis import visualize
+from utils import dist_utils
+from train import save_vars, save_checkpoint, load_checkpoint, update_best_model, print_info
+from paddleslim.quant import quant_aware
+def parse_args():
+    parser = argparse.ArgumentParser(description='PaddleSeg training')
+    parser.add_argument(
+        '--cfg',
+        dest='cfg_file',
+        help='Config file for training (and optionally testing)',
+        default=None,
+        type=str)
+    parser.add_argument(
+        '--use_gpu',
+        dest='use_gpu',
+        help='Use gpu or cpu',
+        action='store_true',
+        default=False)
+    parser.add_argument(
+        '--use_mpio',
+        dest='use_mpio',
+        help='Use multiprocess I/O or not',
+        action='store_true',
+        default=False)
+    parser.add_argument(
+        '--log_steps',
+        dest='log_steps',
+        help='Display logging information at every log_steps',
+        default=10,
+        type=int)
+    parser.add_argument(
+        '--debug',
+        dest='debug',
+        help='Debug mode: display detailed training information',
+        action='store_true')
+    parser.add_argument(
+        '--do_eval',
+        dest='do_eval',
+        help='Evaluate the model on every new checkpoint',
+        action='store_true')
+    parser.add_argument(
+        'opts',
+        help='See utils/config.py for all options',
+        default=None,
+        nargs=argparse.REMAINDER)
+    parser.add_argument(
+        '--enable_ce',
+        dest='enable_ce',
+        help='If set True, enable continuous evaluation job. '
+        'This flag is only used for internal test.',
+        action='store_true')
+    parser.add_argument(
+        "--not_quant_pattern",
+        nargs='+',
+        type=str,
+        help=
+        "Layers whose name_scope contains any string in not_quant_pattern will not be quantized"
+    )
+    return parser.parse_args()
+def train_quant(cfg):
+    startup_prog = fluid.Program()
+    train_prog = fluid.Program()
+    if args.enable_ce:
+        startup_prog.random_seed = 1000
+        train_prog.random_seed = 1000
+    drop_last = True
+    dataset = SegDataset(
+        file_list=cfg.DATASET.TRAIN_FILE_LIST,
+        mode=ModelPhase.TRAIN,
+        shuffle=True,
+        data_dir=cfg.DATASET.DATA_DIR)
+    def data_generator():
+        if args.use_mpio:
+            data_gen = dataset.multiprocess_generator(
+                num_processes=cfg.DATALOADER.NUM_WORKERS,
+                max_queue_size=cfg.DATALOADER.BUF_SIZE)
+        else:
+            data_gen = dataset.generator()
+        batch_data = []
+        for b in data_gen:
+            batch_data.append(b)
+            if len(batch_data) == (cfg.BATCH_SIZE // cfg.NUM_TRAINERS):
+                for item in batch_data:
+                    yield item[0], item[1], item[2]
+                batch_data = []
+        # With sync batch norm, drop the last batch if it holds fewer than
+        # cfg.BATCH_SIZE samples, to avoid NCCL hang issues
+        if not cfg.TRAIN.SYNC_BATCH_NORM:
+            for item in batch_data:
+                yield item[0], item[1], item[2]
+    # Get device environment
+    # places = fluid.cuda_places() if args.use_gpu else fluid.cpu_places()
+    # place = places[0]
+    gpu_id = int(os.environ.get('FLAGS_selected_gpus', 0))
+    place = fluid.CUDAPlace(gpu_id) if args.use_gpu else fluid.CPUPlace()
+    places = fluid.cuda_places() if args.use_gpu else fluid.cpu_places()
+    # Get the number of devices
+    dev_count = cfg.NUM_TRAINERS if cfg.NUM_TRAINERS > 1 else len(places)
+    print_info("#Device count: {}".format(dev_count))
+    # Make sure BATCH_SIZE is divisible by the number of devices
+    assert cfg.BATCH_SIZE % dev_count == 0, (
+        'BATCH_SIZE:{} not divisible by number of GPUs:{}'.format(
+            cfg.BATCH_SIZE, dev_count))
+    # In multi-GPU training, each batch is split evenly across the GPUs
+    batch_size_per_dev = cfg.BATCH_SIZE // dev_count
+    print_info("batch_size_per_dev: {}".format(batch_size_per_dev))
+    py_reader, avg_loss, lr, pred, grts, masks = build_model(
+        train_prog, startup_prog, phase=ModelPhase.TRAIN)
+    py_reader.decorate_sample_generator(
+        data_generator, batch_size=batch_size_per_dev, drop_last=drop_last)
+    exe = fluid.Executor(place)
+    exe.run(startup_prog)
+    exec_strategy = fluid.ExecutionStrategy()
+    if args.use_gpu:
+        exec_strategy.num_threads = fluid.core.get_cuda_device_count()
+    # Clear temporary variables every 100 iterations
+    exec_strategy.num_iteration_per_drop_scope = 100
+    build_strategy = fluid.BuildStrategy()
+    if cfg.NUM_TRAINERS > 1 and args.use_gpu:
+        dist_utils.prepare_for_multi_process(exe, build_strategy, train_prog)
+        exec_strategy.num_threads = 1
+    # Resume training
+    begin_epoch = cfg.SOLVER.BEGIN_EPOCH
+    if cfg.TRAIN.RESUME_MODEL_DIR:
+        begin_epoch = load_checkpoint(exe, train_prog)
+    # Load pretrained model
+    elif os.path.exists(cfg.TRAIN.PRETRAINED_MODEL_DIR):
+        print_info('Pretrained model dir: ', cfg.TRAIN.PRETRAINED_MODEL_DIR)
+        load_vars = []
+        load_fail_vars = []
+        def var_shape_matched(var, shape):
+            """
+            Check whether the shape of a persistable variable matches the current network
+            """
+            var_exist = os.path.exists(
+                os.path.join(cfg.TRAIN.PRETRAINED_MODEL_DIR, var.name))
+            if var_exist:
+                var_shape = parse_shape_from_file(
+                    os.path.join(cfg.TRAIN.PRETRAINED_MODEL_DIR, var.name))
+                return var_shape == shape
+            return False
+        for x in train_prog.list_vars():
+            if isinstance(x, fluid.framework.Parameter):
+                shape = tuple(fluid.global_scope().find_var(
+                    x.name).get_tensor().shape())
+                if var_shape_matched(x, shape):
+                    load_vars.append(x)
+                else:
+                    load_fail_vars.append(x)
+        fluid.io.load_vars(
+            exe, dirname=cfg.TRAIN.PRETRAINED_MODEL_DIR, vars=load_vars)
+        for var in load_vars:
+            print_info("Parameter[{}] loaded successfully!".format(var.name))
+        for var in load_fail_vars:
+            print_info(
+                "Parameter[{}] doesn't exist or its shape does not match the current network, "
+                "skipping it.".format(var.name))
+        print_info("{}/{} pretrained parameters loaded successfully!".format(
+            len(load_vars),
+            len(load_vars) + len(load_fail_vars)))
+    else:
+        print_info(
+            'Pretrained model dir {} does not exist, training from scratch...'.
+            format(cfg.TRAIN.PRETRAINED_MODEL_DIR))
+    fetch_list = [avg_loss.name, lr.name]
+    if args.debug:
+        # Fetch more variable info and use streaming confusion matrix to
+        # calculate IoU results if in debug mode
+        np.set_printoptions(
+            precision=4, suppress=True, linewidth=160, floatmode="fixed")
+        fetch_list.extend([pred.name, grts.name, masks.name])
+        cm = ConfusionMatrix(cfg.DATASET.NUM_CLASSES, streaming=True)
+    not_quant_pattern = []
+    if args.not_quant_pattern:
+        not_quant_pattern = args.not_quant_pattern
+    config = {
+        'weight_quantize_type': 'channel_wise_abs_max',
+        'activation_quantize_type': 'moving_average_abs_max',
+        'quantize_op_types': ['depthwise_conv2d', 'mul', 'conv2d'],
+        'not_quant_pattern': not_quant_pattern
+    }
+    compiled_train_prog = quant_aware(train_prog, place, config, for_test=False)
+    eval_prog = quant_aware(train_prog, place, config, for_test=True)
+    build_strategy.fuse_all_reduce_ops = False
+    build_strategy.sync_batch_norm = False
+    compiled_train_prog = compiled_train_prog.with_data_parallel(
+        loss_name=avg_loss.name,
+        exec_strategy=exec_strategy,
+        build_strategy=build_strategy)
+    # trainer_id = int(os.getenv("PADDLE_TRAINER_ID", 0))
+    # num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1))
+    global_step = 0
+    all_step = cfg.DATASET.TRAIN_TOTAL_IMAGES // cfg.BATCH_SIZE
+    if cfg.DATASET.TRAIN_TOTAL_IMAGES % cfg.BATCH_SIZE and not drop_last:
+        all_step += 1
+    all_step *= (cfg.SOLVER.NUM_EPOCHS - begin_epoch + 1)
+    avg_loss = 0.0
+    best_mIoU = 0.0
+    timer = Timer()
+    timer.start()
+    if begin_epoch > cfg.SOLVER.NUM_EPOCHS:
+        raise ValueError(
+            ("begin epoch[{}] is larger than cfg.SOLVER.NUM_EPOCHS[{}]").format(
+                begin_epoch, cfg.SOLVER.NUM_EPOCHS))
+    if args.use_mpio:
+        print_info("Use multiprocess reader")
+    else:
+        print_info("Use multi-thread reader")
+    for epoch in range(begin_epoch, cfg.SOLVER.NUM_EPOCHS + 1):
+        py_reader.start()
+        while True:
+            try:
+                if args.debug:
+                    # Print category IoU and accuracy to check whether
+                    # training is progressing as expected
+                    loss, lr, pred, grts, masks = exe.run(
+                        program=compiled_train_prog,
+                        fetch_list=fetch_list,
+                        return_numpy=True)
+                    cm.calculate(pred, grts, masks)
+                    avg_loss += np.mean(np.array(loss))
+                    global_step += 1
+                    if global_step % args.log_steps == 0:
+                        speed = args.log_steps / timer.elapsed_time()
+                        avg_loss /= args.log_steps
+                        category_acc, mean_acc = cm.accuracy()
+                        category_iou, mean_iou = cm.mean_iou()
+                        print_info((
+                            "epoch={} step={} lr={:.5f} loss={:.4f} acc={:.5f} mIoU={:.5f} step/sec={:.3f} | ETA {}"
+                        ).format(epoch, global_step, lr[0], avg_loss, mean_acc,
+                                 mean_iou, speed,
+                                 calculate_eta(all_step - global_step, speed)))
+                        print_info("Category IoU: ", category_iou)
+                        print_info("Category Acc: ", category_acc)
+                        sys.stdout.flush()
+                        avg_loss = 0.0
+                        cm.zero_matrix()
+                        timer.restart()
+                else:
+                    # If not in debug mode, avoid unnecessary logging and computation
+                    loss, lr = exe.run(
+                        program=compiled_train_prog,
+                        fetch_list=fetch_list,
+                        return_numpy=True)
+                    avg_loss += np.mean(np.array(loss))
+                    global_step += 1
+                    if global_step % args.log_steps == 0 and cfg.TRAINER_ID == 0:
+                        avg_loss /= args.log_steps
+                        speed = args.log_steps / timer.elapsed_time()
+                        print((
+                            "epoch={} step={} lr={:.5f} loss={:.4f} step/sec={:.3f} | ETA {}"
+                        ).format(epoch, global_step, lr[0], avg_loss, speed,
+                                 calculate_eta(all_step - global_step, speed)))
+                        sys.stdout.flush()
+                        avg_loss = 0.0
+                        timer.restart()
+            except fluid.core.EOFException:
+                py_reader.reset()
+                break
+            except Exception as e:
+                print(e)
+        if (epoch % cfg.TRAIN.SNAPSHOT_EPOCH == 0
+                or epoch == cfg.SOLVER.NUM_EPOCHS) and cfg.TRAINER_ID == 0:
+            ckpt_dir = save_checkpoint(exe, eval_prog, epoch)
+            if args.do_eval:
+                print("Evaluation start")
+                _, mean_iou, _, mean_acc = evaluate(
+                    cfg=cfg,
+                    ckpt_dir=ckpt_dir,
+                    use_gpu=args.use_gpu,
+                    use_mpio=args.use_mpio,
+                    not_quant_pattern=args.not_quant_pattern,
+                    convert=False)
+                if mean_iou > best_mIoU:
+                    best_mIoU = mean_iou
+                    update_best_model(ckpt_dir)
+                    print_info("Save best model {} to {}, mIoU = {:.4f}".format(
+                        ckpt_dir,
+                        os.path.join(cfg.TRAIN.MODEL_SAVE_DIR, 'best_model'),
+                        mean_iou))
+    # save final model
+    if cfg.TRAINER_ID == 0:
+        save_checkpoint(exe, eval_prog, 'final')
+def main(args):
+    if args.cfg_file is not None:
+        cfg.update_from_file(args.cfg_file)
+    if args.opts:
+        cfg.update_from_list(args.opts)
+    if args.enable_ce:
+        random.seed(0)
+        np.random.seed(0)
+    cfg.TRAINER_ID = int(os.getenv("PADDLE_TRAINER_ID", 0))
+    cfg.NUM_TRAINERS = int(os.environ.get('PADDLE_TRAINERS_NUM', 1))
+    cfg.check_and_infer()
+    print_info(pprint.pformat(cfg))
+    train_quant(cfg)
+if __name__ == '__main__':
+    args = parse_args()
+    if not fluid.core.is_compiled_with_cuda() and args.use_gpu:
+        print(
+            "You cannot set use_gpu = True because you are using paddlepaddle-cpu."
+        )
+        print(
+            "Please: 1. Install paddlepaddle-gpu to run your models on GPU or 2. Set use_gpu=False to run models on CPU."
+        )
+        sys.exit(1)
+    main(args)
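The quantization config above selects `channel_wise_abs_max` for weights. As a rough illustration of what that name refers to, the sketch below simulates 8-bit quantization with one abs-max scale per output channel. It is a simplified stand-alone example under stated assumptions, not PaddleSlim's implementation: the function name `fake_quant_channel_wise_abs_max` and the list-of-lists weight layout are hypothetical, and in the real library the work is done by operators that `quant_aware` inserts into the program.

```python
def fake_quant_channel_wise_abs_max(weights, bits=8):
    """Quantize then dequantize weights, one scale per output channel.

    weights: list of channels, each a list of floats (hypothetical layout).
    Returns (scales, dequantized_weights).
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for signed 8-bit values
    scales, dequantized = [], []
    for channel in weights:
        # The scale of a channel is its largest absolute weight.
        scale = max(abs(w) for w in channel)
        scales.append(scale)
        if scale == 0.0:
            dequantized.append([0.0 for _ in channel])
            continue
        # Map to the integer grid [-qmax, qmax], then back to floats.
        dequantized.append(
            [round(w / scale * qmax) * scale / qmax for w in channel])
    return scales, dequantized


weights = [[0.5, -1.0, 0.25], [0.02, 0.01, -0.04]]
scales, dequantized = fake_quant_channel_wise_abs_max(weights)
```

Using a separate scale per channel keeps the small-magnitude second channel from being crushed by the larger range of the first, which is the usual motivation for channel-wise over per-tensor scaling.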
--- a/turtorial/finetune_fast_scnn.md
+++ b/turtorial/finetune_fast_scnn.md
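+# Fast-SCNN Model Training Tutorial
+* This tutorial shows how to train on a custom dataset starting from the ***`Fast_scnn_cityscapes`*** pretrained model provided by PaddleSeg.
+* Before reading this tutorial, make sure you have gone through the [Quick Start](../README.md#快速入门) and [Basic Features](../README.md#基础功能) sections, so that you have a basic understanding of PaddleSeg.
+* All commands in this tutorial are run from the PaddleSeg root directory.
+## 1. Prepare the training data
+We have prepared a dataset in advance; download it with the following command: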
+```shell
+python dataset/download_pet.py
+```
+## 2. Download the pretrained model
+```shell
+python pretrained_model/download_model.py fast_scnn_cityscapes
+```
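+## 3. Prepare the configuration
+Next we need to settle the configuration. For the purposes of this tutorial, it falls into three parts:
+* Dataset
+  * Root directory of the training set
+  * Training set file list
+  * Test set file list
+  * Validation set file list
+* Pretrained model
+  * Name of the pretrained model
+  * Backbone network of the pretrained model
+  * Normalization type of the pretrained model
+  * Path to the pretrained model
+* Other
+  * Learning rate
+  * Batch size
+  * ...
+Of the three, the pretrained model configuration matters most: if the model or BACKBONE is configured incorrectly, the pretrained parameters will not be loaded, which in turn affects convergence speed. The configuration related to the pretrained model is as shown in step 2.
+The dataset configuration depends on the data path; in this tutorial, the data is stored in `dataset/mini_pet`.
+The remaining settings are tuned to the dataset and the machine environment. In the end we save a yaml configuration file with the following content at **configs/fast_scnn_pet.yaml**: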
+```yaml
+# Dataset configuration
+DATASET:
+    DATA_DIR: "./dataset/mini_pet/"
+    NUM_CLASSES: 3
+    TEST_FILE_LIST: "./dataset/mini_pet/file_list/test_list.txt"
+    TRAIN_FILE_LIST: "./dataset/mini_pet/file_list/train_list.txt"
+    VAL_FILE_LIST: "./dataset/mini_pet/file_list/val_list.txt"
+    VIS_FILE_LIST: "./dataset/mini_pet/file_list/test_list.txt"
+# Pretrained model configuration
+MODEL:
+    MODEL_NAME: "fast_scnn"
+    DEFAULT_NORM_TYPE: "bn"
+# Other configuration
+TRAIN_CROP_SIZE: (512, 512)
+EVAL_CROP_SIZE: (512, 512)
+AUG:
+    AUG_METHOD: "unpadding"
+    FIX_RESIZE_SIZE: (512, 512)
+BATCH_SIZE: 4
+TRAIN:
+    PRETRAINED_MODEL_DIR: "./pretrained_model/fast_scnn_cityscape/"
+    MODEL_SAVE_DIR: "./saved_model/fast_scnn_pet/"
+    SNAPSHOT_EPOCH: 10
+TEST:
+    TEST_MODEL: "./saved_model/fast_scnn_pet/final"
+SOLVER:
+    NUM_EPOCHS: 100
+    LR: 0.005
+    LR_POLICY: "poly"
+    OPTIMIZER: "sgd"
+```
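The config above sets `LR_POLICY: "poly"` with a base `LR` of 0.005. As a rough sketch of what a polynomial decay schedule does, the function below computes the learning rate at a given step. The exponent `power=0.9` and the end learning rate of 0 are assumptions made for illustration; check `utils/config.py` and the solver code for the values PaddleSeg actually uses. `poly_lr` is a hypothetical helper, not part of PaddleSeg.

```python
def poly_lr(base_lr, step, total_steps, power=0.9):
    """Polynomial decay from base_lr to 0 over total_steps.

    power=0.9 is an assumed value for illustration only.
    """
    step = min(step, total_steps)  # clamp so the rate never goes negative
    return base_lr * (1.0 - step / total_steps) ** power


total_steps = 1000  # hypothetical number of training steps
schedule = [poly_lr(0.005, s, total_steps) for s in (0, 250, 500, 750, 1000)]
```

The rate starts at the base value, decreases monotonically, and reaches 0 at the final step, so late iterations make progressively smaller updates.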
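+## 4. Validate the configuration and data
+Before training and evaluation, we also need to validate the configuration and data once to make sure they are correct. Start the validation with the following command: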
+```shell
+python pdseg/check.py --cfg ./configs/fast_scnn_pet.yaml
+```
+## 5. Start training
+Once validation passes, start training with the following command:
+```shell
+python pdseg/train.py --use_gpu --cfg ./configs/fast_scnn_pet.yaml
+```
+## 6. Run evaluation
+After training completes, start evaluation with the following command:
+```shell
+python pdseg/eval.py --use_gpu --cfg ./configs/fast_scnn_pet.yaml
+```
+## 7. Inference time comparison of real-time segmentation models
+| Model | Eval size | Inference time | mIoU on Cityscapes val |
+|---|---|---|---|
+| DeepLabv3+/MobileNetv2/bn | (1024, 2048) | 16.14 ms | 0.698 |
+| ICNet/bn | (1024, 2048) | 8.76 ms | 0.6831 |
+| Fast-SCNN/bn | (1024, 2048) | 6.28 ms | 0.6964 |
+The measurements above were taken on a V100 GPU.
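To put the inference times in the table above in perspective, the short calculation below converts them to approximate frames per second and to a speedup relative to Fast-SCNN. It assumes each reported time is the latency for a single image, which the table does not state explicitly.

```python
# Inference times in milliseconds, copied from the table above.
latency_ms = {
    "DeepLabv3+/MobileNetv2/bn": 16.14,
    "ICNet/bn": 8.76,
    "Fast-SCNN/bn": 6.28,
}

# Frames per second, assuming one image per measured inference.
fps = {name: 1000.0 / ms for name, ms in latency_ms.items()}

# How many times slower each model is than Fast-SCNN.
slowdown_vs_fast_scnn = {
    name: ms / latency_ms["Fast-SCNN/bn"] for name, ms in latency_ms.items()
}

for name in latency_ms:
    print("{}: {:.1f} FPS, {:.2f}x Fast-SCNN latency".format(
        name, fps[name], slowdown_vs_fast_scnn[name]))
```

Under that assumption, Fast-SCNN runs at roughly 159 FPS, about 2.6 times faster than the DeepLabv3+/MobileNetv2 configuration at nearly the same mIoU.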