Commit c67e3f88, authored by: ceci3

Merge branch 'develop' of https://github.com/PaddlePaddle/PaddleSlim into update_ofa

[submodule "demo/ocr/PaddleOCR"]
path = demo/ocr/PaddleOCR
url = https://github.com/PaddlePaddle/PaddleOCR
[style]
based_on_style = pep8
column_limit = 80
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
# PaddleSlim

中文 | [English](README_en.md)

[![Documentation Status](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](https://paddleslim.readthedocs.io/en/latest/)
[![Documentation Status](https://img.shields.io/badge/中文文档-最新-brightgreen.svg)](https://paddleslim.readthedocs.io/zh_CN/latest/)
[![License](https://img.shields.io/badge/license-Apache%202-blue.svg)](LICENSE)

PaddleSlim is a model-compression toolkit that bundles a series of strategies: model pruning, fixed-point quantization, knowledge distillation, hyperparameter search, and neural architecture search (NAS).

For application users, PaddleSlim provides complete compression solutions for vision tasks such as image classification, detection, and segmentation, and keeps exploring compression for NLP models as well. It also maintains, and keeps improving, benchmarks of each strategy on classic open-source tasks for reference.

For researchers and developers of compression algorithms, PaddleSlim provides low-level auxiliary APIs that make it convenient to reproduce, survey, and apply the methods in recent papers, and supports innovation on compression strategies through framework capabilities, technical consulting, and business scenarios.

## Features

- Model pruning
  - Uniform channel pruning
  - Sensitivity-based pruning
  - Automated pruning based on an evolutionary algorithm
- Quantization
  - Quantization-aware training
  - Post-training quantization
  - Global and channel-wise weight quantization
- Distillation
- Lightweight neural architecture search (Light-NAS)
  - Evolutionary-algorithm-based search
  - FLOPs / hardware-latency constraints
  - Latency estimation on multiple platforms

The table below lists the algorithms and tutorials for each module:
<table style="width:100%;" cellpadding="2" cellspacing="0" border="1" bordercolor="#000000">
<tbody>
<tr>
<td style="text-align:center;">
<span style="font-size:18px;">Module</span>
</td>
<td style="text-align:center;">
<span style="font-size:18px;">Algorithms</span>
</td>
<td style="text-align:center;">
<span style="font-size:18px;">Tutorials &amp; Docs</span>
</td>
</tr>
<tr>
<td style="text-align:center;">
<span style="font-size:12px;">Pruning</span><br />
</td>
<td>
<ul>
<li>
Sensitivity&nbsp;&nbsp;Pruner:&nbsp;<a href="https://arxiv.org/abs/1608.08710" target="_blank"><span style="font-family:&quot;font-size:14px;background-color:#FFFFFF;"><span style="font-family:&quot;font-size:14px;background-color:#FFFFFF;">Li H , Kadav A , Durdanovic I , et al. Pruning Filters for Efficient ConvNets[J]. 2016.</span></span></a>
</li>
<li>
AMC Pruner:&nbsp;<a href="https://arxiv.org/abs/1802.03494" target="_blank"><span style="font-family:&quot;font-size:13px;background-color:#FFFFFF;">He, Yihui , et al. "AMC: AutoML for Model Compression and Acceleration on Mobile Devices." (2018).</span></a>
</li>
<li>
FPGM Pruner:&nbsp;<a href="https://arxiv.org/abs/1811.00250" target="_blank"><span style="font-family:&quot;font-size:14px;background-color:#FFFFFF;">He Y , Liu P , Wang Z , et al. Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration[C]// IEEE/CVF Conference on Computer Vision &amp; Pattern Recognition. IEEE, 2019.</span></a>
</li>
<li>
Slim Pruner:<span style="background-color:#FFFDFA;">&nbsp;<a href="https://arxiv.org/pdf/1708.06519.pdf" target="_blank"><span style="font-family:&quot;font-size:14px;background-color:#FFFFFF;">Liu Z , Li J , Shen Z , et al. Learning Efficient Convolutional Networks through Network Slimming[J]. 2017.</span></a></span>
</li>
<li>
<span style="background-color:#FFFDFA;">Opt Slim Pruner:&nbsp;<a href="https://arxiv.org/pdf/2003.04566.pdf" target="_blank"><span style="font-family:&quot;font-size:14px;background-color:#FFFFFF;">Ye Y , You G , Fwu J K , et al. Channel Pruning via Optimal Thresholding[J]. 2020.</span></a><br />
</span>
</li>
</ul>
</td>
<td>
<ul>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/api_cn/prune_api.rst" target="_blank">Pruning API reference</a>
</li>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/quick_start/pruning_tutorial.md" target="_blank">Pruning quick-start example</a>
</li>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/tutorials/image_classification_sensitivity_analysis_tutorial.md" target="_blank">Sensitivity analysis tutorial for classification models</a>
</li>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/tutorials/paddledetection_slim_pruing_tutorial.md" target="_blank">Pruning tutorial for detection models</a>
</li>
<li>
<span id="__kindeditor_bookmark_start_313__"></span><a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/tutorials/paddledetection_slim_prune_dist_tutorial.md" target="_blank">Pruning + distillation tutorial for detection models</a>
</li>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/tutorials/paddledetection_slim_sensitivy_tutorial.md" target="_blank">Sensitivity analysis tutorial for detection models</a>
</li>
</ul>
</td>
</tr>
<tr>
<td style="text-align:center;">
Quantization
</td>
<td>
<ul>
<li>
Quantization Aware Training:&nbsp;<a href="https://arxiv.org/abs/1806.08342" target="_blank"><span style="font-family:&quot;font-size:14px;background-color:#FFFFFF;">Krishnamoorthi R . Quantizing deep convolutional networks for efficient inference: A whitepaper[J]. 2018.</span></a>
</li>
<li>
Post Training&nbsp;<span>Quantization&nbsp;</span><a href="http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf" target="_blank">principle</a>
</li>
<li>
Embedding&nbsp;<span>Quantization:&nbsp;<a href="https://arxiv.org/pdf/1603.01025.pdf" target="_blank"><span style="font-family:&quot;font-size:14px;background-color:#FFFFFF;">Miyashita D , Lee E H , Murmann B . Convolutional Neural Networks using Logarithmic Data Representation[J]. 2016.</span></a></span>
</li>
<li>
DSQ: <a href="https://arxiv.org/abs/1908.05033" target="_blank"><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">Gong, Ruihao, et al. "Differentiable soft quantization: Bridging full-precision and low-bit neural networks."&nbsp;</span><i>Proceedings of the IEEE International Conference on Computer Vision</i><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">. 2019.</span></a>
</li>
<li>
PACT:&nbsp; <a href="https://arxiv.org/abs/1805.06085" target="_blank"><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">Choi, Jungwook, et al. "Pact: Parameterized clipping activation for quantized neural networks."&nbsp;</span><i>arXiv preprint arXiv:1805.06085</i><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">&nbsp;(2018).</span></a>
</li>
</ul>
</td>
<td>
<ul>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/api_cn/quantization_api.rst" target="_blank">Quantization API reference</a>
</li>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/quick_start/quant_aware_tutorial.md" target="_blank">Quantization-aware training quick-start example</a>
</li>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/quick_start/quant_post_static_tutorial.md" target="_blank">Static post-training quantization quick-start example</a>
</li>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/tutorials/paddledetection_slim_quantization_tutorial.md" target="_blank">Quantization tutorial for detection models</a>
</li>
</ul>
</td>
</tr>
<tr>
<td style="text-align:center;">
Distillation
</td>
<td>
<ul>
<li>
<span>Knowledge Distillation</span>:&nbsp;<a href="https://arxiv.org/abs/1503.02531" target="_blank"><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network."&nbsp;</span><i>arXiv preprint arXiv:1503.02531</i><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">&nbsp;(2015).</span></a>
</li>
<li>
FSP <span>Knowledge Distillation</span>:&nbsp;&nbsp;<a href="http://openaccess.thecvf.com/content_cvpr_2017/papers/Yim_A_Gift_From_CVPR_2017_paper.pdf" target="_blank"><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">Yim, Junho, et al. "A gift from knowledge distillation: Fast optimization, network minimization and transfer learning."&nbsp;</span><i>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</i><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">. 2017.</span></a>
</li>
<li>
YOLO Knowledge Distillation:&nbsp;&nbsp;<a href="http://openaccess.thecvf.com/content_ECCVW_2018/papers/11133/Mehta_Object_detection_at_200_Frames_Per_Second_ECCVW_2018_paper.pdf" target="_blank"><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">Mehta, Rakesh, and Cemalettin Ozturk. "Object detection at 200 frames per second."&nbsp;</span><i>Proceedings of the European Conference on Computer Vision (ECCV)</i><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">. 2018.</span></a>
</li>
<li>
DML:&nbsp;<a href="https://arxiv.org/abs/1706.00384" target="_blank"><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">Zhang, Ying, et al. "Deep mutual learning."&nbsp;</span><i>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</i><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">. 2018.</span></a>
</li>
</ul>
</td>
<td>
<ul>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/api_cn/single_distiller_api.rst" target="_blank">Distillation API reference</a>
</li>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/quick_start/distillation_tutorial.md" target="_blank">Distillation quick-start example</a>
</li>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/tutorials/paddledetection_slim_distillation_tutorial.md" target="_blank">Distillation tutorial for detection models</a>
</li>
</ul>
</td>
</tr>
<tr>
<td style="text-align:center;">
Neural architecture search (NAS)
</td>
<td>
<ul>
<li>
Simulate Anneal NAS:&nbsp;<a href="https://arxiv.org/pdf/2005.04117.pdf" target="_blank"><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">Abdelhamed, Abdelrahman, et al. "Ntire 2020 challenge on real image denoising: Dataset, methods and results."&nbsp;</span><i>The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops</i><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">. Vol. 2. 2020.</span></a>
</li>
<li>
DARTS <a href="https://arxiv.org/abs/1806.09055" target="_blank"><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">Liu, Hanxiao, Karen Simonyan, and Yiming Yang. "Darts: Differentiable architecture search."&nbsp;</span><i>arXiv preprint arXiv:1806.09055</i><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">&nbsp;(2018).</span></a>
</li>
<li>
PC-DARTS <a href="https://arxiv.org/abs/1907.05737" target="_blank"><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">Xu, Yuhui, et al. "Pc-darts: Partial channel connections for memory-efficient differentiable architecture search."&nbsp;</span><i>arXiv preprint arXiv:1907.05737</i><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">&nbsp;(2019).</span></a>
</li>
<li>
One-Shot NAS
</li>
</ul>
</td>
<td>
<ul>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/api_cn/nas_api.rst" target="_blank">NAS API reference</a>
</li>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/api_cn/darts.rst" target="_blank">DARTS API reference</a>
</li>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/quick_start/nas_tutorial.md" target="_blank">NAS quick-start example</a>
</li>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/tutorials/paddledetection_slim_nas_tutorial.md" target="_blank">NAS tutorial for detection models</a>
</li>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/tutorials/sanas_darts_space.md" target="_blank">Advanced SANAS tutorial: compressing a DARTS-produced model</a>
</li>
</ul>
</td>
</tr>
</tbody>
</table>
<br />
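As a framework-free illustration of the filter-ranking criterion behind the pruners in the table above (Li et al.'s "Pruning Filters for Efficient ConvNets" ranks convolution filters by L1 norm and drops the smallest), here is a minimal sketch; the array shapes and the 50% ratio are assumptions made up for the example, not PaddleSlim's implementation:

```python
import numpy as np

def l1_filter_ranking(conv_weight, prune_ratio):
    """Return sorted indices of the filters to keep, ranked by L1 norm.

    conv_weight: array of shape (out_channels, in_channels, kh, kw).
    prune_ratio: fraction of output channels to remove.
    """
    num_filters = conv_weight.shape[0]
    # L1 norm of each filter, summed over its in_channels x kh x kw weights.
    norms = np.abs(conv_weight).reshape(num_filters, -1).sum(axis=1)
    num_keep = num_filters - int(num_filters * prune_ratio)
    # Keep the filters with the largest L1 norms.
    keep = np.argsort(norms)[::-1][:num_keep]
    return np.sort(keep)

# Toy 8-filter conv layer; prune 50% of the filters.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 3, 3, 3))
kept = l1_filter_ranking(w, 0.5)
pruned_w = w[kept]  # the slimmed-down weight tensor
```

A real pruner must also remove the matching input channels of the next layer; the sensitivity-based and evolutionary variants differ only in how `prune_ratio` is chosen per layer.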
## Installation

Before installing PaddleSlim, make sure Paddle 1.6 or later is installed correctly. See the [Paddle installation guide](https://www.paddlepaddle.org.cn/install/quick).
- Install the latest official release:

```bash
pip install paddleslim -i https://pypi.tuna.tsinghua.edu.cn/simple
```

or from the official PyPI index:

```bash
pip install paddleslim -i https://pypi.org/simple
```

- Install the develop version:

```bash
git clone https://github.com/PaddlePaddle/PaddleSlim.git
cd PaddleSlim
python setup.py install
```

- Install a historical version: see [pypi.org](https://pypi.org/project/paddleslim/#history) for the installable releases.

### Quantization and Paddle version compatibility

For inference on ARM or GPU, any version works; for inference on CPU, use the PaddleSlim 1.1.0 release that matches Paddle 2.0.

- Paddle 1.7.x requires PaddleSlim 1.0.1:

```bash
pip install paddleslim==1.0.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
```

- Paddle 1.8.x requires PaddleSlim 1.1.1:

```bash
pip install paddleslim==1.1.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
```

- Paddle 2.0.x requires PaddleSlim 1.1.0:

```bash
pip install paddleslim==1.1.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
```
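The version mapping above can be captured in a small helper function; this is a hypothetical convenience for illustration only and simply encodes the table above:

```python
def required_paddleslim(paddle_version):
    """Map a Paddle release series to the PaddleSlim version required
    for quantization support, per the compatibility notes above."""
    series = ".".join(paddle_version.split(".")[:2])  # "1.8.4" -> "1.8"
    mapping = {"1.7": "1.0.1", "1.8": "1.1.1", "2.0": "1.1.0"}
    try:
        return mapping[series]
    except KeyError:
        raise ValueError(f"no known PaddleSlim pin for Paddle {paddle_version}")
```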
## Usage

- [Quick start](docs/zh_cn/quick_start): simple examples that introduce how to use PaddleSlim quickly; the demos are based on small classification tasks such as MNIST and CIFAR-10.
- [Advanced tutorials](docs/zh_cn/tutorials): in-depth tutorials on analyzing and compressing classic models.
- [Model zoo](docs/zh_cn/model_zoo.md): results of each compression strategy on image classification, object detection, and semantic segmentation models, including accuracy, inference speed, and downloadable pretrained weights.
- [API documentation](https://paddlepaddle.github.io/PaddleSlim/api_cn/index.html): usage of the distillation, pruning, quantization, and NAS APIs.
- [Algorithm background](https://paddlepaddle.github.io/PaddleSlim/algo/algo.html): background on quantization, pruning, distillation, and NAS.
- [PaddleDetection](): how to use PaddleSlim in the detection library.
- [PaddleSeg](): how to use PaddleSlim in the segmentation library.
- [PaddleLite](): how to deploy models produced by PaddleSlim with the PaddleLite inference engine.
- Vision model compression examples
  - [SlimMobileNet](paddleslim/models#slimmobilenet系列指标)
  - [SlimFaceNet](demo/slimfacenet/README.md)
  - [OCR model compression (with PaddleOCR)](demo/ocr/README.md)
  - [Detection model compression (with PaddleDetection)](demo/detection/README.md)
## Benchmarks for selected strategies
### Image classification
Dataset: ImageNet2012; model: MobileNetV1

|Strategy |Accuracy gain (baseline: 70.91%) |Model size (baseline: 17.0M)|
|:---:|:---:|:---:|
| Knowledge distillation (ResNet50)| [+1.06%]() |-|
| Knowledge distillation (ResNet50) + int8 quantization-aware training |[+1.10%]()| [-71.76%]()|
| Pruning (FLOPs -50%) + int8 quantization-aware training|[-1.71%]()|[-86.47%]()|
### Object detection
#### Dataset: Pascal VOC; model: MobileNet-V1-YOLOv3
| Strategy | mAP (baseline: 76.2%) | Model size (baseline: 94MB) |
| :---------------------: | :------------: | :------------:|
| Knowledge distillation (ResNet34-YOLOv3) | [+2.8%](#) | - |
| Pruning (FLOPs -52.88%) | [+1.4%]() | [-67.76%]() |
| Knowledge distillation (ResNet34-YOLOv3) + pruning (FLOPs -69.57%)| [+2.6%]()|[-67.00%]()|
#### Dataset: COCO; model: MobileNet-V1-YOLOv3
| Strategy | mAP (baseline: 29.3%) | Model size|
| :---------------------: | :------------: | :------:|
| Knowledge distillation (ResNet34-YOLOv3) | [+2.1%]() |-|
| Knowledge distillation (ResNet34-YOLOv3) + pruning (FLOPs -67.56%) | [-0.3%]() | [-66.90%]()|
### NAS
Dataset: ImageNet2012; model: MobileNetV2

|Hardware | Inference latency | Top-1 accuracy (baseline: 71.90%) |
|:---------------:|:---------:|:--------------------:|
| RK3288 | [-23%]() | +0.07% |
| Android cellphone | [-20%]() | +0.16% |
| iPhone 6s | [-17%]() | +0.32% |
## License
This project is released under the [Apache 2.0 license](LICENSE).
## Contributing
Contributions to PaddleSlim are very welcome, and so is your feedback.
[中文](README.md) | English

Documentation: https://paddlepaddle.github.io/PaddleSlim
# PaddleSlim
PaddleSlim is a toolkit for model compression. It contains a collection of compression strategies, such as pruning, fixed-point quantization, knowledge distillation, hyperparameter search, and neural architecture search.

PaddleSlim provides compression solutions for computer vision models in tasks such as image classification, object detection, and semantic segmentation. Meanwhile, PaddleSlim keeps exploring advanced compression strategies for language models. Furthermore, benchmarks of the compression strategies on several open tasks are available for reference.

PaddleSlim also provides auxiliary and primitive APIs for developers and researchers to survey, implement, and apply the methods in the latest papers. PaddleSlim supports developers with framework capabilities and technical consulting.
## Features
### Pruning
- Uniform pruning of convolution layers
- Sensitivity-based pruning
- Automated pruning based on an evolutionary search strategy
- Support for pruning various deep architectures such as VGG, ResNet, and MobileNet
- Support for a self-defined pruning range, i.e., which layers to prune
### Fixed-Point Quantization
- **Training-aware**
  - Dynamic strategy: during inference, models are quantized with hyperparameters estimated dynamically from small batches of samples.
  - Static strategy: during inference, models are quantized with fixed hyperparameters estimated from the training data.
  - Support for layer-wise and channel-wise quantization.
- **Post-training**
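A minimal numeric illustration of symmetric max-abs int8 quantization, the basic idea behind the post-training mode above; this is a sketch of the concept, not PaddleSlim's actual implementation, and the weight values are made up:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric max-abs quantization to int8, returning (q, scale)."""
    scale = np.abs(x).max() / 127.0  # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.01, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)  # recovers w up to half a quantization step
```

Channel-wise quantization applies the same formula with one `scale` per output channel instead of one per tensor, which tightens the error for channels with small weight ranges.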
### Knowledge Distillation
- **Naive knowledge distillation:** transfers dark knowledge by merging the teacher and student model into the same Program
- **Paddle large-scale scalable knowledge distillation framework Pantheon:** a universal solution for knowledge distillation that is more flexible than naive knowledge distillation and easier to scale to large-scale applications.
- Decouple the teacher and student models --- they run in different processes in the same or different nodes, and transfer knowledge via TCP/IP ports or local files;
- Friendly to assemble multiple teacher models and each of them can work in either online or offline mode independently;
- Merge knowledge from different teachers and make batch data for the student model automatically;
- Support the large-scale knowledge prediction of teacher models on multiple devices.
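The "dark knowledge" transferred by naive distillation boils down to matching temperature-softened teacher and student distributions. A framework-free sketch of the soft-target loss from Hinton et al. (2015); the logits below are made-up example values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def soft_target_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between temperature-softened distributions,
    scaled by T^2 as recommended by Hinton et al. (2015)."""
    p = softmax(teacher_logits / T)  # softened teacher targets
    q = softmax(student_logits / T)
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return (T ** 2) * kl.mean()

teacher = np.array([[5.0, 1.0, 0.5]])
student = np.array([[4.0, 1.5, 0.2]])
loss = soft_target_loss(student, teacher)
```

In training this term is usually combined with the ordinary cross-entropy on hard labels; Pantheon's contribution is in how teacher predictions are produced and transported, not in the loss itself.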
### Neural Architecture Search
- Neural architecture search based on evolution strategy.
- Support distributed search.
- One-Shot neural architecture search.
- Support FLOPs and latency constrained search.
- Support the latency estimation on different hardware and platforms.
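The evolution-plus-constraint loop described above can be sketched in a few lines: mutate candidate architectures (here just a list of channel widths), reject candidates that violate a FLOPs budget, and keep the fittest. The cost and fitness functions are toy stand-ins invented for this example; a real search would measure FLOPs/latency and train-and-evaluate each candidate:

```python
import random

def toy_flops(widths):
    # Toy cost model: FLOPs scale with products of adjacent layer widths.
    return sum(a * b for a, b in zip(widths, widths[1:]))

def toy_fitness(widths):
    # Toy accuracy proxy: wider layers score higher.
    return sum(widths)

def evolve(seed, flops_budget, steps=200, rng_seed=0):
    rng = random.Random(rng_seed)
    best = list(seed)
    for _ in range(steps):
        # Mutate: nudge each layer width by -8, 0, or +8 channels.
        cand = [max(8, w + rng.choice([-8, 0, 8])) for w in best]
        if toy_flops(cand) > flops_budget:  # hard FLOPs constraint
            continue
        if toy_fitness(cand) > toy_fitness(best):
            best = cand
    return best

seed = [16, 16, 16, 16]
best = evolve(seed, flops_budget=2000)
```

Swapping the hard-reject for a latency predictor per target device gives the hardware-aware variant; the distributed mode farms candidate evaluation out to workers.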
## Install
Requires:
Paddle >= 1.7.0
```bash
pip install paddleslim -i https://pypi.org/simple
```
### Quantization
To use quantization in PaddleSlim, install PaddleSlim as follows.
For inference of quantized models on ARM or GPU, any PaddleSlim version works; for inference on CPU, install PaddleSlim 1.1.0.
- For Paddle 1.7, install PaddleSlim 1.0.1:
```bash
pip install paddleslim==1.0.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
```
- For Paddle 1.8, install PaddleSlim 1.1.1:
```bash
pip install paddleslim==1.1.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
```
- For Paddle 2.0, install PaddleSlim 1.1.0:
```bash
pip install paddleslim==1.1.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
```
## Usage
- [QuickStart](https://paddlepaddle.github.io/PaddleSlim/quick_start/index_en.html): Introduces how to use PaddleSlim through simple examples.
- [Advanced Tutorials](https://paddlepaddle.github.io/PaddleSlim/tutorials/index_en.html): Tutorials on advanced usage of PaddleSlim.
- [Model Zoo](https://paddlepaddle.github.io/PaddleSlim/model_zoo_en.html): Benchmarks and pretrained models.
- [API Documents](https://paddlepaddle.github.io/PaddleSlim/api_en/index_en.html)
- [Algorithm Background](https://paddlepaddle.github.io/PaddleSlim/algo/algo.html): Introduces the background of quantization, pruning, distillation, and NAS.
- [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection/tree/master/slim): Introduce how to use PaddleSlim in PaddleDetection library.
- [PaddleSeg](https://github.com/PaddlePaddle/PaddleSeg/tree/develop/slim): Introduce how to use PaddleSlim in PaddleSeg library.
- [PaddleLite](https://paddlepaddle.github.io/Paddle-Lite/): How to use PaddleLite to deploy models generated by PaddleSlim.
## Performance
### Image Classification
Dataset: ImageNet2012; Model: MobileNetV1;
|Method |Accuracy(baseline: 70.91%) |Model Size(baseline: 17.0M)|
|:---:|:---:|:---:|
| Knowledge Distillation(ResNet50)| [+1.06%]() |-|
| Knowledge Distillation(ResNet50) + int8 quantization |[+1.10%]()| [-71.76%]()|
| Pruning(FLOPs-50%) + int8 quantization|[-1.71%]()|[-86.47%]()|
### Object Detection
#### Dataset: Pascal VOC; Model: MobileNet-V1-YOLOv3
| Method | mAP(baseline: 76.2%) | Model Size(baseline: 94MB) |
| :---------------------: | :------------: | :------------:|
| Knowledge Distillation(ResNet34-YOLOv3) | [+2.8%]() | - |
| Pruning(FLOPs -52.88%) | [+1.4%]() | [-67.76%]() |
|Knowledge Distillation(ResNet34-YOLOv3)+Pruning(FLOPs-69.57%)| [+2.6%]()|[-67.00%]()|
#### Dataset: COCO; Model: MobileNet-V1-YOLOv3
| Method | mAP(baseline: 29.3%) | Model Size|
| :---------------------: | :------------: | :------:|
| Knowledge Distillation(ResNet34-YOLOv3) | [+2.1%]() |-|
| Knowledge Distillation(ResNet34-YOLOv3)+Pruning(FLOPs-67.56%) | [-0.3%]() | [-66.90%]()|
### NAS
Dataset: ImageNet2012; Model: MobileNetV2
|Device | Infer time cost | Top1 accuracy(baseline:71.90%) |
|:---------------:|:---------:|:--------------------:|
| RK3288 | [-23%]() | +0.07% |
| Android cellphone | [-20%]() | +0.16% |
| iPhone 6s | [-17%]() | +0.32% |
This example shows how to use automated pruning.

By default, the example automatically downloads and uses the MNIST dataset. The following models are supported:
- MobileNetV1
- MobileNetV2
- ResNet50
## 1. APIs
The example uses the following APIs:
- `paddleslim.prune.AutoPruner`
- `paddleslim.prune.Pruner`
## 2. Running the example
Two automated pruning modes are provided: pruning to the target ratio in a single pass, and pruning over multiple iterations.
### 2.1 Single-pass pruning
Run the example under `PaddleSlim/demo/auto_prune`:
```
export CUDA_VISIBLE_DEVICES=0
python train.py --model "MobileNet"
```
Take the best pruning-ratio list found by the search from the log, add it to `ratiolist` in `train_finetune.py`, then fine-tune to get the final result:
```
python train_finetune.py --model "MobileNet" --lr 0.1 --num_epochs 120 --step_epochs 30 60 90
```
Run `python train.py --help` for more options.
### 2.2 Iterative pruning
Run the example under `PaddleSlim/demo/auto_prune`:
```
export CUDA_VISIBLE_DEVICES=0
python train_iterator.py --model "MobileNet"
```
Take the best pruning-ratio list of this iteration from the log, add it to `ratiolist` in `train_finetune.py`, then fine-tune the result of this search:
```
python train_finetune.py --model "MobileNet"
```
Add the best pruning-ratio list of the first iteration to `ratiolist` in `train_iterator.py`, then run the second iteration:
```
python train_iterator.py --model "MobileNet" --pretrained_model "checkpoint/Mobilenet/19"
```
Fine-tune the result of the second iteration, and keep iterating in this way until the target pruning ratio is reached. Finally, fine-tune the end result:
```
python train_finetune.py --model "MobileNet" --pretrained_model "checkpoint/Mobilenet/19" --num_epochs 70 --step_epochs 10 40
```
## 3. Notes
### 3.1 One-shot pruning
In `paddleslim.prune.AutoPruner`, the `pruned_flops` argument is the minimum expected FLOPs reduction ratio.
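For example, `pruned_flops=0.1` asks the pruner for at least a 10% FLOPs reduction. A small plain-Python sketch of how that reduction ratio is defined (the FLOPs counts below are illustrative):

```python
def flops_reduction(baseline_flops, pruned_flops_count):
    """Fraction of baseline FLOPs removed by pruning."""
    return 1.0 - pruned_flops_count / baseline_flops

# A sub-network with 880k FLOPs pruned from a 1M-FLOPs baseline removes
# 12% of the FLOPs, which satisfies a pruned_flops target of 0.1.
assert round(flops_reduction(1_000_000, 880_000), 2) == 0.12
```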
### 3.2 Iterative pruning
Keep the pruning target of any single iteration at 10% or below.
Before loading the previous iteration's checkpoint, delete the `learning_rate` and `@LR_DECAY_COUNTER@` files under the checkpoint directory; otherwise the old learning rate is inherited and hurts the fine-tuning result.
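The clean-up step above can be sketched as a small helper (assuming the checkpoint directory layout described in the note):

```python
import os

def drop_lr_state(checkpoint_dir):
    # Remove the optimizer's learning-rate state so fine-tuning follows
    # the new schedule instead of inheriting the previous iteration's.
    for name in ("learning_rate", "@LR_DECAY_COUNTER@"):
        path = os.path.join(checkpoint_dir, name)
        if os.path.exists(path):
            os.remove(path)
```

Call it on the checkpoint directory before loading variables with `fluid.io.load_vars`.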
@@ -116,8 +116,8 @@ def compress(args):
         fluid.io.load_vars(exe, args.pretrained_model, predicate=if_exist)
-    val_reader = paddle.batch(val_reader, batch_size=args.batch_size)
-    train_reader = paddle.batch(
+    val_reader = paddle.fluid.io.batch(val_reader, batch_size=args.batch_size)
+    train_reader = paddle.fluid.io.batch(
         train_reader, batch_size=args.batch_size, drop_last=True)
     train_feeder = feeder = fluid.DataFeeder([image, label], place)
import os
import sys
import logging
import paddle
import argparse
import functools
import math
import paddle.fluid as fluid
import imagenet_reader as reader
import models
from utility import add_arguments, print_arguments
import numpy as np
import time
from paddleslim.prune import Pruner
from paddleslim.analysis import flops
parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
# yapf: disable
add_arg('batch_size', int, 64 * 4, "Minibatch size.")
add_arg('use_gpu', bool, True, "Whether to use GPU or not.")
add_arg('model', str, "MobileNet", "The target model.")
add_arg('model_save_dir', str, "./", "checkpoint model.")
add_arg('pretrained_model', str, "../pretrained_model/MobileNetV1_pretained", "Whether to use pretrained model.")
add_arg('lr', float, 0.01, "The learning rate used to fine-tune pruned model.")
add_arg('lr_strategy', str, "piecewise_decay", "The learning rate decay strategy.")
add_arg('l2_decay', float, 3e-5, "The l2_decay parameter.")
add_arg('momentum_rate', float, 0.9, "The value of momentum_rate.")
add_arg('num_epochs', int, 20, "The number of total epochs.")
add_arg('total_images', int, 1281167, "The number of total training images.")
parser.add_argument('--step_epochs', nargs='+', type=int, default=[5, 15], help="piecewise decay step")
add_arg('config_file', str, None, "The config file for compression with yaml format.")
# yapf: enable
model_list = [m for m in dir(models) if "__" not in m]
ratiolist = [
# [0.06, 0.0, 0.09, 0.03, 0.09, 0.02, 0.05, 0.03, 0.0, 0.07, 0.07, 0.05, 0.08],
# [0.08, 0.02, 0.03, 0.13, 0.1, 0.06, 0.03, 0.04, 0.14, 0.02, 0.03, 0.02, 0.01],
]
def save_model(args, exe, train_prog, eval_prog, info):
model_path = os.path.join(args.model_save_dir, args.model, str(info))
if not os.path.isdir(model_path):
os.makedirs(model_path)
fluid.io.save_persistables(exe, model_path, main_program=train_prog)
    print("Saved model in %s" % model_path)
def piecewise_decay(args):
step = int(math.ceil(float(args.total_images) / args.batch_size))
bd = [step * e for e in args.step_epochs]
lr = [args.lr * (0.1**i) for i in range(len(bd) + 1)]
learning_rate = fluid.layers.piecewise_decay(boundaries=bd, values=lr)
optimizer = fluid.optimizer.Momentum(
learning_rate=learning_rate,
momentum=args.momentum_rate,
regularization=fluid.regularizer.L2Decay(args.l2_decay))
return optimizer
def cosine_decay(args):
step = int(math.ceil(float(args.total_images) / args.batch_size))
learning_rate = fluid.layers.cosine_decay(
learning_rate=args.lr, step_each_epoch=step, epochs=args.num_epochs)
optimizer = fluid.optimizer.Momentum(
learning_rate=learning_rate,
momentum=args.momentum_rate,
regularization=fluid.regularizer.L2Decay(args.l2_decay))
return optimizer
def create_optimizer(args):
if args.lr_strategy == "piecewise_decay":
return piecewise_decay(args)
elif args.lr_strategy == "cosine_decay":
return cosine_decay(args)
def compress(args):
class_dim = 1000
image_shape = "3,224,224"
image_shape = [int(m) for m in image_shape.split(",")]
assert args.model in model_list, "{} is not in lists: {}".format(
args.model, model_list)
image = fluid.layers.data(name='image', shape=image_shape, dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
# model definition
model = models.__dict__[args.model]()
out = model.net(input=image, class_dim=class_dim)
cost = fluid.layers.cross_entropy(input=out, label=label)
avg_cost = fluid.layers.mean(x=cost)
acc_top1 = fluid.layers.accuracy(input=out, label=label, k=1)
acc_top5 = fluid.layers.accuracy(input=out, label=label, k=5)
val_program = fluid.default_main_program().clone(for_test=True)
opt = create_optimizer(args)
opt.minimize(avg_cost)
place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
if args.pretrained_model:
def if_exist(var):
exist = os.path.exists(
os.path.join(args.pretrained_model, var.name))
print("exist", exist)
return exist
#fluid.io.load_vars(exe, args.pretrained_model, predicate=if_exist)
val_reader = paddle.fluid.io.batch(reader.val(), batch_size=args.batch_size)
train_reader = paddle.fluid.io.batch(
reader.train(), batch_size=args.batch_size, drop_last=True)
train_feeder = feeder = fluid.DataFeeder([image, label], place)
val_feeder = feeder = fluid.DataFeeder(
[image, label], place, program=val_program)
def test(epoch, program):
batch_id = 0
acc_top1_ns = []
acc_top5_ns = []
for data in val_reader():
start_time = time.time()
acc_top1_n, acc_top5_n = exe.run(
program,
feed=train_feeder.feed(data),
fetch_list=[acc_top1.name, acc_top5.name])
end_time = time.time()
print(
"Eval epoch[{}] batch[{}] - acc_top1: {}; acc_top5: {}; time: {}".
format(epoch, batch_id,
np.mean(acc_top1_n),
np.mean(acc_top5_n), end_time - start_time))
acc_top1_ns.append(np.mean(acc_top1_n))
acc_top5_ns.append(np.mean(acc_top5_n))
batch_id += 1
print("Final eval epoch[{}] - acc_top1: {}; acc_top5: {}".format(
epoch,
np.mean(np.array(acc_top1_ns)), np.mean(np.array(acc_top5_ns))))
def train(epoch, program):
build_strategy = fluid.BuildStrategy()
exec_strategy = fluid.ExecutionStrategy()
train_program = fluid.compiler.CompiledProgram(
program).with_data_parallel(
loss_name=avg_cost.name,
build_strategy=build_strategy,
exec_strategy=exec_strategy)
batch_id = 0
for data in train_reader():
start_time = time.time()
loss_n, acc_top1_n, acc_top5_n, lr_n = exe.run(
train_program,
feed=train_feeder.feed(data),
fetch_list=[
avg_cost.name, acc_top1.name, acc_top5.name,
"learning_rate"
])
end_time = time.time()
loss_n = np.mean(loss_n)
acc_top1_n = np.mean(acc_top1_n)
acc_top5_n = np.mean(acc_top5_n)
lr_n = np.mean(lr_n)
print(
"epoch[{}]-batch[{}] - loss: {}; acc_top1: {}; acc_top5: {};lrn: {}; time: {}".
format(epoch, batch_id, loss_n, acc_top1_n, acc_top5_n, lr_n,
end_time - start_time))
batch_id += 1
params = []
for param in fluid.default_main_program().global_block().all_parameters():
#if "_weights" in param.name and "conv1_weights" not in param.name:
if "_sep_weights" in param.name:
params.append(param.name)
    print("FLOPs before pruning: {}".format(
        flops(fluid.default_main_program())))
pruned_program_iter = fluid.default_main_program()
pruned_val_program_iter = val_program
for ratios in ratiolist:
pruner = Pruner()
pruned_val_program_iter = pruner.prune(
pruned_val_program_iter,
fluid.global_scope(),
params=params,
ratios=ratios,
place=place,
only_graph=True)
pruned_program_iter = pruner.prune(
pruned_program_iter,
fluid.global_scope(),
params=params,
ratios=ratios,
place=place)
        print("FLOPs after pruning: {}".format(flops(pruned_program_iter)))
""" do not inherit learning rate """
if (os.path.exists(args.pretrained_model + "/learning_rate")):
os.remove(args.pretrained_model + "/learning_rate")
if (os.path.exists(args.pretrained_model + "/@LR_DECAY_COUNTER@")):
os.remove(args.pretrained_model + "/@LR_DECAY_COUNTER@")
fluid.io.load_vars(
exe,
args.pretrained_model,
main_program=pruned_program_iter,
predicate=if_exist)
pruned_program = pruned_program_iter
pruned_val_program = pruned_val_program_iter
for i in range(args.num_epochs):
train(i, pruned_program)
test(i, pruned_val_program)
save_model(args, exe, pruned_program, pruned_val_program, i)
def main():
args = parser.parse_args()
print_arguments(args)
compress(args)
if __name__ == '__main__':
main()
import os
import sys
import logging
import paddle
import argparse
import functools
import math
import time
import numpy as np
import paddle.fluid as fluid
from paddleslim.prune import AutoPruner
from paddleslim.common import get_logger
from paddleslim.analysis import flops
from paddleslim.prune import Pruner
sys.path.append(sys.path[0] + "/../")
import models
from utility import add_arguments, print_arguments
_logger = get_logger(__name__, level=logging.INFO)
parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
# yapf: disable
add_arg('batch_size', int, 64 * 4, "Minibatch size.")
add_arg('use_gpu', bool, True, "Whether to use GPU or not.")
add_arg('model', str, "MobileNet", "The target model.")
add_arg('pretrained_model', str, "../pretrained_model/MobileNetV1_pretained", "Whether to use pretrained model.")
add_arg('model_save_dir', str, "./", "checkpoint model.")
add_arg('lr', float, 0.1, "The learning rate used to fine-tune pruned model.")
add_arg('lr_strategy', str, "piecewise_decay", "The learning rate decay strategy.")
add_arg('l2_decay', float, 3e-5, "The l2_decay parameter.")
add_arg('momentum_rate', float, 0.9, "The value of momentum_rate.")
add_arg('num_epochs', int, 120, "The number of total epochs.")
add_arg('total_images', int, 1281167, "The number of total training images.")
parser.add_argument('--step_epochs', nargs='+', type=int, default=[30, 60, 90], help="piecewise decay step")
add_arg('config_file', str, None, "The config file for compression with yaml format.")
add_arg('data', str, "mnist", "Which data to use. 'mnist' or 'imagenet'")
add_arg('log_period', int, 10, "Log period in batches.")
add_arg('test_period', int, 10, "Test period in epochs.")
# yapf: enable
model_list = [m for m in dir(models) if "__" not in m]
ratiolist = [
# [0.06, 0.0, 0.09, 0.03, 0.09, 0.02, 0.05, 0.03, 0.0, 0.07, 0.07, 0.05, 0.08],
# [0.08, 0.02, 0.03, 0.13, 0.1, 0.06, 0.03, 0.04, 0.14, 0.02, 0.03, 0.02, 0.01],
]
def piecewise_decay(args):
step = int(math.ceil(float(args.total_images) / args.batch_size))
bd = [step * e for e in args.step_epochs]
lr = [args.lr * (0.1**i) for i in range(len(bd) + 1)]
learning_rate = fluid.layers.piecewise_decay(boundaries=bd, values=lr)
optimizer = fluid.optimizer.Momentum(
learning_rate=learning_rate,
momentum=args.momentum_rate,
regularization=fluid.regularizer.L2Decay(args.l2_decay))
return optimizer
def cosine_decay(args):
step = int(math.ceil(float(args.total_images) / args.batch_size))
learning_rate = fluid.layers.cosine_decay(
learning_rate=args.lr, step_each_epoch=step, epochs=args.num_epochs)
optimizer = fluid.optimizer.Momentum(
learning_rate=learning_rate,
momentum=args.momentum_rate,
regularization=fluid.regularizer.L2Decay(args.l2_decay))
return optimizer
def create_optimizer(args):
if args.lr_strategy == "piecewise_decay":
return piecewise_decay(args)
elif args.lr_strategy == "cosine_decay":
return cosine_decay(args)
def compress(args):
train_reader = None
test_reader = None
if args.data == "mnist":
import paddle.dataset.mnist as reader
train_reader = reader.train()
val_reader = reader.test()
class_dim = 10
image_shape = "1,28,28"
elif args.data == "imagenet":
import imagenet_reader as reader
train_reader = reader.train()
val_reader = reader.val()
class_dim = 1000
image_shape = "3,224,224"
else:
raise ValueError("{} is not supported.".format(args.data))
image_shape = [int(m) for m in image_shape.split(",")]
assert args.model in model_list, "{} is not in lists: {}".format(
args.model, model_list)
image = fluid.layers.data(name='image', shape=image_shape, dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
# model definition
model = models.__dict__[args.model]()
out = model.net(input=image, class_dim=class_dim)
cost = fluid.layers.cross_entropy(input=out, label=label)
avg_cost = fluid.layers.mean(x=cost)
acc_top1 = fluid.layers.accuracy(input=out, label=label, k=1)
acc_top5 = fluid.layers.accuracy(input=out, label=label, k=5)
val_program = fluid.default_main_program().clone(for_test=True)
opt = create_optimizer(args)
opt.minimize(avg_cost)
place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
if args.pretrained_model:
def if_exist(var):
return os.path.exists(
os.path.join(args.pretrained_model, var.name))
# fluid.io.load_vars(exe, args.pretrained_model, predicate=if_exist)
val_reader = paddle.fluid.io.batch(val_reader, batch_size=args.batch_size)
train_reader = paddle.fluid.io.batch(
train_reader, batch_size=args.batch_size, drop_last=True)
train_feeder = feeder = fluid.DataFeeder([image, label], place)
val_feeder = feeder = fluid.DataFeeder(
[image, label], place, program=val_program)
def test(epoch, program):
batch_id = 0
acc_top1_ns = []
acc_top5_ns = []
for data in val_reader():
start_time = time.time()
acc_top1_n, acc_top5_n = exe.run(
program,
feed=train_feeder.feed(data),
fetch_list=[acc_top1.name, acc_top5.name])
end_time = time.time()
if batch_id % args.log_period == 0:
_logger.info(
"Eval epoch[{}] batch[{}] - acc_top1: {}; acc_top5: {}; time: {}".
format(epoch, batch_id,
np.mean(acc_top1_n),
np.mean(acc_top5_n), end_time - start_time))
acc_top1_ns.append(np.mean(acc_top1_n))
acc_top5_ns.append(np.mean(acc_top5_n))
batch_id += 1
_logger.info("Final eval epoch[{}] - acc_top1: {}; acc_top5: {}".
format(epoch,
np.mean(np.array(acc_top1_ns)),
np.mean(np.array(acc_top5_ns))))
return np.mean(np.array(acc_top1_ns))
def train(epoch, program):
build_strategy = fluid.BuildStrategy()
exec_strategy = fluid.ExecutionStrategy()
train_program = fluid.compiler.CompiledProgram(
program).with_data_parallel(
loss_name=avg_cost.name,
build_strategy=build_strategy,
exec_strategy=exec_strategy)
batch_id = 0
for data in train_reader():
start_time = time.time()
loss_n, acc_top1_n, acc_top5_n = exe.run(
train_program,
feed=train_feeder.feed(data),
fetch_list=[avg_cost.name, acc_top1.name, acc_top5.name])
end_time = time.time()
loss_n = np.mean(loss_n)
acc_top1_n = np.mean(acc_top1_n)
acc_top5_n = np.mean(acc_top5_n)
if batch_id % args.log_period == 0:
_logger.info(
"epoch[{}]-batch[{}] - loss: {}; acc_top1: {}; acc_top5: {}; time: {}".
format(epoch, batch_id, loss_n, acc_top1_n, acc_top5_n,
end_time - start_time))
batch_id += 1
params = []
for param in fluid.default_main_program().global_block().all_parameters():
if "_sep_weights" in param.name:
params.append(param.name)
pruned_program_iter = fluid.default_main_program()
pruned_val_program_iter = val_program
for ratios in ratiolist:
pruner = Pruner()
pruned_val_program_iter = pruner.prune(
pruned_val_program_iter,
fluid.global_scope(),
params=params,
ratios=ratios,
place=place,
only_graph=True)
pruned_program_iter = pruner.prune(
pruned_program_iter,
fluid.global_scope(),
params=params,
ratios=ratios,
place=place)
        print("FLOPs after pruning: {}".format(flops(pruned_program_iter)))
fluid.io.load_vars(
exe,
args.pretrained_model,
main_program=pruned_program_iter,
predicate=if_exist)
pruner = AutoPruner(
pruned_val_program_iter,
fluid.global_scope(),
place,
params=params,
init_ratios=[0.1] * len(params),
pruned_flops=0.1,
pruned_latency=None,
server_addr=("", 0),
init_temperature=100,
reduce_rate=0.85,
max_try_times=300,
max_client_num=10,
search_steps=100,
max_ratios=0.2,
min_ratios=0.,
is_server=True,
key="auto_pruner")
while True:
pruned_program, pruned_val_program = pruner.prune(
pruned_program_iter, pruned_val_program_iter)
for i in range(0):
train(i, pruned_program)
score = test(0, pruned_val_program)
pruner.reward(score)
def main():
args = parser.parse_args()
print_arguments(args)
compress(args)
if __name__ == '__main__':
main()
CUDA_VISIBLE_DEVICES=0 python2 -u train_cell_base.py
import numpy as np
from itertools import izip
import paddle.fluid as fluid
from paddleslim.teachers.bert.reader.cls import *
from paddleslim.nas.darts.search_space import AdaBERTClassifier
from paddle.fluid.dygraph.base import to_variable
from tqdm import tqdm
import os
import pickle
import logging
from paddleslim.common import AvgrageMeter, get_logger
logger = get_logger(__name__, level=logging.INFO)
def valid_one_epoch(model, valid_loader, epoch, log_freq):
accs = AvgrageMeter()
ce_losses = AvgrageMeter()
model.student.eval()
step_id = 0
for valid_data in valid_loader():
        try:
            # DataParallel wraps the model in `_layers`; fall back to the
            # plain model when it is not wrapped.
            loss, acc, ce_loss, _, _ = model._layers.loss(valid_data, epoch)
        except AttributeError:
            loss, acc, ce_loss, _, _ = model.loss(valid_data, epoch)
batch_size = valid_data[0].shape[0]
ce_losses.update(ce_loss.numpy(), batch_size)
accs.update(acc.numpy(), batch_size)
step_id += 1
return ce_losses.avg[0], accs.avg[0]
def train_one_epoch(model, train_loader, optimizer, epoch, use_data_parallel,
log_freq):
total_losses = AvgrageMeter()
accs = AvgrageMeter()
ce_losses = AvgrageMeter()
kd_losses = AvgrageMeter()
model.student.train()
step_id = 0
for train_data in train_loader():
batch_size = train_data[0].shape[0]
if use_data_parallel:
total_loss, acc, ce_loss, kd_loss, _ = model._layers.loss(
train_data, epoch)
else:
total_loss, acc, ce_loss, kd_loss, _ = model.loss(train_data,
epoch)
if use_data_parallel:
total_loss = model.scale_loss(total_loss)
total_loss.backward()
model.apply_collective_grads()
else:
total_loss.backward()
optimizer.minimize(total_loss)
model.clear_gradients()
total_losses.update(total_loss.numpy(), batch_size)
accs.update(acc.numpy(), batch_size)
ce_losses.update(ce_loss.numpy(), batch_size)
kd_losses.update(kd_loss.numpy(), batch_size)
if step_id % log_freq == 0:
logger.info(
"Train Epoch {}, Step {}, Lr {:.6f} total_loss {:.6f}; ce_loss {:.6f}, kd_loss {:.6f}, train_acc {:.6f};".
format(epoch, step_id,
optimizer.current_step_lr(), total_losses.avg[0],
ce_losses.avg[0], kd_losses.avg[0], accs.avg[0]))
step_id += 1
def main():
# whether use multi-gpus
use_data_parallel = False
place = fluid.CUDAPlace(fluid.dygraph.parallel.Env(
).dev_id) if use_data_parallel else fluid.CUDAPlace(0)
BERT_BASE_PATH = "./data/pretrained_models/uncased_L-12_H-768_A-12"
vocab_path = BERT_BASE_PATH + "/vocab.txt"
do_lower_case = True
# augmented dataset nums
# num_samples = 8016987
max_seq_len = 128
batch_size = 192
hidden_size = 768
emb_size = 768
epoch = 80
log_freq = 10
task_name = 'mnli'
if task_name == 'mrpc':
data_dir = "./data/glue_data/MRPC/"
teacher_model_dir = "./data/teacher_model/mrpc"
num_samples = 3668
max_layer = 4
num_labels = 2
processor_func = MrpcProcessor
elif task_name == 'mnli':
data_dir = "./data/glue_data/MNLI/"
teacher_model_dir = "./data/teacher_model/steps_23000"
num_samples = 392702
max_layer = 8
num_labels = 3
processor_func = MnliProcessor
device_num = fluid.dygraph.parallel.Env().nranks
use_fixed_gumbel = True
train_phase = "train"
val_phase = "dev"
step_per_epoch = int(num_samples / (batch_size * device_num))
with fluid.dygraph.guard(place):
if use_fixed_gumbel:
# make sure gumbel arch is constant
np.random.seed(1)
fluid.default_main_program().random_seed = 1
model = AdaBERTClassifier(
num_labels,
n_layer=max_layer,
hidden_size=hidden_size,
task_name=task_name,
emb_size=emb_size,
teacher_model=teacher_model_dir,
data_dir=data_dir,
use_fixed_gumbel=use_fixed_gumbel)
learning_rate = fluid.dygraph.CosineDecay(2e-2, step_per_epoch, epoch)
model_parameters = []
for p in model.parameters():
if (p.name not in [a.name for a in model.arch_parameters()] and
p.name not in
[a.name for a in model.teacher.parameters()]):
model_parameters.append(p)
optimizer = fluid.optimizer.MomentumOptimizer(
learning_rate,
0.9,
regularization=fluid.regularizer.L2DecayRegularizer(3e-4),
parameter_list=model_parameters)
processor = processor_func(
data_dir=data_dir,
vocab_path=vocab_path,
max_seq_len=max_seq_len,
do_lower_case=do_lower_case,
in_tokens=False)
train_reader = processor.data_generator(
batch_size=batch_size,
phase=train_phase,
epoch=1,
dev_count=1,
shuffle=True)
dev_reader = processor.data_generator(
batch_size=batch_size,
phase=val_phase,
epoch=1,
dev_count=1,
shuffle=False)
if use_data_parallel:
train_reader = fluid.contrib.reader.distributed_batch_reader(
train_reader)
train_loader = fluid.io.DataLoader.from_generator(
capacity=128,
use_double_buffer=True,
iterable=True,
return_list=True)
dev_loader = fluid.io.DataLoader.from_generator(
capacity=128,
use_double_buffer=True,
iterable=True,
return_list=True)
train_loader.set_batch_generator(train_reader, places=place)
dev_loader.set_batch_generator(dev_reader, places=place)
if use_data_parallel:
strategy = fluid.dygraph.parallel.prepare_context()
model = fluid.dygraph.parallel.DataParallel(model, strategy)
best_valid_acc = 0
for epoch_id in range(epoch):
train_one_epoch(model, train_loader, optimizer, epoch_id,
use_data_parallel, log_freq)
loss, acc = valid_one_epoch(model, dev_loader, epoch_id, log_freq)
if acc > best_valid_acc:
best_valid_acc = acc
logger.info(
"dev set, ce_loss {:.6f}; acc {:.6f}, best_acc {:.6f};".format(
loss, acc, best_valid_acc))
if __name__ == '__main__':
main()
import numpy as np
from itertools import izip
import paddle.fluid as fluid
from paddleslim.teachers.bert.reader.cls import *
from paddleslim.nas.darts.search_space import AdaBERTClassifier
from paddle.fluid.dygraph.base import to_variable
from tqdm import tqdm
import os
import pickle
import logging
from paddleslim.common import AvgrageMeter, get_logger
logger = get_logger(__name__, level=logging.INFO)
def valid_one_epoch(model, valid_loader, epoch, log_freq):
accs = AvgrageMeter()
ce_losses = AvgrageMeter()
model.student.eval()
step_id = 0
for valid_data in valid_loader():
        try:
            # DataParallel wraps the model in `_layers`; fall back to the
            # plain model when it is not wrapped.
            loss, acc, ce_loss, _, _ = model._layers.loss(valid_data, epoch)
        except AttributeError:
            loss, acc, ce_loss, _, _ = model.loss(valid_data, epoch)
batch_size = valid_data[0].shape[0]
ce_losses.update(ce_loss.numpy(), batch_size)
accs.update(acc.numpy(), batch_size)
step_id += 1
return ce_losses.avg[0], accs.avg[0]
def train_one_epoch(model, train_loader, valid_loader, optimizer,
arch_optimizer, epoch, use_data_parallel, log_freq):
total_losses = AvgrageMeter()
accs = AvgrageMeter()
ce_losses = AvgrageMeter()
kd_losses = AvgrageMeter()
val_accs = AvgrageMeter()
model.student.train()
step_id = 0
for train_data, valid_data in izip(train_loader(), valid_loader()):
batch_size = train_data[0].shape[0]
# make sure arch on every gpu is same, otherwise an error will occurs
np.random.seed(step_id * 2 * (epoch + 1))
if use_data_parallel:
total_loss, acc, ce_loss, kd_loss, _ = model._layers.loss(
train_data, epoch)
else:
total_loss, acc, ce_loss, kd_loss, _ = model.loss(train_data,
epoch)
if use_data_parallel:
total_loss = model.scale_loss(total_loss)
total_loss.backward()
model.apply_collective_grads()
else:
total_loss.backward()
optimizer.minimize(total_loss)
model.clear_gradients()
total_losses.update(total_loss.numpy(), batch_size)
accs.update(acc.numpy(), batch_size)
ce_losses.update(ce_loss.numpy(), batch_size)
kd_losses.update(kd_loss.numpy(), batch_size)
# make sure arch on every gpu is same, otherwise an error will occurs
np.random.seed(step_id * 2 * (epoch + 1) + 1)
if use_data_parallel:
arch_loss, _, _, _, arch_logits = model._layers.loss(valid_data,
epoch)
else:
arch_loss, _, _, _, arch_logits = model.loss(valid_data, epoch)
if use_data_parallel:
arch_loss = model.scale_loss(arch_loss)
arch_loss.backward()
model.apply_collective_grads()
else:
arch_loss.backward()
arch_optimizer.minimize(arch_loss)
model.clear_gradients()
probs = fluid.layers.softmax(arch_logits[-1])
val_acc = fluid.layers.accuracy(input=probs, label=valid_data[4])
val_accs.update(val_acc.numpy(), batch_size)
if step_id % log_freq == 0:
logger.info(
"Train Epoch {}, Step {}, Lr {:.6f} total_loss {:.6f}; ce_loss {:.6f}, kd_loss {:.6f}, train_acc {:.6f}, search_valid_acc {:.6f};".
format(epoch, step_id,
optimizer.current_step_lr(), total_losses.avg[
0], ce_losses.avg[0], kd_losses.avg[0], accs.avg[0],
val_accs.avg[0]))
step_id += 1
def main():
# whether use multi-gpus
use_data_parallel = False
place = fluid.CUDAPlace(fluid.dygraph.parallel.Env(
).dev_id) if use_data_parallel else fluid.CUDAPlace(0)
BERT_BASE_PATH = "./data/pretrained_models/uncased_L-12_H-768_A-12"
vocab_path = BERT_BASE_PATH + "/vocab.txt"
data_dir = "./data/glue_data/MNLI/"
teacher_model_dir = "./data/teacher_model/steps_23000"
do_lower_case = True
num_samples = 392702
# augmented dataset nums
# num_samples = 8016987
max_seq_len = 128
batch_size = 128
hidden_size = 768
emb_size = 768
max_layer = 8
epoch = 80
log_freq = 10
device_num = fluid.dygraph.parallel.Env().nranks
use_fixed_gumbel = False
train_phase = "search_train"
val_phase = "search_valid"
step_per_epoch = int(num_samples * 0.5 / ((batch_size) * device_num))
with fluid.dygraph.guard(place):
model = AdaBERTClassifier(
3,
n_layer=max_layer,
hidden_size=hidden_size,
emb_size=emb_size,
teacher_model=teacher_model_dir,
data_dir=data_dir,
use_fixed_gumbel=use_fixed_gumbel)
learning_rate = fluid.dygraph.CosineDecay(2e-2, step_per_epoch, epoch)
model_parameters = []
for p in model.parameters():
if (p.name not in [a.name for a in model.arch_parameters()] and
p.name not in
[a.name for a in model.teacher.parameters()]):
model_parameters.append(p)
optimizer = fluid.optimizer.MomentumOptimizer(
learning_rate,
0.9,
regularization=fluid.regularizer.L2DecayRegularizer(3e-4),
parameter_list=model_parameters)
arch_optimizer = fluid.optimizer.Adam(
3e-4,
0.5,
0.999,
regularization=fluid.regularizer.L2Decay(1e-3),
parameter_list=model.arch_parameters())
processor = MnliProcessor(
data_dir=data_dir,
vocab_path=vocab_path,
max_seq_len=max_seq_len,
do_lower_case=do_lower_case,
in_tokens=False)
train_reader = processor.data_generator(
batch_size=batch_size,
phase=train_phase,
epoch=1,
dev_count=1,
shuffle=True)
valid_reader = processor.data_generator(
batch_size=batch_size,
phase=val_phase,
epoch=1,
dev_count=1,
shuffle=True)
dev_reader = processor.data_generator(
batch_size=batch_size,
phase="dev",
epoch=1,
dev_count=1,
shuffle=False)
if use_data_parallel:
train_reader = fluid.contrib.reader.distributed_batch_reader(
train_reader)
valid_reader = fluid.contrib.reader.distributed_batch_reader(
valid_reader)
train_loader = fluid.io.DataLoader.from_generator(
capacity=128,
use_double_buffer=True,
iterable=True,
return_list=True)
valid_loader = fluid.io.DataLoader.from_generator(
capacity=128,
use_double_buffer=True,
iterable=True,
return_list=True)
dev_loader = fluid.io.DataLoader.from_generator(
capacity=128,
use_double_buffer=True,
iterable=True,
return_list=True)
train_loader.set_batch_generator(train_reader, places=place)
valid_loader.set_batch_generator(valid_reader, places=place)
dev_loader.set_batch_generator(dev_reader, places=place)
if use_data_parallel:
strategy = fluid.dygraph.parallel.prepare_context()
model = fluid.dygraph.parallel.DataParallel(model, strategy)
for epoch_id in range(epoch):
train_one_epoch(model, train_loader, valid_loader, optimizer,
arch_optimizer, epoch_id, use_data_parallel,
log_freq)
loss, acc = valid_one_epoch(model, dev_loader, epoch_id, log_freq)
logger.info("dev set, ce_loss {:.6f}; acc: {:.6f};".format(loss,
acc))
if use_data_parallel:
print(model._layers.student._encoder.alphas.numpy())
else:
print(model.student._encoder.alphas.numpy())
print("=" * 100)
if __name__ == '__main__':
main()
import paddle.fluid as fluid
from paddleslim.teachers.bert import BERTClassifier
place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id)
with fluid.dygraph.guard(place):
bert = BERTClassifier(3)
bert.fit("./data/glue_data/MNLI/",
5,
batch_size=32,
use_data_parallel=True,
learning_rate=0.00005,
save_steps=1000)
# Differentiable Architecture Search (DARTS) Demo
This demo shows how to run differentiable architecture search with PaddlePaddle. Both [DARTS](https://arxiv.org/abs/1806.09055) and [PC-DARTS](https://arxiv.org/abs/1907.05737) are supported out of the box, and the code can be modified to run other differentiable architecture search algorithms.
The directory layout of this demo is:
```
├── genotypes.py       Genotypes of the architectures found by the search
├── model.py           Builds the sub-network found by the search
├── model_search.py    Builds the super-network used during the search
├── operations.py      Candidate operations used in the search
├── reader.py          Data reading and augmentation
├── search.py          Entry point for architecture search
├── train.py           Entry point for evaluation training on CIFAR10
├── train_imagenet.py  Entry point for evaluation training on ImageNet
├── visualize.py       Entry point for architecture visualization
```
## Dependencies
PaddlePaddle >= 1.8.0, PaddleSlim >= 1.1.0, graphviz >= 0.11.1
## Datasets
This demo searches on `CIFAR10`; the searched architecture can be evaluated on either `CIFAR10` or `ImageNet`.
`CIFAR10` is downloaded automatically during search or evaluation. `ImageNet` must be downloaded manually; see this [tutorial](https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification#%E6%95%B0%E6%8D%AE%E5%87%86%E5%A4%87).
## Architecture Search
The search supports the DARTS first- and second-order approximations as well as the PC-DARTS method:
``` bash
python search.py                 # DARTS, first-order approximation
python search.py --unrolled=True # DARTS, second-order approximation
python search.py --method='PC-DARTS' --batch_size=256 --learning_rate=0.1 --arch_learning_rate=6e-4 --epochs_no_archopt=15 # PC-DARTS
```
If you run in a Docker environment, make sure there is enough shared memory for the multi-process dataloader; if you hit shared-memory issues, set `--use_multiprocess=False`.
Multi-GPU search is also supported. For example, with 4 GPUs (GPU ids 0-3), launch:
```bash
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog search.py --use_data_parallel 1
```
With multi-GPU training the total batch size grows by a factor of n, where n is the number of GPUs; to match single-GPU accuracy, scale the initial learning rate by n accordingly.
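This is the standard linear scaling rule; a one-line sketch (the base learning rate of 0.025 below is illustrative, not a value taken from the scripts):

```python
def scale_lr(base_lr, num_cards):
    # The total batch size grows linearly with the number of GPUs, so
    # the initial learning rate is scaled by the same factor.
    return base_lr * num_cards

assert scale_lr(0.025, 4) == 0.1  # 4 cards -> 4x the single-card LR
```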
Figure 1 shows how the architecture changes over search epochs. Note that the accuracy (Acc) in the figure is not the final accuracy of that architecture; to obtain the best accuracy of a given architecture, run evaluation training on its genotype.
![networks](images/networks.gif)
<p align="center">
Figure 1: Architecture evolution while searching on CIFAR10; the upper half is the reduction cell, the lower half is the normal cell
</p>
The Genotypes found by the three search methods have been added to genotypes.py: `DARTS_V1`, `DARTS_V2`, and `PC_DARTS` are the architectures found with the DARTS first-order approximation, the DARTS second-order approximation, and PC-DARTS, respectively.
## Evaluation Training
After obtaining a searched Genotype, train it from scratch to evaluate its real performance on a given dataset:
```bash
python train.py --arch='PC_DARTS'          # evaluation training on CIFAR10
python train_imagenet.py --arch='PC_DARTS' # evaluation training on ImageNet
```
Multi-GPU evaluation training is also supported. For example, with 4 GPUs (GPU ids 0-3), launch:
```bash
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train.py --use_data_parallel 1 --arch='DARTS_V2'
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train_imagenet.py --use_data_parallel 1 --arch='DARTS_V2'
```
Again, multi-GPU training multiplies the total batch size by n (the number of GPUs); scale the initial learning rate by n to match single-GPU accuracy.
Evaluation-training results for the searched `DARTS_V1`, `DARTS_V2`, and `PC-DARTS` architectures:
| Architecture | Dataset | Accuracy |
| --------------------------- | -------- | --------------- |
| DARTS_V1 | CIFAR10 | 97.01% |
| DARTS (first-order search, paper) | CIFAR10 | 97.00$\pm$0.14% |
| DARTS_V2 | CIFAR10 | 97.26% |
| DARTS (second-order search, paper) | CIFAR10 | 97.24$\pm$0.09% |
| DARTS_V2 | ImageNet | 74.12% |
| DARTS (second-order search, paper) | ImageNet | 73.30% |
| PC-DARTS | CIFAR10 | 97.41% |
| PC-DARTS (paper) | CIFAR10 | 97.43$\pm$0.07% |
## Custom Datasets and Search Spaces
### Changing the dataset
This demo searches on CIFAR10 by default. To use a custom dataset, only a small change to reader.py is needed:
```python
def train_search(batch_size, train_portion, is_shuffle, args):
    datasets = cifar10_reader(  # replace this call
paddle.dataset.common.download(CIFAR10_URL, 'cifar', CIFAR10_MD5),
'data_batch', is_shuffle, args)
```
Simply replace the default `cifar10_reader` with a reader for your dataset.
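A replacement reader only needs to follow the same generator protocol as `cifar10_reader`: a callable that returns a generator yielding `(image, label)` samples. A minimal sketch with random data (the function name, shapes, and dtypes are illustrative assumptions, not part of the demo):

```python
import numpy as np

def my_dataset_reader(num_samples=4):
    """Stand-in for cifar10_reader: returns a callable that yields
    (image, label) pairs -- CHW float32 images and integer labels."""
    def reader():
        for _ in range(num_samples):
            image = np.random.rand(3, 32, 32).astype("float32")  # fake image
            label = int(np.random.randint(0, 10))                # fake label
            yield image, label
    return reader

# The search code iterates the reader like any paddle-style generator.
samples = list(my_dataset_reader()())
```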
### Changing the search space
This demo provides both DARTS and PC-DARTS, defined in model_search.py.
You can customize the search space by editing `class Network` in model_search.py directly, then search it with paddleslim.nas.DARTSearch.
After the search finishes, make the corresponding changes in model.py for evaluation training.
## Visualizing Searched Architectures
Visualize a searched Genotype with:
```bash
python visualize.py PC_DARTS
```
Here `PC_DARTS` names a Genotype, which must be added to genotypes.py beforehand.
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from collections import namedtuple
Genotype = namedtuple('Genotype', 'normal normal_concat reduce reduce_concat')
PRIMITIVES = [
'none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3',
'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'
]
DARTS_V1 = Genotype(
normal=[('sep_conv_5x5', 0), ('dil_conv_3x3', 1), ('sep_conv_3x3', 2),
('sep_conv_5x5', 0), ('sep_conv_5x5', 0), ('dil_conv_3x3', 3),
('sep_conv_3x3', 0), ('max_pool_3x3', 1)],
normal_concat=range(2, 6),
reduce=[('max_pool_3x3', 1), ('max_pool_3x3', 0), ('dil_conv_3x3', 2),
('sep_conv_5x5', 0), ('max_pool_3x3', 0), ('dil_conv_3x3', 3),
('avg_pool_3x3', 3), ('avg_pool_3x3', 4)],
reduce_concat=range(2, 6))
DARTS_V2 = Genotype(
normal=[('dil_conv_3x3', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 0),
('sep_conv_3x3', 1), ('sep_conv_3x3', 1), ('sep_conv_3x3', 0),
('skip_connect', 0), ('sep_conv_3x3', 1)],
normal_concat=range(2, 6),
reduce=[('skip_connect', 1), ('max_pool_3x3', 0), ('max_pool_3x3', 1),
('skip_connect', 2), ('skip_connect', 2), ('dil_conv_5x5', 3),
('skip_connect', 2), ('max_pool_3x3', 1)],
reduce_concat=range(2, 6))
PC_DARTS = Genotype(
normal=[('sep_conv_3x3', 1), ('skip_connect', 0), ('sep_conv_5x5', 0),
('dil_conv_5x5', 2), ('sep_conv_5x5', 0), ('sep_conv_3x3', 2),
('sep_conv_3x3', 0), ('dil_conv_3x3', 1)],
normal_concat=range(2, 6),
reduce=[('avg_pool_3x3', 0), ('sep_conv_3x3', 1), ('skip_connect', 2),
('avg_pool_3x3', 0), ('dil_conv_5x5', 3), ('skip_connect', 2),
('skip_connect', 2), ('avg_pool_3x3', 0)],
reduce_concat=range(2, 6))
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import paddle.fluid as fluid
from paddle.fluid.param_attr import ParamAttr
from paddle.fluid.initializer import ConstantInitializer, MSRAInitializer
from paddle.fluid.dygraph.nn import Conv2D, Pool2D, BatchNorm, Linear
from paddle.fluid.dygraph.base import to_variable
from genotypes import PRIMITIVES
from genotypes import Genotype
from operations import *
class ConvBN(fluid.dygraph.Layer):
def __init__(self, c_curr, c_out, kernel_size, padding, stride, name=None):
super(ConvBN, self).__init__()
self.conv = Conv2D(
num_channels=c_curr,
num_filters=c_out,
filter_size=kernel_size,
stride=stride,
padding=padding,
param_attr=fluid.ParamAttr(
name=name + "_conv" if name is not None else None,
initializer=MSRAInitializer()),
bias_attr=False)
self.bn = BatchNorm(
num_channels=c_out,
param_attr=fluid.ParamAttr(
name=name + "_bn_scale" if name is not None else None,
initializer=ConstantInitializer(value=1)),
bias_attr=fluid.ParamAttr(
name=name + "_bn_offset" if name is not None else None,
initializer=ConstantInitializer(value=0)),
moving_mean_name=name + "_bn_mean" if name is not None else None,
moving_variance_name=name + "_bn_variance"
if name is not None else None)
def forward(self, x):
conv = self.conv(x)
bn = self.bn(conv)
return bn
class Classifier(fluid.dygraph.Layer):
def __init__(self, input_dim, num_classes, name=None):
super(Classifier, self).__init__()
self.pool2d = Pool2D(pool_type='avg', global_pooling=True)
self.fc = Linear(
input_dim=input_dim,
output_dim=num_classes,
param_attr=fluid.ParamAttr(
name=name + "_fc_weights" if name is not None else None,
initializer=MSRAInitializer()),
bias_attr=fluid.ParamAttr(
name=name + "_fc_bias" if name is not None else None,
initializer=MSRAInitializer()))
def forward(self, x):
x = self.pool2d(x)
x = fluid.layers.squeeze(x, axes=[2, 3])
out = self.fc(x)
return out
def drop_path(x, drop_prob):
if drop_prob > 0:
keep_prob = 1. - drop_prob
mask = 1 - np.random.binomial(
1, drop_prob, size=[x.shape[0]]).astype(np.float32)
mask = to_variable(mask)
x = fluid.layers.elementwise_mul(x / keep_prob, mask, axis=0)
return x
class Cell(fluid.dygraph.Layer):
def __init__(self, genotype, c_prev_prev, c_prev, c_curr, reduction,
reduction_prev):
super(Cell, self).__init__()
print(c_prev_prev, c_prev, c_curr)
if reduction_prev:
self.preprocess0 = FactorizedReduce(c_prev_prev, c_curr)
else:
self.preprocess0 = ReLUConvBN(c_prev_prev, c_curr, 1, 1, 0)
self.preprocess1 = ReLUConvBN(c_prev, c_curr, 1, 1, 0)
if reduction:
op_names, indices = zip(*genotype.reduce)
concat = genotype.reduce_concat
else:
op_names, indices = zip(*genotype.normal)
concat = genotype.normal_concat
multiplier = len(concat)
self._multiplier = multiplier
self._compile(c_curr, op_names, indices, multiplier, reduction)
def _compile(self, c_curr, op_names, indices, multiplier, reduction):
assert len(op_names) == len(indices)
self._steps = len(op_names) // 2
ops = []
edge_index = 0
for op_name, index in zip(op_names, indices):
stride = 2 if reduction and index < 2 else 1
op = OPS[op_name](c_curr, stride, True)
ops += [op]
edge_index += 1
self._ops = fluid.dygraph.LayerList(ops)
self._indices = indices
def forward(self, s0, s1, drop_prob, training):
s0 = self.preprocess0(s0)
s1 = self.preprocess1(s1)
states = [s0, s1]
for i in range(self._steps):
h1 = states[self._indices[2 * i]]
h2 = states[self._indices[2 * i + 1]]
op1 = self._ops[2 * i]
op2 = self._ops[2 * i + 1]
h1 = op1(h1)
h2 = op2(h2)
if training and drop_prob > 0.:
if not isinstance(op1, Identity):
h1 = drop_path(h1, drop_prob)
if not isinstance(op2, Identity):
h2 = drop_path(h2, drop_prob)
states += [h1 + h2]
out = fluid.layers.concat(input=states[-self._multiplier:], axis=1)
return out
class AuxiliaryHeadCIFAR(fluid.dygraph.Layer):
def __init__(self, C, num_classes):
super(AuxiliaryHeadCIFAR, self).__init__()
self.avgpool = Pool2D(
pool_size=5, pool_stride=3, pool_padding=0, pool_type='avg')
self.conv_bn1 = ConvBN(
c_curr=C,
c_out=128,
kernel_size=1,
padding=0,
stride=1,
name='aux_conv_bn1')
self.conv_bn2 = ConvBN(
c_curr=128,
c_out=768,
kernel_size=2,
padding=0,
stride=1,
name='aux_conv_bn2')
self.classifier = Classifier(768, num_classes, 'aux')
def forward(self, x):
x = fluid.layers.relu(x)
x = self.avgpool(x)
conv1 = self.conv_bn1(x)
conv1 = fluid.layers.relu(conv1)
conv2 = self.conv_bn2(conv1)
conv2 = fluid.layers.relu(conv2)
out = self.classifier(conv2)
return out
class NetworkCIFAR(fluid.dygraph.Layer):
def __init__(self, C, num_classes, layers, auxiliary, genotype):
super(NetworkCIFAR, self).__init__()
self._layers = layers
self._auxiliary = auxiliary
stem_multiplier = 3
c_curr = stem_multiplier * C
self.stem = ConvBN(
c_curr=3, c_out=c_curr, kernel_size=3, padding=1, stride=1)
c_prev_prev, c_prev, c_curr = c_curr, c_curr, C
cells = []
reduction_prev = False
for i in range(layers):
if i in [layers // 3, 2 * layers // 3]:
c_curr *= 2
reduction = True
else:
reduction = False
cell = Cell(genotype, c_prev_prev, c_prev, c_curr, reduction,
reduction_prev)
reduction_prev = reduction
cells += [cell]
c_prev_prev, c_prev = c_prev, cell._multiplier * c_curr
if i == 2 * layers // 3:
c_to_auxiliary = c_prev
self.cells = fluid.dygraph.LayerList(cells)
if auxiliary:
self.auxiliary_head = AuxiliaryHeadCIFAR(c_to_auxiliary,
num_classes)
self.classifier = Classifier(c_prev, num_classes)
def forward(self, input, drop_path_prob, training):
logits_aux = None
s0 = s1 = self.stem(input)
for i, cell in enumerate(self.cells):
s0, s1 = s1, cell(s0, s1, drop_path_prob, training)
if i == 2 * self._layers // 3:
if self._auxiliary and training:
logits_aux = self.auxiliary_head(s1)
logits = self.classifier(s1)
return logits, logits_aux
class AuxiliaryHeadImageNet(fluid.dygraph.Layer):
def __init__(self, C, num_classes):
super(AuxiliaryHeadImageNet, self).__init__()
self.avgpool = Pool2D(
pool_size=5, pool_stride=2, pool_padding=0, pool_type='avg')
self.conv_bn1 = ConvBN(
c_curr=C,
c_out=128,
kernel_size=1,
padding=0,
stride=1,
name='aux_conv_bn1')
self.conv_bn2 = ConvBN(
c_curr=128,
c_out=768,
kernel_size=2,
padding=0,
stride=1,
name='aux_conv_bn2')
self.classifier = Classifier(768, num_classes, 'aux')
def forward(self, x):
x = fluid.layers.relu(x)
x = self.avgpool(x)
conv1 = self.conv_bn1(x)
conv1 = fluid.layers.relu(conv1)
conv2 = self.conv_bn2(conv1)
conv2 = fluid.layers.relu(conv2)
out = self.classifier(conv2)
return out
class NetworkImageNet(fluid.dygraph.Layer):
def __init__(self, C, num_classes, layers, auxiliary, genotype):
super(NetworkImageNet, self).__init__()
self._layers = layers
self._auxiliary = auxiliary
self.stem_a0 = ConvBN(
c_curr=3, c_out=C // 2, kernel_size=3, padding=1, stride=2)
self.stem_a1 = ConvBN(
c_curr=C // 2, c_out=C, kernel_size=3, padding=1, stride=2)
self.stem_b = ConvBN(
c_curr=C, c_out=C, kernel_size=3, padding=1, stride=2)
c_prev_prev, c_prev, c_curr = C, C, C
cells = []
reduction_prev = True
for i in range(layers):
if i in [layers // 3, 2 * layers // 3]:
c_curr *= 2
reduction = True
else:
reduction = False
cell = Cell(genotype, c_prev_prev, c_prev, c_curr, reduction,
reduction_prev)
reduction_prev = reduction
cells += [cell]
c_prev_prev, c_prev = c_prev, cell._multiplier * c_curr
if i == 2 * layers // 3:
c_to_auxiliary = c_prev
self.cells = fluid.dygraph.LayerList(cells)
if auxiliary:
self.auxiliary_head = AuxiliaryHeadImageNet(c_to_auxiliary,
num_classes)
self.classifier = Classifier(c_prev, num_classes)
def forward(self, input, training):
logits_aux = None
s0 = self.stem_a0(input)
s0 = fluid.layers.relu(s0)
s0 = self.stem_a1(s0)
s1 = fluid.layers.relu(s0)
s1 = self.stem_b(s1)
for i, cell in enumerate(self.cells):
s0, s1 = s1, cell(s0, s1, 0, training)
if i == 2 * self._layers // 3:
if self._auxiliary and training:
logits_aux = self.auxiliary_head(s1)
logits = self.classifier(s1)
return logits, logits_aux
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import paddle.fluid as fluid
from paddle.fluid.param_attr import ParamAttr
from paddle.fluid.initializer import NormalInitializer, MSRAInitializer, ConstantInitializer
from paddle.fluid.dygraph.nn import Conv2D, Pool2D, BatchNorm, Linear
from paddle.fluid.dygraph.base import to_variable
from genotypes import PRIMITIVES
from operations import *
def channel_shuffle(x, groups):
batchsize, num_channels, height, width = x.shape
channels_per_group = num_channels // groups
# reshape
x = fluid.layers.reshape(
x, [batchsize, groups, channels_per_group, height, width])
x = fluid.layers.transpose(x, [0, 2, 1, 3, 4])
# flatten
x = fluid.layers.reshape(x, [batchsize, num_channels, height, width])
return x
class MixedOp(fluid.dygraph.Layer):
def __init__(self, c_cur, stride, method):
super(MixedOp, self).__init__()
self._method = method
self._k = 4 if self._method == "PC-DARTS" else 1
self.mp = Pool2D(
pool_size=2,
pool_stride=2,
pool_type='max', )
ops = []
for primitive in PRIMITIVES:
op = OPS[primitive](c_cur // self._k, stride, False)
if 'pool' in primitive:
gama = ParamAttr(
initializer=fluid.initializer.Constant(value=1),
trainable=False)
beta = ParamAttr(
initializer=fluid.initializer.Constant(value=0),
trainable=False)
BN = BatchNorm(
c_cur // self._k, param_attr=gama, bias_attr=beta)
op = fluid.dygraph.Sequential(op, BN)
ops.append(op)
self._ops = fluid.dygraph.LayerList(ops)
def forward(self, x, weights):
if self._method == "PC-DARTS":
dim_2 = x.shape[1]
xtemp = x[:, :dim_2 // self._k, :, :]
xtemp2 = x[:, dim_2 // self._k:, :, :]
temp1 = fluid.layers.sums(
[weights[i] * op(xtemp) for i, op in enumerate(self._ops)])
if temp1.shape[2] == x.shape[2]:
out = fluid.layers.concat([temp1, xtemp2], axis=1)
else:
out = fluid.layers.concat([temp1, self.mp(xtemp2)], axis=1)
out = channel_shuffle(out, self._k)
else:
out = fluid.layers.sums(
[weights[i] * op(x) for i, op in enumerate(self._ops)])
return out
class Cell(fluid.dygraph.Layer):
def __init__(self, steps, multiplier, c_prev_prev, c_prev, c_cur,
reduction, reduction_prev, method):
super(Cell, self).__init__()
self.reduction = reduction
if reduction_prev:
self.preprocess0 = FactorizedReduce(c_prev_prev, c_cur, False)
else:
self.preprocess0 = ReLUConvBN(c_prev_prev, c_cur, 1, 1, 0, False)
self.preprocess1 = ReLUConvBN(c_prev, c_cur, 1, 1, 0, affine=False)
self._steps = steps
self._multiplier = multiplier
self._method = method
ops = []
for i in range(self._steps):
for j in range(2 + i):
stride = 2 if reduction and j < 2 else 1
op = MixedOp(c_cur, stride, method)
ops.append(op)
self._ops = fluid.dygraph.LayerList(ops)
def forward(self, s0, s1, weights, weights2=None):
s0 = self.preprocess0(s0)
s1 = self.preprocess1(s1)
states = [s0, s1]
offset = 0
for i in range(self._steps):
if self._method == "PC-DARTS":
s = fluid.layers.sums([
weights2[offset + j] *
self._ops[offset + j](h, weights[offset + j])
for j, h in enumerate(states)
])
else:
s = fluid.layers.sums([
self._ops[offset + j](h, weights[offset + j])
for j, h in enumerate(states)
])
offset += len(states)
states.append(s)
out = fluid.layers.concat(input=states[-self._multiplier:], axis=1)
return out
class Network(fluid.dygraph.Layer):
def __init__(self,
c_in,
num_classes,
layers,
method,
steps=4,
multiplier=4,
stem_multiplier=3):
super(Network, self).__init__()
self._c_in = c_in
self._num_classes = num_classes
self._layers = layers
self._steps = steps
self._multiplier = multiplier
self._primitives = PRIMITIVES
self._method = method
c_cur = stem_multiplier * c_in
self.stem = fluid.dygraph.Sequential(
Conv2D(
num_channels=3,
num_filters=c_cur,
filter_size=3,
padding=1,
param_attr=fluid.ParamAttr(initializer=MSRAInitializer()),
bias_attr=False),
BatchNorm(
num_channels=c_cur,
param_attr=fluid.ParamAttr(
initializer=ConstantInitializer(value=1)),
bias_attr=fluid.ParamAttr(
initializer=ConstantInitializer(value=0))))
c_prev_prev, c_prev, c_cur = c_cur, c_cur, c_in
cells = []
reduction_prev = False
for i in range(layers):
if i in [layers // 3, 2 * layers // 3]:
c_cur *= 2
reduction = True
else:
reduction = False
cell = Cell(steps, multiplier, c_prev_prev, c_prev, c_cur,
reduction, reduction_prev, method)
reduction_prev = reduction
cells.append(cell)
c_prev_prev, c_prev = c_prev, multiplier * c_cur
self.cells = fluid.dygraph.LayerList(cells)
self.global_pooling = Pool2D(pool_type='avg', global_pooling=True)
self.classifier = Linear(
input_dim=c_prev,
output_dim=num_classes,
param_attr=ParamAttr(initializer=MSRAInitializer()),
bias_attr=ParamAttr(initializer=MSRAInitializer()))
self._initialize_alphas()
def forward(self, input):
s0 = s1 = self.stem(input)
weights2 = None
for i, cell in enumerate(self.cells):
if cell.reduction:
weights = fluid.layers.softmax(self.alphas_reduce)
if self._method == "PC-DARTS":
n = 3
start = 2
weights2 = fluid.layers.softmax(self.betas_reduce[0:2])
for i in range(self._steps - 1):
end = start + n
tw2 = fluid.layers.softmax(self.betas_reduce[start:
end])
start = end
n += 1
weights2 = fluid.layers.concat([weights2, tw2])
else:
weights = fluid.layers.softmax(self.alphas_normal)
if self._method == "PC-DARTS":
n = 3
start = 2
weights2 = fluid.layers.softmax(self.betas_normal[0:2])
for i in range(self._steps - 1):
end = start + n
tw2 = fluid.layers.softmax(self.betas_normal[start:
end])
start = end
n += 1
weights2 = fluid.layers.concat([weights2, tw2])
s0, s1 = s1, cell(s0, s1, weights, weights2)
out = self.global_pooling(s1)
out = fluid.layers.squeeze(out, axes=[2, 3])
logits = self.classifier(out)
return logits
def _loss(self, input, target):
logits = self(input)
loss = fluid.layers.reduce_mean(
fluid.layers.softmax_with_cross_entropy(logits, target))
return loss
def new(self):
model_new = Network(self._c_in, self._num_classes, self._layers,
self._method)
return model_new
def _initialize_alphas(self):
k = sum(1 for i in range(self._steps) for n in range(2 + i))
num_ops = len(self._primitives)
self.alphas_normal = fluid.layers.create_parameter(
shape=[k, num_ops],
dtype="float32",
default_initializer=NormalInitializer(
loc=0.0, scale=1e-3))
self.alphas_reduce = fluid.layers.create_parameter(
shape=[k, num_ops],
dtype="float32",
default_initializer=NormalInitializer(
loc=0.0, scale=1e-3))
self._arch_parameters = [
self.alphas_normal,
self.alphas_reduce,
]
if self._method == "PC-DARTS":
self.betas_normal = fluid.layers.create_parameter(
shape=[k],
dtype="float32",
default_initializer=NormalInitializer(
loc=0.0, scale=1e-3))
self.betas_reduce = fluid.layers.create_parameter(
shape=[k],
dtype="float32",
default_initializer=NormalInitializer(
loc=0.0, scale=1e-3))
self._arch_parameters += [self.betas_normal, self.betas_reduce]
def arch_parameters(self):
return self._arch_parameters
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle.fluid as fluid
from paddle.fluid.dygraph.nn import Conv2D, Pool2D, BatchNorm
from paddle.fluid.param_attr import ParamAttr
from paddle.fluid.initializer import ConstantInitializer, MSRAInitializer
OPS = {
'none':
lambda C, stride, affine: Zero(stride),
'avg_pool_3x3':
lambda C, stride, affine: Pool2D(
pool_size=3,
pool_type="avg",
pool_stride=stride,
pool_padding=1),
'max_pool_3x3':
lambda C, stride, affine: Pool2D(
pool_size=3,
pool_type="max",
pool_stride=stride,
pool_padding=1),
'skip_connect':
lambda C, stride, affine: Identity()
if stride == 1 else FactorizedReduce(C, C, affine),
'sep_conv_3x3':
lambda C, stride, affine: SepConv(C, C, 3, stride, 1,
affine),
'sep_conv_5x5':
lambda C, stride, affine: SepConv(C, C, 5, stride, 2,
affine),
'sep_conv_7x7':
lambda C, stride, affine: SepConv(C, C, 7, stride, 3,
affine),
'dil_conv_3x3':
lambda C, stride, affine: DilConv(C, C, 3, stride, 2,
2, affine),
'dil_conv_5x5':
lambda C, stride, affine: DilConv(C, C, 5, stride, 4,
2, affine),
'conv_7x1_1x7':
lambda C, stride, affine: Conv_7x1_1x7(
C, C, stride, affine),
}
def bn_param_config(affine=False):
gama = ParamAttr(
initializer=ConstantInitializer(value=1), trainable=affine)
beta = ParamAttr(
initializer=ConstantInitializer(value=0), trainable=affine)
return gama, beta
class Zero(fluid.dygraph.Layer):
def __init__(self, stride):
super(Zero, self).__init__()
self.stride = stride
self.pool = Pool2D(pool_size=1, pool_stride=2)
def forward(self, x):
pooled = self.pool(x)
x = fluid.layers.zeros_like(
x) if self.stride == 1 else fluid.layers.zeros_like(pooled)
return x
class Identity(fluid.dygraph.Layer):
def __init__(self):
super(Identity, self).__init__()
def forward(self, x):
return x
class FactorizedReduce(fluid.dygraph.Layer):
def __init__(self, c_in, c_out, affine=True):
super(FactorizedReduce, self).__init__()
assert c_out % 2 == 0
self.conv1 = Conv2D(
num_channels=c_in,
num_filters=c_out // 2,
filter_size=1,
stride=2,
padding=0,
param_attr=fluid.ParamAttr(initializer=MSRAInitializer()),
bias_attr=False)
self.conv2 = Conv2D(
num_channels=c_in,
num_filters=c_out // 2,
filter_size=1,
stride=2,
padding=0,
param_attr=fluid.ParamAttr(initializer=MSRAInitializer()),
bias_attr=False)
gama, beta = bn_param_config(affine)
self.bn = BatchNorm(
num_channels=c_out, param_attr=gama, bias_attr=beta)
def forward(self, x):
x = fluid.layers.relu(x)
out = fluid.layers.concat(
input=[self.conv1(x), self.conv2(x[:, :, 1:, 1:])], axis=1)
out = self.bn(out)
return out
class SepConv(fluid.dygraph.Layer):
def __init__(self, c_in, c_out, kernel_size, stride, padding, affine=True):
super(SepConv, self).__init__()
self.conv1 = Conv2D(
num_channels=c_in,
num_filters=c_in,
filter_size=kernel_size,
stride=stride,
padding=padding,
groups=c_in,
use_cudnn=False,
param_attr=fluid.ParamAttr(initializer=MSRAInitializer()),
bias_attr=False)
self.conv2 = Conv2D(
num_channels=c_in,
num_filters=c_in,
filter_size=1,
stride=1,
padding=0,
param_attr=fluid.ParamAttr(initializer=MSRAInitializer()),
bias_attr=False)
gama, beta = bn_param_config(affine)
self.bn1 = BatchNorm(
num_channels=c_in, param_attr=gama, bias_attr=beta)
self.conv3 = Conv2D(
num_channels=c_in,
num_filters=c_in,
filter_size=kernel_size,
stride=1,
padding=padding,
groups=c_in,
use_cudnn=False,
param_attr=fluid.ParamAttr(initializer=MSRAInitializer()),
bias_attr=False)
self.conv4 = Conv2D(
num_channels=c_in,
num_filters=c_out,
filter_size=1,
stride=1,
padding=0,
param_attr=fluid.ParamAttr(initializer=MSRAInitializer()),
bias_attr=False)
gama, beta = bn_param_config(affine)
self.bn2 = BatchNorm(
num_channels=c_out, param_attr=gama, bias_attr=beta)
def forward(self, x):
x = fluid.layers.relu(x)
x = self.conv1(x)
x = self.conv2(x)
bn1 = self.bn1(x)
x = fluid.layers.relu(bn1)
x = self.conv3(x)
x = self.conv4(x)
bn2 = self.bn2(x)
return bn2
class DilConv(fluid.dygraph.Layer):
def __init__(self,
c_in,
c_out,
kernel_size,
stride,
padding,
dilation,
affine=True):
super(DilConv, self).__init__()
self.conv1 = Conv2D(
num_channels=c_in,
num_filters=c_in,
filter_size=kernel_size,
stride=stride,
padding=padding,
dilation=dilation,
groups=c_in,
use_cudnn=False,
param_attr=fluid.ParamAttr(initializer=MSRAInitializer()),
bias_attr=False)
self.conv2 = Conv2D(
num_channels=c_in,
num_filters=c_out,
filter_size=1,
padding=0,
param_attr=fluid.ParamAttr(initializer=MSRAInitializer()),
bias_attr=False)
gama, beta = bn_param_config(affine)
self.bn1 = BatchNorm(
num_channels=c_out, param_attr=gama, bias_attr=beta)
def forward(self, x):
x = fluid.layers.relu(x)
x = self.conv1(x)
x = self.conv2(x)
out = self.bn1(x)
return out
class Conv_7x1_1x7(fluid.dygraph.Layer):
def __init__(self, c_in, c_out, stride, affine=True):
super(Conv_7x1_1x7, self).__init__()
self.conv1 = Conv2D(
num_channels=c_in,
num_filters=c_out,
filter_size=(1, 7),
padding=(0, 3),
param_attr=fluid.ParamAttr(initializer=MSRAInitializer()),
bias_attr=False)
self.conv2 = Conv2D(
num_channels=c_in,
num_filters=c_out,
filter_size=(7, 1),
padding=(3, 0),
param_attr=fluid.ParamAttr(initializer=MSRAInitializer()),
bias_attr=False)
gama, beta = bn_param_config(affine)
self.bn1 = BatchNorm(
num_channels=c_out, param_attr=gama, bias_attr=beta)
def forward(self, x):
x = fluid.layers.relu(x)
x = self.conv1(x)
x = self.conv2(x)
out = self.bn1(x)
return out
class ReLUConvBN(fluid.dygraph.Layer):
def __init__(self, c_in, c_out, kernel_size, stride, padding, affine=True):
super(ReLUConvBN, self).__init__()
self.conv = Conv2D(
num_channels=c_in,
num_filters=c_out,
filter_size=kernel_size,
stride=stride,
padding=padding,
param_attr=fluid.ParamAttr(initializer=MSRAInitializer()),
bias_attr=False)
gama, beta = bn_param_config(affine)
self.bn = BatchNorm(
num_channels=c_out, param_attr=gama, bias_attr=beta)
def forward(self, x):
x = fluid.layers.relu(x)
x = self.conv(x)
out = self.bn(x)
return out
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from PIL import Image, ImageOps
import os
import math
import random
import tarfile
import functools
import numpy as np
from PIL import ImageEnhance
import paddle
# for python2/python3 compatibility
try:
    import cPickle
except ImportError:
    import _pickle as cPickle
IMAGE_SIZE = 32
IMAGE_DEPTH = 3
CIFAR_MEAN = [0.49139968, 0.48215827, 0.44653124]
CIFAR_STD = [0.24703233, 0.24348505, 0.26158768]
CIFAR10_URL = 'https://dataset.bj.bcebos.com/cifar%2Fcifar-10-python.tar.gz'
CIFAR10_MD5 = 'c58f30108f718f92721af3b95e74349a'
paddle.dataset.common.DATA_HOME = "dataset/"
THREAD = 16
BUF_SIZE = 10240
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406]).reshape((3, 1, 1))
IMAGENET_STD = np.array([0.229, 0.224, 0.225]).reshape((3, 1, 1))
IMAGENET_DIM = 224
def preprocess(sample, is_training, args):
image_array = sample.reshape(IMAGE_DEPTH, IMAGE_SIZE, IMAGE_SIZE)
rgb_array = np.transpose(image_array, (1, 2, 0))
img = Image.fromarray(rgb_array, 'RGB')
if is_training:
        # pad, random crop, random flip left-right
img = ImageOps.expand(img, (4, 4, 4, 4), fill=0)
left_top = np.random.randint(8, size=2)
img = img.crop((left_top[1], left_top[0], left_top[1] + IMAGE_SIZE,
left_top[0] + IMAGE_SIZE))
if np.random.randint(2):
img = img.transpose(Image.FLIP_LEFT_RIGHT)
img = np.array(img).astype(np.float32)
img_float = img / 255.0
img = (img_float - CIFAR_MEAN) / CIFAR_STD
if is_training and args.cutout:
center = np.random.randint(IMAGE_SIZE, size=2)
offset_width = max(0, center[0] - args.cutout_length // 2)
offset_height = max(0, center[1] - args.cutout_length // 2)
target_width = min(center[0] + args.cutout_length // 2, IMAGE_SIZE)
target_height = min(center[1] + args.cutout_length // 2, IMAGE_SIZE)
for i in range(offset_height, target_height):
for j in range(offset_width, target_width):
img[i][j][:] = 0.0
img = np.transpose(img, (2, 0, 1))
return img
def reader_generator(datasets, batch_size, is_training, is_shuffle, args):
def read_batch(datasets, args):
if is_shuffle:
random.shuffle(datasets)
for im, label in datasets:
im = preprocess(im, is_training, args)
yield im, [int(label)]
def reader():
batch_data = []
batch_label = []
for data in read_batch(datasets, args):
batch_data.append(data[0])
batch_label.append(data[1])
if len(batch_data) == batch_size:
batch_data = np.array(batch_data, dtype='float32')
batch_label = np.array(batch_label, dtype='int64')
batch_out = [batch_data, batch_label]
yield batch_out
batch_data = []
batch_label = []
return reader
def cifar10_reader(file_name, data_name, is_shuffle, args):
with tarfile.open(file_name, mode='r') as f:
names = [
each_item.name for each_item in f if data_name in each_item.name
]
names.sort()
datasets = []
for name in names:
print("Reading file " + name)
            try:
                batch = cPickle.load(
                    f.extractfile(name), encoding='iso-8859-1')
            except TypeError:
                # Python 2's cPickle.load has no encoding argument
                batch = cPickle.load(f.extractfile(name))
data = batch['data']
labels = batch.get('labels', batch.get('fine_labels', None))
assert labels is not None
dataset = zip(data, labels)
datasets.extend(dataset)
if is_shuffle:
random.shuffle(datasets)
return datasets
def train_search(batch_size, train_portion, is_shuffle, args):
datasets = cifar10_reader(
paddle.dataset.common.download(CIFAR10_URL, 'cifar', CIFAR10_MD5),
'data_batch', is_shuffle, args)
split_point = int(np.floor(train_portion * len(datasets)))
train_datasets = datasets[:split_point]
val_datasets = datasets[split_point:]
reader = [
reader_generator(train_datasets, batch_size, True, True, args),
reader_generator(val_datasets, batch_size, True, True, args)
]
return reader
def train_valid(batch_size, is_train, is_shuffle, args):
name = 'data_batch' if is_train else 'test_batch'
datasets = cifar10_reader(
paddle.dataset.common.download(CIFAR10_URL, 'cifar', CIFAR10_MD5),
name, is_shuffle, args)
reader = reader_generator(datasets, batch_size, is_train, is_shuffle, args)
return reader
def crop_image(img, target_size, center):
width, height = img.size
size = target_size
    if center:
        w_start = (width - size) // 2
        h_start = (height - size) // 2
else:
w_start = np.random.randint(0, width - size + 1)
h_start = np.random.randint(0, height - size + 1)
w_end = w_start + size
h_end = h_start + size
img = img.crop((w_start, h_start, w_end, h_end))
return img
def resize_short(img, target_size):
percent = float(target_size) / min(img.size[0], img.size[1])
resized_width = int(round(img.size[0] * percent))
resized_height = int(round(img.size[1] * percent))
img = img.resize((resized_width, resized_height), Image.LANCZOS)
return img
def random_crop(img, size, scale=[0.08, 1.0], ratio=[3. / 4., 4. / 3.]):
aspect_ratio = math.sqrt(np.random.uniform(*ratio))
w = 1. * aspect_ratio
h = 1. / aspect_ratio
bound = min((float(img.size[0]) / img.size[1]) / (w**2),
(float(img.size[1]) / img.size[0]) / (h**2))
scale_max = min(scale[1], bound)
scale_min = min(scale[0], bound)
target_area = img.size[0] * img.size[1] * np.random.uniform(scale_min,
scale_max)
target_size = math.sqrt(target_area)
w = int(target_size * w)
h = int(target_size * h)
i = np.random.randint(0, img.size[0] - w + 1)
j = np.random.randint(0, img.size[1] - h + 1)
img = img.crop((i, j, i + w, j + h))
img = img.resize((size, size), Image.BILINEAR)
return img
def distort_color(img):
def random_brightness(img, lower=0.5, upper=1.5):
e = np.random.uniform(lower, upper)
return ImageEnhance.Brightness(img).enhance(e)
def random_contrast(img, lower=0.5, upper=1.5):
e = np.random.uniform(lower, upper)
return ImageEnhance.Contrast(img).enhance(e)
def random_color(img, lower=0.5, upper=1.5):
e = np.random.uniform(lower, upper)
return ImageEnhance.Color(img).enhance(e)
ops = [random_brightness, random_contrast, random_color]
np.random.shuffle(ops)
img = ops[0](img)
img = ops[1](img)
img = ops[2](img)
return img
def process_image(sample, mode, color_jitter, rotate):
img_path = sample[0]
img = Image.open(img_path)
if mode == 'train':
img = random_crop(img, IMAGENET_DIM)
if np.random.randint(0, 2) == 1:
img = img.transpose(Image.FLIP_LEFT_RIGHT)
if color_jitter:
img = distort_color(img)
else:
img = resize_short(img, target_size=256)
img = crop_image(img, target_size=IMAGENET_DIM, center=True)
if img.mode != 'RGB':
img = img.convert('RGB')
img = np.array(img).astype('float32').transpose((2, 0, 1)) / 255
img -= IMAGENET_MEAN
img /= IMAGENET_STD
if mode == 'train' or mode == 'val':
return img, np.array([sample[1]], dtype='int64')
elif mode == 'test':
return [img]
def _reader_creator(file_list,
mode,
shuffle=False,
color_jitter=False,
rotate=False,
data_dir=None):
def reader():
try:
with open(file_list) as flist:
full_lines = [line.strip() for line in flist]
if shuffle:
np.random.shuffle(full_lines)
lines = full_lines
for line in lines:
if mode == 'train' or mode == 'val':
img_path, label = line.split()
img_path = os.path.join(data_dir, img_path)
yield img_path, int(label)
elif mode == 'test':
img_path = os.path.join(data_dir, line)
yield [img_path]
except Exception as e:
print("Reader failed!\n{}".format(str(e)))
os._exit(1)
mapper = functools.partial(
process_image, mode=mode, color_jitter=color_jitter, rotate=rotate)
return paddle.reader.xmap_readers(mapper, reader, THREAD, BUF_SIZE)
def imagenet_reader(data_dir, mode):
    if mode == 'train':
        shuffle = True
        suffix = 'train_list.txt'
    elif mode == 'val':
        shuffle = False
        suffix = 'val_list.txt'
file_list = os.path.join(data_dir, suffix)
return _reader_creator(file_list, mode, shuffle=shuffle, data_dir=data_dir)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import sys
import ast
import argparse
import functools
import paddle.fluid as fluid
from paddle.fluid.dygraph.base import to_variable
import reader
from model_search import Network
from paddleslim.nas.darts import DARTSearch
sys.path[0] = os.path.join(os.path.dirname(__file__), os.path.pardir)
from utility import add_arguments, print_arguments
parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
# yapf: disable
add_arg('log_freq', int, 50, "Log frequency.")
add_arg('use_multiprocess', bool, True, "Whether use multiprocess reader.")
add_arg('batch_size', int, 64, "Minibatch size.")
add_arg('learning_rate', float, 0.025, "The start learning rate.")
add_arg('momentum', float, 0.9, "Momentum.")
add_arg('use_gpu', bool, True, "Whether use GPU.")
add_arg('epochs', int, 50, "Epoch number.")
add_arg('init_channels', int, 16, "Init channel number.")
add_arg('layers', int, 8, "Total number of layers.")
add_arg('class_num', int, 10, "Class number of dataset.")
add_arg('trainset_num', int, 50000, "images number of trainset.")
add_arg('model_save_dir', str, 'search_cifar', "The path to save model.")
add_arg('grad_clip', float, 5, "Gradient clipping.")
add_arg('arch_learning_rate',float, 3e-4, "Learning rate for arch encoding.")
add_arg('method', str, 'DARTS', "The search method you would like to use")
add_arg('epochs_no_archopt', int, 0, "Number of epochs without architecture optimization.")
add_arg('cutout_length', int, 16, "Cutout length.")
add_arg('cutout', ast.literal_eval, False, "Whether use cutout.")
add_arg('unrolled', ast.literal_eval, False, "Use one-step unrolled validation loss")
add_arg('use_data_parallel', ast.literal_eval, False, "The flag indicating whether to use data parallel mode to train the model.")
# yapf: enable
def main(args):
if not args.use_gpu:
place = fluid.CPUPlace()
elif not args.use_data_parallel:
place = fluid.CUDAPlace(0)
else:
place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id)
train_reader, valid_reader = reader.train_search(
batch_size=args.batch_size,
train_portion=0.5,
is_shuffle=True,
args=args)
with fluid.dygraph.guard(place):
model = Network(args.init_channels, args.class_num, args.layers,
args.method)
searcher = DARTSearch(
model,
train_reader,
valid_reader,
place,
learning_rate=args.learning_rate,
batchsize=args.batch_size,
num_imgs=args.trainset_num,
arch_learning_rate=args.arch_learning_rate,
unrolled=args.unrolled,
num_epochs=args.epochs,
epochs_no_archopt=args.epochs_no_archopt,
use_multiprocess=args.use_multiprocess,
use_data_parallel=args.use_data_parallel,
save_dir=args.model_save_dir,
log_freq=args.log_freq)
searcher.train()
if __name__ == '__main__':
args = parser.parse_args()
print_arguments(args)
main(args)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import sys
import ast
import logging
import argparse
import functools
import paddle.fluid as fluid
from paddle.fluid.dygraph.base import to_variable
from paddleslim.common import AvgrageMeter, get_logger
from paddleslim.nas.darts import count_parameters_in_MB
import genotypes
import reader
from model import NetworkCIFAR as Network
sys.path[0] = os.path.join(os.path.dirname(__file__), os.path.pardir)
from utility import add_arguments, print_arguments
logger = get_logger(__name__, level=logging.INFO)
parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
# yapf: disable
add_arg('use_multiprocess', bool, True, "Whether use multiprocess reader.")
add_arg('data', str, 'dataset/cifar10',"The dir of dataset.")
add_arg('batch_size', int, 96, "Minibatch size.")
add_arg('learning_rate', float, 0.025, "The start learning rate.")
add_arg('momentum', float, 0.9, "Momentum.")
add_arg('weight_decay', float, 3e-4, "Weight_decay.")
add_arg('use_gpu', bool, True, "Whether use GPU.")
add_arg('epochs', int, 600, "Epoch number.")
add_arg('init_channels', int, 36, "Init channel number.")
add_arg('layers', int, 20, "Total number of layers.")
add_arg('class_num', int, 10, "Class number of dataset.")
add_arg('trainset_num', int, 50000, "images number of trainset.")
add_arg('model_save_dir', str, 'eval_cifar', "The path to save model.")
add_arg('cutout', bool, True, 'Whether use cutout.')
add_arg('cutout_length', int, 16, "Cutout length.")
add_arg('auxiliary', bool, True, 'Use auxiliary tower.')
add_arg('auxiliary_weight', float, 0.4, "Weight for auxiliary loss.")
add_arg('drop_path_prob', float, 0.2, "Drop path probability.")
add_arg('grad_clip', float, 5, "Gradient clipping.")
add_arg('arch', str, 'DARTS_V2', "Which architecture to use")
add_arg('log_freq', int, 50, 'Report frequency')
add_arg('use_data_parallel', ast.literal_eval, False, "The flag indicating whether to use data parallel mode to train the model.")
# yapf: enable
def train(model, train_reader, optimizer, epoch, drop_path_prob, args):
objs = AvgrageMeter()
top1 = AvgrageMeter()
top5 = AvgrageMeter()
model.train()
for step_id, data in enumerate(train_reader()):
image_np, label_np = data
image = to_variable(image_np)
label = to_variable(label_np)
label.stop_gradient = True
logits, logits_aux = model(image, drop_path_prob, True)
prec1 = fluid.layers.accuracy(input=logits, label=label, k=1)
prec5 = fluid.layers.accuracy(input=logits, label=label, k=5)
loss = fluid.layers.reduce_mean(
fluid.layers.softmax_with_cross_entropy(logits, label))
if args.auxiliary:
loss_aux = fluid.layers.reduce_mean(
fluid.layers.softmax_with_cross_entropy(logits_aux, label))
loss = loss + args.auxiliary_weight * loss_aux
if args.use_data_parallel:
loss = model.scale_loss(loss)
loss.backward()
model.apply_collective_grads()
else:
loss.backward()
optimizer.minimize(loss)
model.clear_gradients()
n = image.shape[0]
objs.update(loss.numpy(), n)
top1.update(prec1.numpy(), n)
top5.update(prec5.numpy(), n)
if step_id % args.log_freq == 0:
logger.info(
"Train Epoch {}, Step {}, loss {:.6f}, acc_1 {:.6f}, acc_5 {:.6f}".
format(epoch, step_id, objs.avg[0], top1.avg[0], top5.avg[0]))
return top1.avg[0]
def valid(model, valid_reader, epoch, args):
objs = AvgrageMeter()
top1 = AvgrageMeter()
top5 = AvgrageMeter()
model.eval()
for step_id, data in enumerate(valid_reader()):
image_np, label_np = data
image = to_variable(image_np)
label = to_variable(label_np)
logits, _ = model(image, 0, False)
prec1 = fluid.layers.accuracy(input=logits, label=label, k=1)
prec5 = fluid.layers.accuracy(input=logits, label=label, k=5)
loss = fluid.layers.reduce_mean(
fluid.layers.softmax_with_cross_entropy(logits, label))
n = image.shape[0]
objs.update(loss.numpy(), n)
top1.update(prec1.numpy(), n)
top5.update(prec5.numpy(), n)
if step_id % args.log_freq == 0:
logger.info(
"Valid Epoch {}, Step {}, loss {:.6f}, acc_1 {:.6f}, acc_5 {:.6f}".
format(epoch, step_id, objs.avg[0], top1.avg[0], top5.avg[0]))
return top1.avg[0]
def main(args):
place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id) \
if args.use_data_parallel else fluid.CUDAPlace(0)
with fluid.dygraph.guard(place):
        genotype = getattr(genotypes, args.arch)
model = Network(
C=args.init_channels,
num_classes=args.class_num,
layers=args.layers,
auxiliary=args.auxiliary,
genotype=genotype)
logger.info("param size = {:.6f}MB".format(
count_parameters_in_MB(model.parameters())))
device_num = fluid.dygraph.parallel.Env().nranks
step_per_epoch = int(args.trainset_num /
(args.batch_size * device_num))
learning_rate = fluid.dygraph.CosineDecay(args.learning_rate,
step_per_epoch, args.epochs)
clip = fluid.clip.GradientClipByGlobalNorm(clip_norm=args.grad_clip)
optimizer = fluid.optimizer.MomentumOptimizer(
learning_rate,
momentum=args.momentum,
regularization=fluid.regularizer.L2Decay(args.weight_decay),
parameter_list=model.parameters(),
grad_clip=clip)
if args.use_data_parallel:
strategy = fluid.dygraph.parallel.prepare_context()
model = fluid.dygraph.parallel.DataParallel(model, strategy)
train_loader = fluid.io.DataLoader.from_generator(
capacity=64,
use_double_buffer=True,
iterable=True,
return_list=True,
use_multiprocess=args.use_multiprocess)
valid_loader = fluid.io.DataLoader.from_generator(
capacity=64,
use_double_buffer=True,
iterable=True,
return_list=True,
use_multiprocess=args.use_multiprocess)
train_reader = reader.train_valid(
batch_size=args.batch_size,
is_train=True,
is_shuffle=True,
args=args)
valid_reader = reader.train_valid(
batch_size=args.batch_size,
is_train=False,
is_shuffle=False,
args=args)
if args.use_data_parallel:
train_reader = fluid.contrib.reader.distributed_batch_reader(
train_reader)
train_loader.set_batch_generator(train_reader, places=place)
valid_loader.set_batch_generator(valid_reader, places=place)
save_parameters = (not args.use_data_parallel) or (
args.use_data_parallel and
fluid.dygraph.parallel.Env().local_rank == 0)
best_acc = 0
for epoch in range(args.epochs):
drop_path_prob = args.drop_path_prob * epoch / args.epochs
logger.info('Epoch {}, lr {:.6f}'.format(
epoch, optimizer.current_step_lr()))
train_top1 = train(model, train_loader, optimizer, epoch,
drop_path_prob, args)
logger.info("Epoch {}, train_acc {:.6f}".format(epoch, train_top1))
valid_top1 = valid(model, valid_loader, epoch, args)
if valid_top1 > best_acc:
best_acc = valid_top1
if save_parameters:
fluid.save_dygraph(model.state_dict(),
args.model_save_dir + "/best_model")
logger.info("Epoch {}, valid_acc {:.6f}, best_valid_acc {:.6f}".
format(epoch, valid_top1, best_acc))
if __name__ == '__main__':
args = parser.parse_args()
print_arguments(args)
main(args)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import sys
import ast
import logging
import argparse
import functools
import paddle.fluid as fluid
from paddle.fluid.dygraph.base import to_variable
from paddleslim.common import AvgrageMeter, get_logger
from paddleslim.nas.darts import count_parameters_in_MB
import genotypes
import reader
from model import NetworkImageNet as Network
sys.path[0] = os.path.join(os.path.dirname(__file__), os.path.pardir)
from utility import add_arguments, print_arguments
logger = get_logger(__name__, level=logging.INFO)
parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
# yapf: disable
add_arg('use_multiprocess', bool, True, "Whether use multiprocess reader.")
add_arg('num_workers', int, 4, "The multiprocess reader number.")
add_arg('data_dir', str, 'dataset/ILSVRC2012',"The dir of dataset.")
add_arg('batch_size', int, 128, "Minibatch size.")
add_arg('learning_rate', float, 0.1, "The start learning rate.")
add_arg('decay_rate', float, 0.97, "The lr decay rate.")
add_arg('momentum', float, 0.9, "Momentum.")
add_arg('weight_decay', float, 3e-5, "Weight_decay.")
add_arg('use_gpu', bool, True, "Whether use GPU.")
add_arg('epochs', int, 250, "Epoch number.")
add_arg('init_channels', int, 48, "Init channel number.")
add_arg('layers', int, 14, "Total number of layers.")
add_arg('class_num', int, 1000, "Class number of dataset.")
add_arg('trainset_num', int, 1281167, "Images number of trainset.")
add_arg('model_save_dir', str, 'eval_imagenet', "The path to save model.")
add_arg('auxiliary', bool, True, 'Use auxiliary tower.')
add_arg('auxiliary_weight', float, 0.4, "Weight for auxiliary loss.")
add_arg('drop_path_prob', float, 0.0, "Drop path probability.")
add_arg('dropout', float, 0.0, "Dropout probability.")
add_arg('grad_clip', float, 5, "Gradient clipping.")
add_arg('label_smooth', float, 0.1, "Label smoothing.")
add_arg('arch', str, 'DARTS_V2', "Which architecture to use")
add_arg('log_freq', int, 100, 'Report frequency')
add_arg('use_data_parallel', ast.literal_eval, False, "The flag indicating whether to use data parallel mode to train the model.")
# yapf: enable
def cross_entropy_label_smooth(preds, targets, epsilon):
preds = fluid.layers.softmax(preds)
targets_one_hot = fluid.one_hot(input=targets, depth=args.class_num)
targets_smooth = fluid.layers.label_smooth(
targets_one_hot, epsilon=epsilon, dtype="float32")
loss = fluid.layers.cross_entropy(
input=preds, label=targets_smooth, soft_label=True)
return loss
def train(model, train_reader, optimizer, epoch, args):
objs = AvgrageMeter()
top1 = AvgrageMeter()
top5 = AvgrageMeter()
model.train()
for step_id, data in enumerate(train_reader()):
image_np, label_np = data
image = to_variable(image_np)
label = to_variable(label_np)
label.stop_gradient = True
logits, logits_aux = model(image, True)
prec1 = fluid.layers.accuracy(input=logits, label=label, k=1)
prec5 = fluid.layers.accuracy(input=logits, label=label, k=5)
loss = fluid.layers.reduce_mean(
cross_entropy_label_smooth(logits, label, args.label_smooth))
if args.auxiliary:
loss_aux = fluid.layers.reduce_mean(
cross_entropy_label_smooth(logits_aux, label,
args.label_smooth))
loss = loss + args.auxiliary_weight * loss_aux
if args.use_data_parallel:
loss = model.scale_loss(loss)
loss.backward()
model.apply_collective_grads()
else:
loss.backward()
optimizer.minimize(loss)
model.clear_gradients()
n = image.shape[0]
objs.update(loss.numpy(), n)
top1.update(prec1.numpy(), n)
top5.update(prec5.numpy(), n)
if step_id % args.log_freq == 0:
logger.info(
"Train Epoch {}, Step {}, loss {:.6f}, acc_1 {:.6f}, acc_5 {:.6f}".
format(epoch, step_id, objs.avg[0], top1.avg[0], top5.avg[0]))
return top1.avg[0], top5.avg[0]
def valid(model, valid_reader, epoch, args):
objs = AvgrageMeter()
top1 = AvgrageMeter()
top5 = AvgrageMeter()
model.eval()
for step_id, data in enumerate(valid_reader()):
image_np, label_np = data
image = to_variable(image_np)
label = to_variable(label_np)
logits, _ = model(image, False)
prec1 = fluid.layers.accuracy(input=logits, label=label, k=1)
prec5 = fluid.layers.accuracy(input=logits, label=label, k=5)
loss = fluid.layers.reduce_mean(
cross_entropy_label_smooth(logits, label, args.label_smooth))
n = image.shape[0]
objs.update(loss.numpy(), n)
top1.update(prec1.numpy(), n)
top5.update(prec5.numpy(), n)
if step_id % args.log_freq == 0:
logger.info(
"Valid Epoch {}, Step {}, loss {:.6f}, acc_1 {:.6f}, acc_5 {:.6f}".
format(epoch, step_id, objs.avg[0], top1.avg[0], top5.avg[0]))
return top1.avg[0], top5.avg[0]
def main(args):
place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id) \
if args.use_data_parallel else fluid.CUDAPlace(0)
with fluid.dygraph.guard(place):
        genotype = getattr(genotypes, args.arch)
model = Network(
C=args.init_channels,
num_classes=args.class_num,
layers=args.layers,
auxiliary=args.auxiliary,
genotype=genotype)
logger.info("param size = {:.6f}MB".format(
count_parameters_in_MB(model.parameters())))
device_num = fluid.dygraph.parallel.Env().nranks
step_per_epoch = int(args.trainset_num /
(args.batch_size * device_num))
learning_rate = fluid.dygraph.ExponentialDecay(
args.learning_rate,
step_per_epoch,
args.decay_rate,
staircase=True)
clip = fluid.clip.GradientClipByGlobalNorm(clip_norm=args.grad_clip)
optimizer = fluid.optimizer.MomentumOptimizer(
learning_rate,
momentum=args.momentum,
regularization=fluid.regularizer.L2Decay(args.weight_decay),
parameter_list=model.parameters(),
grad_clip=clip)
if args.use_data_parallel:
strategy = fluid.dygraph.parallel.prepare_context()
model = fluid.dygraph.parallel.DataParallel(model, strategy)
train_loader = fluid.io.DataLoader.from_generator(
capacity=64,
use_double_buffer=True,
iterable=True,
return_list=True)
valid_loader = fluid.io.DataLoader.from_generator(
capacity=64,
use_double_buffer=True,
iterable=True,
return_list=True)
train_reader = fluid.io.batch(
reader.imagenet_reader(args.data_dir, 'train'),
batch_size=args.batch_size,
drop_last=True)
valid_reader = fluid.io.batch(
reader.imagenet_reader(args.data_dir, 'val'),
batch_size=args.batch_size)
if args.use_data_parallel:
train_reader = fluid.contrib.reader.distributed_batch_reader(
train_reader)
train_loader.set_sample_list_generator(train_reader, places=place)
valid_loader.set_sample_list_generator(valid_reader, places=place)
save_parameters = (not args.use_data_parallel) or (
args.use_data_parallel and
fluid.dygraph.parallel.Env().local_rank == 0)
best_top1 = 0
for epoch in range(args.epochs):
logger.info('Epoch {}, lr {:.6f}'.format(
epoch, optimizer.current_step_lr()))
train_top1, train_top5 = train(model, train_loader, optimizer,
epoch, args)
logger.info("Epoch {}, train_top1 {:.6f}, train_top5 {:.6f}".
format(epoch, train_top1, train_top5))
valid_top1, valid_top5 = valid(model, valid_loader, epoch, args)
if valid_top1 > best_top1:
best_top1 = valid_top1
if save_parameters:
fluid.save_dygraph(model.state_dict(),
args.model_save_dir + "/best_model")
logger.info(
"Epoch {}, valid_top1 {:.6f}, valid_top5 {:.6f}, best_valid_top1 {:6f}".
format(epoch, valid_top1, valid_top5, best_top1))
if __name__ == '__main__':
args = parser.parse_args()
print_arguments(args)
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import sys
import genotypes
from graphviz import Digraph
def plot(genotype_normal, genotype_reduce, filename):
g = Digraph(
format='png',
edge_attr=dict(fontname="times"),
node_attr=dict(
style='filled',
shape='ellipse',
align='center',
height='0.5',
width='0.5',
penwidth='2',
fontname="times"),
engine='dot')
g.body.extend(['rankdir=LR'])
g.node("reduce_c_{k-2}", fillcolor='darkseagreen2')
g.node("reduce_c_{k-1}", fillcolor='darkseagreen2')
g.node("normal_c_{k-2}", fillcolor='darkseagreen2')
g.node("normal_c_{k-1}", fillcolor='darkseagreen2')
assert len(genotype_normal) % 2 == 0
steps = len(genotype_normal) // 2
for i in range(steps):
g.node('n_' + str(i), fillcolor='lightblue')
for i in range(steps):
g.node('r_' + str(i), fillcolor='lightblue')
for i in range(steps):
for k in [2 * i, 2 * i + 1]:
op, j = genotype_normal[k]
if j == 0:
u = "normal_c_{k-2}"
elif j == 1:
u = "normal_c_{k-1}"
else:
u = 'n_' + str(j - 2)
v = 'n_' + str(i)
g.edge(u, v, label=op, fillcolor="gray")
for i in range(steps):
for k in [2 * i, 2 * i + 1]:
op, j = genotype_reduce[k]
if j == 0:
u = "reduce_c_{k-2}"
elif j == 1:
u = "reduce_c_{k-1}"
else:
u = 'r_' + str(j - 2)
v = 'r_' + str(i)
g.edge(u, v, label=op, fillcolor="gray")
g.node("r_c_{k}", fillcolor='palegoldenrod')
for i in range(steps):
g.edge('r_' + str(i), "r_c_{k}", fillcolor="gray")
g.node("n_c_{k}", fillcolor='palegoldenrod')
for i in range(steps):
g.edge('n_' + str(i), "n_c_{k}", fillcolor="gray")
g.render(filename, view=False)
if __name__ == '__main__':
if len(sys.argv) != 2:
print("usage:\n python {} ARCH_NAME".format(sys.argv[0]))
sys.exit(1)
genotype_name = sys.argv[1]
try:
        genotype = getattr(genotypes, genotype_name)
except AttributeError:
print("{} is not specified in genotypes.py".format(genotype_name))
sys.exit(1)
plot(genotype.normal, genotype.reduce, genotype_name)
# Deep Mutual Learning (DML)
This example shows how to train models with PaddleSlim's Deep Mutual Learning (DML) method. For the algorithm, see the paper [Deep Mutual Learning](https://arxiv.org/abs/1706.00384)
![dml_architect](./images/dml_architect.png)
## Dataset
The example trains on the CIFAR-100 dataset. You can let it download automatically when training starts,
or download the [dataset](https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz) yourself and place it under `./dataset/cifar` in the current directory
## Commands
### Train a MobileNet-MobileNet pair
Using GPU 0 as an example:
```bash
CUDA_VISIBLE_DEVICES=0 python dml_train.py
```
### Train a MobileNet-ResNet50 pair
Using GPU 0 as an example:
```bash
CUDA_VISIBLE_DEVICES=0 python dml_train.py --models='mobilenet-resnet50'
```
## Results
The results below were obtained with the default settings (learning rate, optimizer, etc.); only the combination of models trained with DML was varied.
To improve them further, try [more training tricks](https://arxiv.org/abs/1812.01187), or increase the number of models trained jointly in one DML run.
| Dataset | Models | Accuracy (independent) | Accuracy (DML) |
| ------ | ------ | ------ | ------ |
| CIFAR100 | MobileNet X 2 | 73.65% | 76.34% (+2.69%) |
| CIFAR100 | MobileNet X 4 | 73.65% | 76.56% (+2.91%) |
| CIFAR100 | MobileNet + ResNet50 | 73.65%/76.52% | 76.00%/77.80% (+2.35%/+1.28%) |
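The DML objective behind these numbers pairs each model's supervised loss with a KL-divergence term pulling it toward its peer's predictions. Below is a minimal NumPy sketch of the two-peer case; the helper names and shapes are illustrative, not PaddleSlim's `DML` API:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def kl_div(p, q, eps=1e-12):
    # KL(p || q), averaged over the batch
    return np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1))

def dml_losses(logits_a, logits_b, labels):
    """Each peer minimizes its own cross entropy plus the KL divergence
    toward the other peer's predicted distribution."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    n = len(labels)
    ce_a = -np.mean(np.log(pa[np.arange(n), labels] + 1e-12))
    ce_b = -np.mean(np.log(pb[np.arange(n), labels] + 1e-12))
    loss_a = ce_a + kl_div(pb, pa)  # peer A mimics B
    loss_b = ce_b + kl_div(pa, pb)  # peer B mimics A
    return loss_a, loss_b
```

In the example script itself, `dml_model.loss(logits, labels)` computes the per-peer losses and `dml_optimizer.minimize(losses)` updates all models in one call.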
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from PIL import Image
from PIL import ImageOps
import os
import math
import random
import tarfile
import functools
import numpy as np
from PIL import Image, ImageEnhance
import paddle
# for python2/python3 compatibility
try:
    import cPickle
except ImportError:
    import _pickle as cPickle
IMAGE_SIZE = 32
IMAGE_DEPTH = 3
CIFAR_MEAN = [0.5070751592371323, 0.48654887331495095, 0.4409178433670343]
CIFAR_STD = [0.2673342858792401, 0.2564384629170883, 0.27615047132568404]
URL_PREFIX = 'https://www.cs.toronto.edu/~kriz/'
CIFAR100_URL = URL_PREFIX + 'cifar-100-python.tar.gz'
CIFAR100_MD5 = 'eb9058c3a382ffc7106e4002c42a8d85'
paddle.dataset.common.DATA_HOME = "dataset/"
def preprocess(sample, is_training):
image_array = sample.reshape(IMAGE_DEPTH, IMAGE_SIZE, IMAGE_SIZE)
rgb_array = np.transpose(image_array, (1, 2, 0))
img = Image.fromarray(rgb_array, 'RGB')
if is_training:
        # pad, random crop, random_flip_left_right, random_rotation
img = ImageOps.expand(img, (4, 4, 4, 4), fill=0)
left_top = np.random.randint(8, size=2)
img = img.crop((left_top[1], left_top[0], left_top[1] + IMAGE_SIZE,
left_top[0] + IMAGE_SIZE))
if np.random.randint(2):
img = img.transpose(Image.FLIP_LEFT_RIGHT)
random_angle = np.random.randint(-15, 15)
img = img.rotate(random_angle, Image.NEAREST)
img = np.array(img).astype(np.float32)
img_float = img / 255.0
img = (img_float - CIFAR_MEAN) / CIFAR_STD
img = np.transpose(img, (2, 0, 1))
return img
def reader_generator(datasets, batch_size, is_training, is_shuffle):
def read_batch(datasets):
if is_shuffle:
random.shuffle(datasets)
for im, label in datasets:
im = preprocess(im, is_training)
yield im, [int(label)]
def reader():
batch_data = []
batch_label = []
for data in read_batch(datasets):
batch_data.append(data[0])
batch_label.append(data[1])
if len(batch_data) == batch_size:
batch_data = np.array(batch_data, dtype='float32')
batch_label = np.array(batch_label, dtype='int64')
batch_out = [batch_data, batch_label]
yield batch_out
batch_data = []
batch_label = []
return reader
def cifar100_reader(file_name, data_name, is_shuffle):
with tarfile.open(file_name, mode='r') as f:
names = [
each_item.name for each_item in f if data_name in each_item.name
]
names.sort()
datasets = []
for name in names:
print("Reading file " + name)
            try:
                batch = cPickle.load(f.extractfile(name), encoding='iso-8859-1')
            except TypeError:
                # Python 2's cPickle.load has no ``encoding`` argument
                batch = cPickle.load(f.extractfile(name))
data = batch['data']
labels = batch.get('labels', batch.get('fine_labels', None))
assert labels is not None
dataset = zip(data, labels)
datasets.extend(dataset)
if is_shuffle:
random.shuffle(datasets)
return datasets
def train_valid(batch_size, is_train, is_shuffle):
name = 'train' if is_train else 'test'
datasets = cifar100_reader(
paddle.dataset.common.download(CIFAR100_URL, 'cifar', CIFAR100_MD5),
name, is_shuffle)
reader = reader_generator(datasets, batch_size, is_train, is_shuffle)
return reader
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import sys
import argparse
import functools
import logging
import paddle.fluid as fluid
from paddle.fluid.dygraph.base import to_variable
from paddleslim.common import AvgrageMeter, get_logger
from paddleslim.dist import DML
from paddleslim.models.dygraph import MobileNetV1
from paddleslim.models.dygraph import ResNet
import cifar100_reader as reader
sys.path[0] = os.path.join(os.path.dirname(__file__), os.path.pardir)
from utility import add_arguments, print_arguments
logger = get_logger(__name__, level=logging.INFO)
parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
# yapf: disable
add_arg('log_freq', int, 100, "Log frequency.")
add_arg('models', str, "mobilenet-mobilenet", "The models to train jointly.")
add_arg('batch_size', int, 256, "Minibatch size.")
add_arg('init_lr', float, 0.1, "The start learning rate.")
add_arg('use_gpu', bool, True, "Whether use GPU.")
add_arg('epochs', int, 200, "Epoch number.")
add_arg('class_num', int, 100, "Class number of dataset.")
add_arg('trainset_num', int, 50000, "Images number of trainset.")
add_arg('model_save_dir', str, 'saved_models', "The path to save model.")
add_arg('use_parallel', bool, False, "Whether to use data parallel mode to train the model.")
# yapf: enable
def create_optimizer(models, args):
device_num = fluid.dygraph.parallel.Env().nranks
step = int(args.trainset_num / (args.batch_size * device_num))
epochs = [60, 120, 180]
bd = [step * e for e in epochs]
lr = [args.init_lr * (0.1**i) for i in range(len(bd) + 1)]
optimizers = []
for cur_model in models:
learning_rate = fluid.dygraph.PiecewiseDecay(bd, lr, 0)
opt = fluid.optimizer.MomentumOptimizer(
learning_rate,
0.9,
parameter_list=cur_model.parameters(),
use_nesterov=True,
regularization=fluid.regularizer.L2DecayRegularizer(5e-4))
optimizers.append(opt)
return optimizers
def create_reader(place, args):
train_reader = reader.train_valid(
batch_size=args.batch_size, is_train=True, is_shuffle=True)
valid_reader = reader.train_valid(
batch_size=args.batch_size, is_train=False, is_shuffle=False)
if args.use_parallel:
train_reader = fluid.contrib.reader.distributed_batch_reader(
train_reader)
train_loader = fluid.io.DataLoader.from_generator(
capacity=1024, return_list=True)
valid_loader = fluid.io.DataLoader.from_generator(
capacity=1024, return_list=True)
train_loader.set_batch_generator(train_reader, places=place)
valid_loader.set_batch_generator(valid_reader, places=place)
return train_loader, valid_loader
def train(train_loader, dml_model, dml_optimizer, args):
dml_model.train()
costs = [AvgrageMeter() for i in range(dml_model.model_num)]
accs = [AvgrageMeter() for i in range(dml_model.model_num)]
for step_id, (images, labels) in enumerate(train_loader):
images, labels = to_variable(images), to_variable(labels)
batch_size = images.shape[0]
logits = dml_model.forward(images)
precs = [
fluid.layers.accuracy(
input=l, label=labels, k=1) for l in logits
]
losses = dml_model.loss(logits, labels)
dml_optimizer.minimize(losses)
for i in range(dml_model.model_num):
accs[i].update(precs[i].numpy(), batch_size)
costs[i].update(losses[i].numpy(), batch_size)
model_names = dml_model.full_name()
if step_id % args.log_freq == 0:
log_msg = "Train Step {}".format(step_id)
for model_id, (cost, acc) in enumerate(zip(costs, accs)):
log_msg += ", {} loss: {:.6f} acc: {:.6f}".format(
model_names[model_id], cost.avg[0], acc.avg[0])
logger.info(log_msg)
return costs, accs
def valid(valid_loader, dml_model, args):
dml_model.eval()
costs = [AvgrageMeter() for i in range(dml_model.model_num)]
accs = [AvgrageMeter() for i in range(dml_model.model_num)]
for step_id, (images, labels) in enumerate(valid_loader):
images, labels = to_variable(images), to_variable(labels)
batch_size = images.shape[0]
logits = dml_model.forward(images)
precs = [
fluid.layers.accuracy(
input=l, label=labels, k=1) for l in logits
]
losses = dml_model.loss(logits, labels)
for i in range(dml_model.model_num):
accs[i].update(precs[i].numpy(), batch_size)
costs[i].update(losses[i].numpy(), batch_size)
model_names = dml_model.full_name()
if step_id % args.log_freq == 0:
            log_msg = "Valid Step {}".format(step_id)
for model_id, (cost, acc) in enumerate(zip(costs, accs)):
log_msg += ", {} loss: {:.6f} acc: {:.6f}".format(
model_names[model_id], cost.avg[0], acc.avg[0])
logger.info(log_msg)
return costs, accs
def main(args):
if not args.use_gpu:
place = fluid.CPUPlace()
elif not args.use_parallel:
place = fluid.CUDAPlace(0)
else:
place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id)
with fluid.dygraph.guard(place):
# 1. Define data reader
train_loader, valid_loader = create_reader(place, args)
# 2. Define neural network
if args.models == "mobilenet-mobilenet":
models = [
MobileNetV1(class_dim=args.class_num),
MobileNetV1(class_dim=args.class_num)
]
elif args.models == "mobilenet-resnet50":
models = [
MobileNetV1(class_dim=args.class_num),
ResNet(class_dim=args.class_num)
]
else:
logger.info("You can define the model as you wish")
return
optimizers = create_optimizer(models, args)
# 3. Use PaddleSlim DML strategy
dml_model = DML(models, args.use_parallel)
dml_optimizer = dml_model.opt(optimizers)
# 4. Train your network
save_parameters = (not args.use_parallel) or (
args.use_parallel and fluid.dygraph.parallel.Env().local_rank == 0)
best_valid_acc = [0] * dml_model.model_num
for epoch_id in range(args.epochs):
current_step_lr = dml_optimizer.get_lr()
lr_msg = "Epoch {}".format(epoch_id)
for model_id, lr in enumerate(current_step_lr):
lr_msg += ", {} lr: {:.6f}".format(
dml_model.full_name()[model_id], lr)
logger.info(lr_msg)
train_losses, train_accs = train(train_loader, dml_model,
dml_optimizer, args)
valid_losses, valid_accs = valid(valid_loader, dml_model, args)
for i in range(dml_model.model_num):
if valid_accs[i].avg[0] > best_valid_acc[i]:
best_valid_acc[i] = valid_accs[i].avg[0]
if save_parameters:
fluid.save_dygraph(
models[i].state_dict(),
os.path.join(args.model_save_dir,
dml_model.full_name()[i],
"best_model"))
summery_msg = "Epoch {} {}: valid_loss {:.6f}, valid_acc {:.6f}, best_valid_acc {:.6f}"
logger.info(
summery_msg.format(epoch_id,
dml_model.full_name()[i], valid_losses[
i].avg[0], valid_accs[i].avg[0],
best_valid_acc[i]))
if __name__ == '__main__':
args = parser.parse_args()
print_arguments(args)
main(args)
# Knowledge Distillation Example

This example shows how to train a model with the knowledge distillation API. Compared with a baseline model trained without distillation, the distilled model achieves higher accuracy.

## API Introduction

See the [knowledge distillation API documentation](https://paddlepaddle.github.io/PaddleSlim/api/single_distiller_api/).

### 1. Distillation training configuration

This example uses ResNet50_vd as the teacher model to distill a student network with the MobileNet architecture.

Default configuration:
```yaml
batch_size: 256
init_lr: 0.1
lr_strategy: piecewise_decay
l2_decay: 3e-5
momentum_rate: 0.9
num_epochs: 120
data: imagenet
```
Training can be launched with the default configuration.
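The `piecewise_decay` strategy in the configuration above drops the learning rate by a factor of 10 at fixed epoch boundaries. A minimal sketch of how the step boundaries and learning-rate values can be derived (the epoch boundaries `[30, 60, 90]` match this demo's `--step_epochs` default, and the image count is the script's ImageNet `total_images` default):

```python
import math

def piecewise_decay_schedule(base_lr, total_images, batch_size, step_epochs):
    # Steps per epoch: ceil(total_images / batch_size).
    steps_per_epoch = int(math.ceil(float(total_images) / batch_size))
    # Boundaries are expressed in global steps, not epochs.
    boundaries = [steps_per_epoch * e for e in step_epochs]
    # One more value than boundaries: lr, lr/10, lr/100, lr/1000.
    values = [base_lr * (0.1 ** i) for i in range(len(boundaries) + 1)]
    return boundaries, values

boundaries, values = piecewise_decay_schedule(
    base_lr=0.1, total_images=1281167, batch_size=256, step_epochs=[30, 60, 90])
print(boundaries)  # global step indices at which the lr drops
print(values)      # lr used in each interval
```

The same `boundaries`/`values` pair is what `fluid.layers.piecewise_decay` consumes in the training script.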
### 2. Start training

After setting up the ImageNet dataset, launch training with:
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 python distill.py
```
### 3. Training results

Compared with the baseline model trained without distillation (Top-1/Top-5: 70.99%/89.68%), after 120 epochs of distillation training the MobileNet model reaches 72.77%/90.68% Top-1/Top-5 accuracy, a gain of +1.78%/+1.00%.

For detailed experimental data, see the [distillation section of the PaddleSlim model zoo](https://paddlepaddle.github.io/PaddleSlim/model_zoo/#13).
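The distillation objective used by `distill.py` combines the student's cross-entropy loss with a soft-label loss between teacher and student logits. A conceptual NumPy sketch of the soft-label term (an illustration of the idea, not PaddleSlim's exact implementation; the temperature of 1.0 is an assumed default):

```python
import numpy as np

def softmax(x):
    # Subtract the row max for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def soft_label_loss(teacher_logits, student_logits, temperature=1.0):
    """Cross-entropy between temperature-softened teacher and student outputs."""
    t = softmax(teacher_logits / temperature)
    s = softmax(student_logits / temperature)
    return float(-np.mean(np.sum(t * np.log(s + 1e-12), axis=-1)))

rng = np.random.default_rng(0)
teacher = rng.standard_normal((4, 10))
student = rng.standard_normal((4, 10))

# Matching the teacher exactly minimizes the loss (down to the teacher's
# entropy), since cross-entropy H(t, s) = H(t) + KL(t || s) >= H(t).
print(soft_label_loss(teacher, teacher) <= soft_label_loss(teacher, student))  # → True
```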
......@@ -11,7 +11,7 @@ import argparse
import functools
import numpy as np
import paddle.fluid as fluid
sys.path[0] = os.path.join(os.path.dirname(__file__), os.path.pardir)
import models
from utility import add_arguments, print_arguments, _download, _decompress
from paddleslim.dist import merge, l2_loss, soft_label_loss, fsp_loss
......@@ -23,8 +23,9 @@ _logger.setLevel(logging.INFO)
parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
# yapf: disable
add_arg('batch_size', int, 64, "Minibatch size.")
add_arg('use_gpu', bool, True, "Whether to use GPU or not.")
add_arg('save_inference', bool, False, "Whether to save inference model.")
add_arg('total_images', int, 1281167, "Training image number.")
add_arg('image_shape', str, "3,224,224", "Input image size")
add_arg('lr', float, 0.1, "The learning rate used to fine-tune pruned model.")
......@@ -32,12 +33,12 @@ add_arg('lr_strategy', str, "piecewise_decay", "The learning rate decay
add_arg('l2_decay', float, 3e-5, "The l2_decay parameter.")
add_arg('momentum_rate', float, 0.9, "The value of momentum_rate.")
add_arg('num_epochs', int, 120, "The number of total epochs.")
add_arg('data', str, "imagenet", "Which data to use. 'cifar10' or 'imagenet'")
add_arg('log_period', int, 20, "Log period in batches.")
add_arg('model', str, "MobileNet", "Set the network to use.")
add_arg('pretrained_model', str, None, "Whether to use pretrained model.")
add_arg('teacher_model', str, "ResNet50_vd", "Set the teacher network to use.")
add_arg('teacher_pretrained_model', str, "./ResNet50_vd_pretrained", "Pretrained teacher model path.")
parser.add_argument('--step_epochs', nargs='+', type=int, default=[30, 60, 90], help="piecewise decay step")
# yapf: enable
......@@ -45,7 +46,12 @@ model_list = [m for m in dir(models) if "__" not in m]
def piecewise_decay(args):
    if args.use_gpu:
        devices_num = fluid.core.get_cuda_device_count()
    else:
        devices_num = int(os.environ.get('CPU_NUM', 1))
    step = int(
        math.ceil(float(args.total_images) / args.batch_size) / devices_num)
bd = [step * e for e in args.step_epochs]
lr = [args.lr * (0.1**i) for i in range(len(bd) + 1)]
learning_rate = fluid.layers.piecewise_decay(boundaries=bd, values=lr)
......@@ -53,18 +59,23 @@ def piecewise_decay(args):
learning_rate=learning_rate,
momentum=args.momentum_rate,
regularization=fluid.regularizer.L2Decay(args.l2_decay))
    return learning_rate, optimizer
def cosine_decay(args):
    if args.use_gpu:
        devices_num = fluid.core.get_cuda_device_count()
    else:
        devices_num = int(os.environ.get('CPU_NUM', 1))
    step = int(
        math.ceil(float(args.total_images) / args.batch_size) / devices_num)
learning_rate = fluid.layers.cosine_decay(
learning_rate=args.lr, step_each_epoch=step, epochs=args.num_epochs)
optimizer = fluid.optimizer.Momentum(
learning_rate=learning_rate,
momentum=args.momentum_rate,
regularization=fluid.regularizer.L2Decay(args.l2_decay))
    return learning_rate, optimizer
def create_optimizer(args):
......@@ -118,16 +129,13 @@ def compress(args):
avg_cost = fluid.layers.mean(x=cost)
acc_top1 = fluid.layers.accuracy(input=out, label=label, k=1)
acc_top5 = fluid.layers.accuracy(input=out, label=label, k=5)
place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace()
exe = fluid.Executor(place)
    train_reader = paddle.fluid.io.batch(
        train_reader, batch_size=args.batch_size, drop_last=True)
    val_reader = paddle.fluid.io.batch(
        val_reader, batch_size=args.batch_size, drop_last=True)
val_program = student_program.clone(for_test=True)
......@@ -145,21 +153,23 @@ def compress(args):
name='image', shape=image_shape, dtype='float32')
predict = teacher_model.net(image, class_dim=class_dim)
exe.run(t_startup)
    if not os.path.exists(args.teacher_pretrained_model):
        _download(
            'http://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_vd_pretrained.tar',
            '.')
        _decompress('./ResNet50_vd_pretrained.tar')
assert args.teacher_pretrained_model and os.path.exists(
args.teacher_pretrained_model
), "teacher_pretrained_model should be set when teacher_model is not None."
    def if_exist(var):
        exist = os.path.exists(
            os.path.join(args.teacher_pretrained_model, var.name))
        if args.data == "cifar10" and (var.name == 'fc_0.w_0' or
                                       var.name == 'fc_0.b_0'):
            exist = False
        return exist
fluid.io.load_vars(
exe,
......@@ -168,35 +178,33 @@ def compress(args):
predicate=if_exist)
data_name_map = {'image': 'image'}
    merge(teacher_program, student_program, data_name_map, place)
    with fluid.program_guard(student_program, s_startup):
        distill_loss = soft_label_loss("teacher_fc_0.tmp_0", "fc_0.tmp_0",
                                       student_program)
        loss = avg_cost + distill_loss
        lr, opt = create_optimizer(args)
        opt.minimize(loss)
    exe.run(s_startup)
build_strategy = fluid.BuildStrategy()
build_strategy.fuse_all_reduce_ops = False
    parallel_main = fluid.CompiledProgram(student_program).with_data_parallel(
        loss_name=loss.name, build_strategy=build_strategy)
for epoch_id in range(args.num_epochs):
for step_id, data in enumerate(train_loader):
            lr_np, loss_1, loss_2, loss_3 = exe.run(
                parallel_main,
                feed=data,
                fetch_list=[
                    lr.name, loss.name, avg_cost.name, distill_loss.name
                ])
if step_id % args.log_period == 0:
_logger.info(
                    "train_epoch {} step {} lr {:.6f}, loss {:.6f}, class loss {:.6f}, distill loss {:.6f}".
                    format(epoch_id, step_id, lr_np[0], loss_1[0], loss_2[0],
                           loss_3[0]))
val_acc1s = []
val_acc5s = []
for step_id, data in enumerate(valid_loader):
......@@ -211,6 +219,10 @@ def compress(args):
"valid_epoch {} step {} loss {:.6f}, top1 {:.6f}, top5 {:.6f}".
format(epoch_id, step_id, val_loss[0], val_acc1[0],
val_acc5[0]))
if args.save_inference:
fluid.io.save_inference_model(
os.path.join("./saved_models", str(epoch_id)), ["image"],
[out], exe, student_program)
_logger.info("epoch {} top1 {:.6f}, top5 {:.6f}".format(
epoch_id, np.mean(val_acc1s), np.mean(val_acc5s)))
......
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# PaddleSlim Distillation: Introduction and Experiments\n",
"\n",
"In general, the more parameters and the more complex the structure of a model, the better it performs, but it is also more redundant and costs more compute and resources. **Knowledge distillation** is a method for compressing the useful information (dark knowledge) learned by a large model into a smaller, faster model whose results can match the large model's.\n",
"\n",
"In this tutorial, the large high-performing model is called the teacher, and the smaller, slightly weaker model is called the student. The example consists of the following steps:\n",
"\n",
"1. Import dependencies\n",
"2. Define student_program and teacher_program\n",
"3. Select feature maps\n",
"4. Merge the programs and add the distillation loss\n",
"5. Train the model\n",
"\n",
"\n",
"## 1. Import dependencies\n",
"PaddleSlim depends on Paddle 1.7. Make sure Paddle is installed correctly, then import Paddle, PaddleSlim, and the other dependencies as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"import paddle.fluid as fluid\n",
"import paddleslim as slim\n",
"import sys\n",
"sys.path.append(\"../\")\n",
"import models"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Define student_program and teacher_program\n",
"\n",
"This tutorial trains and validates knowledge distillation on the MNIST dataset, with input images of shape `[1, 28, 28]` and 10 output classes.\n",
"`ResNet50` is selected as the teacher to distill a student with the `MobileNet` architecture."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"model = models.__dict__['MobileNet']()\n",
"student_program = fluid.Program()\n",
"student_startup = fluid.Program()\n",
"with fluid.program_guard(student_program, student_startup):\n",
" image = fluid.data(\n",
" name='image', shape=[None] + [1, 28, 28], dtype='float32')\n",
" label = fluid.data(name='label', shape=[None, 1], dtype='int64')\n",
" out = model.net(input=image, class_dim=10)\n",
" cost = fluid.layers.cross_entropy(input=out, label=label)\n",
" avg_cost = fluid.layers.mean(x=cost)\n",
" acc_top1 = fluid.layers.accuracy(input=out, label=label, k=1)\n",
" acc_top5 = fluid.layers.accuracy(input=out, label=label, k=5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"teacher_model = models.__dict__['ResNet50']()\n",
"teacher_program = fluid.Program()\n",
"teacher_startup = fluid.Program()\n",
"with fluid.program_guard(teacher_program, teacher_startup):\n",
" with fluid.unique_name.guard():\n",
" image = fluid.data(\n",
" name='image', shape=[None] + [1, 28, 28], dtype='float32')\n",
" predict = teacher_model.net(image, class_dim=10)\n",
"exe = fluid.Executor(fluid.CPUPlace())\n",
"exe.run(teacher_startup)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Select feature maps\n",
"We can inspect all the Variables in student_program with its list_vars method and pick one or more Variables to fit the teacher's corresponding Variables."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# get all student variables\n",
"student_vars = []\n",
"for v in student_program.list_vars():\n",
" student_vars.append((v.name, v.shape))\n",
"#uncomment the following lines to observe student's variables for distillation\n",
"#print(\"=\"*50+\"student_model_vars\"+\"=\"*50)\n",
"#print(student_vars)\n",
"\n",
"# get all teacher variables\n",
"teacher_vars = []\n",
"for v in teacher_program.list_vars():\n",
" teacher_vars.append((v.name, v.shape))\n",
"#uncomment the following lines to observe teacher's variables for distillation\n",
"#print(\"=\"*50+\"teacher_model_vars\"+\"=\"*50)\n",
"#print(teacher_vars)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After filtering, we can see that 'bn5c_branch2b.output.1.tmp_3' in teacher_program and 'depthwise_conv2d_11.tmp_0' in student_program have matching shapes and can form a distillation loss."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Merge the programs and add the distillation loss\n",
"The merge operation adds all Variables and Ops of student_program and teacher_program into a single Program. To avoid name clashes between same-named variables in the two programs, merge also gives the Variables in teacher_program a uniform name prefix, name_prefix, whose default value is 'teacher_'.\n",
"\n",
"To ensure the teacher and student networks are fed the same input data, merge also merges the input data layers of the two programs, so a data_name_map must be specified mapping teacher input data names (keys) to student input data names (values)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_name_map = {'image': 'image'}\n",
"main = slim.dist.merge(teacher_program, student_program, data_name_map, fluid.CPUPlace())\n",
"with fluid.program_guard(student_program, student_startup):\n",
" l2_loss = slim.dist.l2_loss('teacher_bn5c_branch2b.output.1.tmp_3', 'depthwise_conv2d_11.tmp_0', student_program)\n",
" loss = l2_loss + avg_cost\n",
" opt = fluid.optimizer.Momentum(0.01, 0.9)\n",
" opt.minimize(loss)\n",
"exe.run(student_startup)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Train the model\n",
"\n",
"To keep this example fast, we use the simple MNIST dataset; Paddle's `paddle.dataset.mnist` package defines how to download and read MNIST.\n",
"The code is as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_reader = paddle.fluid.io.batch(\n",
" paddle.dataset.mnist.train(), batch_size=128, drop_last=True)\n",
"train_feeder = fluid.DataFeeder(['image', 'label'], fluid.CPUPlace(), student_program)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for data in train_reader():\n",
" acc1, acc5, loss_np = exe.run(student_program, feed=train_feeder.feed(data), fetch_list=[acc_top1.name, acc_top5.name, loss.name])\n",
" print(\"Acc1: {:.6f}, Acc5: {:.6f}, Loss: {:.6f}\".format(acc1.mean(), acc5.mean(), loss_np.mean()))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
......@@ -14,8 +14,7 @@ DATA_DIM = 224
THREAD = 16
BUF_SIZE = 10240
DATA_DIR = './data/ILSVRC2012/'
DATA_DIR = os.path.join(os.path.split(os.path.realpath(__file__))[0], DATA_DIR)
img_mean = np.array([0.485, 0.456, 0.406]).reshape((3, 1, 1))
......@@ -101,7 +100,11 @@ def distort_color(img):
def process_image(sample, mode, color_jitter, rotate):
img_path = sample[0]
    try:
        img = Image.open(img_path)
    except:
        print(img_path, "not exists!")
        return None
if mode == 'train':
if rotate: img = rotate_image(img)
img = random_crop(img, DATA_DIM)
......@@ -157,8 +160,7 @@ def _reader_creator(file_list,
for line in lines:
if mode == 'train' or mode == 'val':
img_path, label = line.split()
                img_path = os.path.join(data_dir, img_path)
yield img_path, int(label)
elif mode == 'test':
img_path = os.path.join(data_dir, line)
......
CMAKE_MINIMUM_REQUIRED(VERSION 3.2)
project(mkldnn_quantaware_demo CXX C)
set(DEMO_SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR})
set(DEMO_BINARY_DIR ${CMAKE_CURRENT_BINARY_DIR})
option(USE_GPU "Compile the inference code with the support CUDA GPU" OFF)
option(USE_PROFILER "Whether enable Paddle's profiler." OFF)
set(USE_SHARED OFF)
set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_CURRENT_SOURCE_DIR}/cmake")
if(NOT PADDLE_ROOT)
set(PADDLE_ROOT ${DEMO_SOURCE_DIR}/fluid_inference)
endif()
find_package(Fluid)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -O3")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -std=c++11")
if(USE_PROFILER)
find_package(Gperftools REQUIRED)
include_directories(${GPERFTOOLS_INCLUDE_DIR})
add_definitions(-DWITH_GPERFTOOLS)
endif()
include_directories(${CMAKE_CURRENT_SOURCE_DIR})
if(PADDLE_FOUND)
add_executable(inference sample_tester.cc)
target_link_libraries(inference
${PADDLE_LIBRARIES}
${PADDLE_THIRD_PARTY_LIBRARIES}
rt dl pthread)
if (mklml_FOUND)
target_link_libraries(inference "-L${THIRD_PARTY_ROOT}/install/mklml/lib -liomp5 -Wl,--as-needed")
endif()
else()
message(FATAL_ERROR "Cannot find PaddlePaddle Fluid under ${PADDLE_ROOT}")
endif()
# Deploying and Running Image Classification INT8 Quantized Models on CPU

## Overview

This document describes how to convert quantized models produced by PaddleSlim and deploy and run them on CPU. On Cascade Lake machines (e.g. Intel® Xeon® Gold 6271, 6248, X2XX series), inference with an INT8 model is typically 3-3.7x faster than with the FP32 model; on SkyLake machines (e.g. Intel® Xeon® Gold 6148, 8180, X1XX series), it is typically 1.5x faster.

The workflow is:

- Produce a quantized model: train and export a quantized model with PaddleSlim. Note that the parameters of the quantized operators should be within the INT8 range, but their type is still float.
- Convert the quantized model on CPU: use the DNNL library to convert the quantized model into an INT8 model.
- Deploy and run on CPU: deploy the sample and run inference on CPU.
## 1. Preparation

#### Install and build PaddleSlim

To install PaddleSlim, see the [official installation guide](https://paddlepaddle.github.io/PaddleSlim/install.html):
```
git clone https://github.com/PaddlePaddle/PaddleSlim.git
cd PaddleSlim
python setup.py install
```
#### Use in code

In your own test program, import Paddle and PaddleSlim as follows:
```
import paddle
import paddle.fluid as fluid
import paddleslim as slim
import numpy as np
```
## 2. Produce a quantized model with PaddleSlim

You can use PaddleSlim to produce either a quantization-aware-trained model or a post-training quantized model. If you only want to verify the deployment and inference flow, skip 2.1 and 2.2 and directly download the [mobilenetv2 post-training quant model](https://paddle-inference-dist.cdn.bcebos.com/quantizaiton/quant_post_models/mobilenetv2_quant_post.tgz) and the corresponding original FP32 model [mobilenetv2 fp32](https://paddle-inference-dist.cdn.bcebos.com/quantizaiton/fp32_models/mobilenetv2.tgz). To convert and deploy your own model, produce a quantized model following steps 2.1 and 2.2 below.

#### 2.1 Quantization-aware training

For the quantization-aware training workflow, see [Quantization-aware training of classification models](https://paddlepaddle.github.io/PaddleSlim/tutorials/quant_aware_demo/).

**Config parameters for quantization-aware training:**

- **quantize_op_types:** the operators currently supported for CPU quantization are `depthwise_conv2d`, `conv2d`, `fc`, `matmul`, `transpose2`, `reshape2`, `pool2d`, `scale`, `concat`. However, fake_quantize/fake_dequantize operators only need to be inserted around the first four op types during quantization-aware training; the scales of the remaining ops (`transpose2`, `reshape2`, `pool2d`, `scale`, `concat`) are obtained from the `out_threshold` attribute of other ops. Therefore `quantize_op_types` only needs to be set to `depthwise_conv2d`, `conv2d`, `fc`, `matmul`.
- **Other parameters:** see the [PaddleSlim quant_aware API](https://paddlepaddle.github.io/PaddleSlim/api/quantization_api/#quant_aware).
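Following the note above, a minimal CPU-targeted `quant_aware` config might look as follows (the key names follow PaddleSlim's quantization config; the bit widths and quantize types shown are common defaults and should be treated as illustrative, not authoritative):

```python
# Illustrative quant_aware config for CPU (DNNL) deployment.
# Only the four compute-heavy op types need fake_quantize/fake_dequantize;
# the remaining supported ops pick up scales from `out_threshold` attributes.
quant_config = {
    'weight_bits': 8,        # quantize weights to INT8
    'activation_bits': 8,    # quantize activations to INT8
    'quantize_op_types': ['depthwise_conv2d', 'conv2d', 'fc', 'matmul'],
    'weight_quantize_type': 'channel_wise_abs_max',
    'activation_quantize_type': 'moving_average_abs_max',
}

# Typically passed as the `config` argument of paddleslim.quant.quant_aware.
print(quant_config['quantize_op_types'])
```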
#### 2.2 Post-training quantization

For producing a post-training quantized model, see [Static post-training quantization of classification models](https://paddlepaddle.github.io/PaddleSlim/tutorials/quant_post_demo/#_1).
## 3. Convert the quantized model into a DNNL-optimized INT8 model

For CPU deployment, the saved quant model is run through a conversion script that removes the fake_quantize/fake_dequantize ops, performs operator fusion and optimization, and converts the model to INT8. The script lives in the Paddle repository at [save_quant_model.py](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/contrib/slim/tests/save_quant_model.py). Copy the script into this sample's directory (`/PATH_TO_PaddleSlim/demo/mkldnn_quant/`) and run:
```
python save_quant_model.py --quant_model_path=/PATH/TO/SAVE/FLOAT32/QUANT/MODEL --int8_model_save_path=/PATH/TO/SAVE/INT8/MODEL
```
**Parameters:**

- **quant_model_path:** input, required. The quant model produced by quantization-aware training.
- **int8_model_save_path:** output path for the final INT8 model after DNNL optimization and quantization. Note: `quant_model_path` must point to a quant model produced by PaddleSlim that contains fake_quant/fake_dequant ops.
- **ops_to_quantize:** a comma-separated list of op types to quantize. Optional; empty (the default) means all quantizable ops are quantized. For the image classification and NLP models listed in the benchmark, quantizing all quantizable ops currently gives the best accuracy and performance, so it is recommended to leave this parameter unset.
- **--op_ids_to_skip:** a comma-separated list of op IDs. Optional; empty by default. Ops in this list are not quantized and stay in FP32. To find the ID of a specific op, first run the script with the `--debug` option, open the generated file `int8_<number>_cpu_quantize_placement_pass.dot`, and locate the op to skip; its ID is in parentheses after the op name.
- **--debug:** generates a series of *.dot files containing the model graph after each conversion step. For a description of the DOT format, see [DOT](https://graphviz.gitlab.io/_pages/doc/info/lang.html). To open a `*.dot` file, use any Graphviz tool available on your system (e.g. `xdot` on Linux or `dot` on Windows); for documentation, see [Graphviz](http://www.graphviz.org/documentation/).
- **Notes:**
  - The ops currently supporting DNNL quantization are `conv2d`, `depthwise_conv2d`, `fc`, `matmul`, `pool2d`, `reshape2`, `transpose2`, `scale`, `concat`.
  - If you set `--op_ids_to_skip`, only pass the IDs of quantizable ops that you want to keep in FP32.
  - Quantizing all ops does not always give the best performance. For example, if an op would be a lone INT8 op with float32 ops before and after it, quantizing it requires a quantize step, the INT8 op itself, and then a dequantize step, which can end up slower than keeping the op in FP32. If the default settings perform poorly, check whether the model has isolated INT8 ops, try different `ops_to_quantize` combinations, or exclude some quantizable op IDs via `--op_ids_to_skip`, and run a few experiments to find the best setting.
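The cost of a lone INT8 op described above follows from the quantize/dequantize mechanics themselves. A small NumPy sketch of symmetric abs-max quantization, the kind of scheme INT8 inference is based on (a simplification for illustration, not the library's actual kernels):

```python
import numpy as np

def quantize(x, scale):
    # Map float values into [-127, 127] and round to INT8.
    return np.clip(np.round(x * scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) / scale

x = np.random.default_rng(1).standard_normal(1000).astype(np.float32)
scale = 127.0 / np.abs(x).max()   # symmetric abs-max scale

# A float32 -> INT8 -> float32 round trip: each hop costs time, which is
# why quantizing an op sandwiched between float32 ops may not pay off.
x_round_trip = dequantize(quantize(x, scale), scale)
max_err = np.abs(x - x_round_trip).max()
print(max_err < 1.0 / scale)  # → True: error bounded by one quantization step
```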
## 4. Inference

### 4.1 Data preprocessing and conversion

For accuracy and performance measurement, the data must first be converted to a binary format. The script below converts the full ILSVRC2012 val dataset; use `--local` to convert your own data. Run it from the directory containing Paddle. The script lives at [full_ILSVRC2012_val_preprocess.py](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py).
```
python Paddle/paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py --local --data_dir=/PATH/TO/USER/DATASET/ --output_file=/PATH/TO/SAVE/BINARY/FILE
```
Optional parameters:

- With no parameters set, the script downloads the ILSVRC2012_img_val dataset and converts it to a binary file.
- **local:** set to true to indicate that you will provide your own data.
- **data_dir:** your own data directory.
- **label_list:** a file of image path / image class pairs, similar to `val_list.txt`.
- **output_file:** path of the generated binary file.
- **data_dim:** height and width of the preprocessed images. Default: 224.

Your own dataset directory should be laid out as follows:
```
imagenet_user
├── val
│   ├── ILSVRC2012_val_00000001.jpg
│   ├── ILSVRC2012_val_00000002.jpg
| |── ...
└── val_list.txt
```
where `val_list.txt` should contain:
```
val/ILSVRC2012_val_00000001.jpg 0
val/ILSVRC2012_val_00000002.jpg 0
```
Note:

- Why convert the dataset to a binary file? Data preprocessing in Paddle (resize, crop, etc.) is done with the Python Image (PIL) module, and the trained models are based on Python-preprocessed images, but we found that testing in Python has large overhead, which degrades measured inference performance. To obtain good performance numbers for quantized-model inference, we use a C++ test, and C++ supports only libraries such as OpenCV, on which Paddle discourages depending. We therefore preprocess the images in Python, write them into a binary file, and read them back in the C++ test. If needed, you can modify the C++ test to read and preprocess the data directly; accuracy will not drop much. We also provide a Python test, `sample_tester.py`, for reference; comparing it with the C++ test `sample_tester.cc` shows the larger overhead of the Python test.
### 4.2 Deployment and inference

#### Deployment prerequisites

- Run `lscpu` to check the instruction sets supported by your machine.
- On CPU servers supporting `avx512_vnni` (e.g. Cascade Lake, model name Intel(R) Xeon(R) Gold X2XX), INT8 accuracy and performance are highest: INT8 inference is 3-3.7x faster than FP32.
- On CPU servers supporting `avx512` but not `avx512_vnni` (e.g. SkyLake, model name Intel(R) Xeon(R) Gold X1XX), INT8 performance is about 1.5x that of FP32.

#### Prepare the inference library

You can build the Paddle inference library from source or download a prebuilt one.

- To build from Paddle source, see [Build from source](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/advanced_guide/inference_deployment/inference/build_and_install_lib_cn.html#id12) and use release/2.0 or later.
- You can also download a released [inference library](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/advanced_guide/inference_deployment/inference/build_and_install_lib_cn.html) from the Paddle website. Choose the latest release or develop build of `ubuntu14.04_cpu_avx_mkl`.

Unpack the prepared inference library, rename it fluid_inference, and place it in the current directory (`/PATH_TO_PaddleSlim/demo/mkldnn_quant/`), or specify the Paddle inference library's location by setting PADDLE_ROOT when running cmake.

#### Build the sample

The sample lives under `demo/mkldnn_quant/` in PaddleSlim; the sample `sample_tester.cc` and the `cmake` folder needed for building are both in that directory.
```
cd /PATH/TO/PaddleSlim
cd demo/mkldnn_quant/
mkdir build
cd build
cmake -DPADDLE_ROOT=$PADDLE_ROOT ..
make -j
```
If you downloaded and unpacked the [inference library](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/advanced_guide/inference_deployment/inference/build_and_install_lib_cn.html) into the current directory, `-DPADDLE_ROOT` can be omitted, since it defaults to `demo/mkldnn_quant/fluid_inference`.
#### Run the test
```
# Bind threads to cores
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1
# Turbo Boost could be set to OFF using the command
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
# In the file run.sh, set `MODEL_DIR` to `/PATH/TO/FLOAT32/MODEL` or `/PATH/TO/SAVE/INT8/MODEL`
# In the file run.sh, set `DATA_FILE` to `/PATH/TO/SAVE/BINARY/FILE`
# For 1 thread performance:
./run.sh
# For 20 thread performance:
./run.sh -1 20
```
Runtime parameters:

- **infer_model:** model directory; note that the model parameters must currently be saved as separate files. Can be set to `PATH/TO/SAVE/INT8/MODEL` or `PATH/TO/SAVE/FLOAT32/MODEL`. No default.
- **infer_data:** path to the test data file. It must be a binary file produced by `full_ILSVRC2012_val_preprocess`.
- **batch_size:** inference batch size. Default: 50.
- **iterations:** number of batches to run. Default: 0, which means all batches in infer_data (image number / batch_size).
- **num_threads:** number of CPU threads used for inference. Default: one thread on a single core.
- **with_accuracy_layer:** whether the model is a test model containing an accuracy layer or an inference model without one. Default: true.
- **use_analysis:** whether to use `paddle::AnalysisConfig` to optimize, fuse, and accelerate the model. Default: false.

You can simply edit MODEL_DIR and DATA_DIR in `run.sh` under `/PATH_TO_PaddleSlim/demo/mkldnn_quant/` and run `./run.sh` for CPU inference.
### 4.3 Writing your own test

If you write your own test:

1. Testing an INT8 model

To test a converted INT8 model, `paddle::NativeConfig` is sufficient. In the demo, set `use_analysis` to `false`.

2. Testing an FP32 model

To test an FP32 model, use `paddle::AnalysisConfig` to optimize the original FP32 model (fusion, etc.) before testing. In the sample, simply set `use_analysis` to `true`. AnalysisConfig is set up as follows:
```
static void SetConfig(paddle::AnalysisConfig *cfg) {
  cfg->SetModel(FLAGS_infer_model);  // Required: the model to test
  cfg->DisableGpu();                 // Required: GPU must be disabled for CPU deployment
  cfg->EnableMKLDNN();               // Required: use MKL-DNN operators, faster than native ones
  cfg->SwitchIrOptim();              // For an original FP32 model, true optimizes and accelerates it
  cfg->SetCpuMathLibraryNumThreads(FLAGS_num_threads);  // Optional, default 1: number of threads
}
```
- In the provided sample, as long as `use_analysis` is set to true and `infer_model` is an original FP32 model, the AnalysisConfig settings above are applied and the FP32 model is optimized and accelerated by DNNL (fusion, etc.).
- If `infer_model` is an INT8 model, `use_analysis` has no effect, since the INT8 model has already been optimized and quantized.
- If `infer_model` is a quant model produced by PaddleSlim, `use_analysis` has no effect even when set to true, because the quant model contains fake_quantize/fake_dequantize ops and cannot be fused or optimized.

## 5. Accuracy and performance data

For INT8 model accuracy and performance results, see [Accuracy and performance of INT8 models deployed on CPU](https://github.com/PaddlePaddle/PaddleSlim/tree/develop/docs/zh_cn/tutorials/image_classification_mkldnn_quant_tutorial.md)

## FAQ

- For deploying and running NLP models on CPU, see the sample [ERNIE QUANT INT8 accuracy and performance reproduction](https://github.com/PaddlePaddle/benchmark/tree/master/Inference/c++/ernie/mkldnn)
- For the details of DNNL quantization, see [SLIM Quant for INT8 DNNL](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/contrib/slim/tests/README.md)
#!/bin/bash
MODEL_DIR=./mobilenetv2_INT8
DATA_FILE=/data/datasets/ImageNet_py/val.bin
num_threads=1
with_accuracy_layer=false
use_profile=true
ITERATIONS=0
./build/inference --logtostderr=1 \
--infer_model=${MODEL_DIR} \
--infer_data=${DATA_FILE} \
--batch_size=1 \
--num_threads=${num_threads} \
--iterations=${ITERATIONS} \
--with_accuracy_layer=${with_accuracy_layer} \
--use_profile=${use_profile} \
--use_analysis=false
from __future__ import absolute_import
from .mobilenet import MobileNet
from .resnet import ResNet34, ResNet50
from .resnet_vd import ResNet50_vd, ResNet101_vd
from .mobilenet_v2 import MobileNetV2_x0_25, MobileNetV2
from .pvanet import PVANet
from .slimfacenet import SlimFaceNet_A_x0_60, SlimFaceNet_B_x0_75, SlimFaceNet_C_x0_75
from .mobilenet_v3 import *
__all__ = [
"model_list", "MobileNet", "ResNet34", "ResNet50", "MobileNetV2", "PVANet",
"ResNet50_vd", "ResNet101_vd", "MobileNetV2_x0_25"
]
model_list = [
'MobileNet', 'ResNet34', 'ResNet50', 'MobileNetV2', 'PVANet',
'ResNet50_vd', "ResNet101_vd", "MobileNetV2_x0_25"
]
__all__ += mobilenet_v3.__all__
model_list += mobilenet_v3.__all__
......@@ -127,13 +127,14 @@ class MobileNet():
pool_stride=1,
pool_type='avg',
global_pooling=True)
with fluid.name_scope('last_fc'):
output = fluid.layers.fc(input=input,
size=class_dim,
act='softmax',
param_attr=ParamAttr(
initializer=MSRA(),
name="fc7_weights"),
bias_attr=ParamAttr(name="fc7_offset"))
return output
......
Subproject commit 56c6c3ae0e5c9ae6b9401a9446c629e513d4617f