Commit 3eb807ea authored by: C ceci3

update

Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
# PaddleSlim
Chinese | [English](README_en.md)
Documentation: https://paddlepaddle.github.io/PaddleSlim
# PaddleSlim
[![Documentation Status](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](https://paddleslim.readthedocs.io/en/latest/)
[![Documentation Status](https://img.shields.io/badge/中文文档-最新-brightgreen.svg)](https://paddleslim.readthedocs.io/zh_CN/latest/)
[![License](https://img.shields.io/badge/license-Apache%202-blue.svg)](LICENSE)
PaddleSlim is a model-compression toolkit that bundles a range of compression strategies: model pruning, fixed-point quantization, knowledge distillation, hyperparameter search, and neural architecture search.
......@@ -16,36 +18,213 @@ PaddleSlim supports developers' innovations on model-compression strategies through low-level capabilities, technical consulting, and business scenarios
## Features
- Model pruning
  - Uniform pruning of convolution channels
  - Sensitivity-based pruning of convolution channels
  - Automatic pruning based on evolutionary algorithms
<table style="width:100%;" cellpadding="2" cellspacing="0" border="1" bordercolor="#000000">
<tbody>
<tr>
<td style="text-align:center;">
<span style="font-size:18px;">功能模块</span>
</td>
<td style="text-align:center;">
<span style="font-size:18px;">算法</span>
</td>
<td style="text-align:center;">
<span style="font-size:18px;">教程</span><span style="font-size:18px;">与文档</span>
</td>
</tr>
<tr>
<td style="text-align:center;">
<span style="font-size:12px;">剪裁</span><span style="font-size:12px;"></span><br />
</td>
<td>
<ul>
<li>
Sensitivity&nbsp;&nbsp;Pruner:&nbsp;<a href="https://arxiv.org/abs/1608.08710" target="_blank"><span style="font-family:&quot;font-size:14px;background-color:#FFFFFF;"><span style="font-family:&quot;font-size:14px;background-color:#FFFFFF;">Li H , Kadav A , Durdanovic I , et al. Pruning Filters for Efficient ConvNets[J]. 2016.</span></span></a>
</li>
<li>
AMC Pruner:&nbsp;<a href="https://arxiv.org/abs/1802.03494" target="_blank"><span style="font-family:&quot;font-size:13px;background-color:#FFFFFF;">He, Yihui , et al. "AMC: AutoML for Model Compression and Acceleration on Mobile Devices." (2018).</span></a>
</li>
<li>
FFPGM Pruner:&nbsp;<a href="https://arxiv.org/abs/1811.00250" target="_blank"><span style="font-family:&quot;font-size:14px;background-color:#FFFFFF;">He Y , Liu P , Wang Z , et al. Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration[C]// IEEE/CVF Conference on Computer Vision &amp; Pattern Recognition. IEEE, 2019.</span></a>
</li>
<li>
Slim Pruner:<span style="background-color:#FFFDFA;">&nbsp;<a href="https://arxiv.org/pdf/1708.06519.pdf" target="_blank"><span style="font-family:&quot;font-size:14px;background-color:#FFFFFF;">Liu Z , Li J , Shen Z , et al. Learning Efficient Convolutional Networks through Network Slimming[J]. 2017.</span></a></span>
</li>
<li>
<span style="background-color:#FFFDFA;">Opt Slim Pruner:&nbsp;<a href="https://arxiv.org/pdf/1708.06519.pdf" target="_blank"><span style="font-family:&quot;font-size:14px;background-color:#FFFFFF;">Ye Y , You G , Fwu J K , et al. Channel Pruning via Optimal Thresholding[J]. 2020.</span></a><br />
</span>
</li>
</ul>
</td>
<td>
<ul>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/api_cn/prune_api.rst" target="_blank">剪裁模块API文档</a>
</li>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/quick_start/pruning_tutorial.md" target="_blank">剪裁快速开始示例</a>
</li>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/tutorials/image_classification_sensitivity_analysis_tutorial.md" target="_blank">分类模敏感度分析教程</a>
</li>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/tutorials/paddledetection_slim_pruing_tutorial.md" target="_blank">检测模型剪裁教程</a>
</li>
<li>
<span id="__kindeditor_bookmark_start_313__"></span><a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/tutorials/paddledetection_slim_prune_dist_tutorial.md" target="_blank">检测模型剪裁+蒸馏教程</a>
</li>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/tutorials/paddledetection_slim_sensitivy_tutorial.md" target="_blank">检测模型敏感度分析教程</a>
</li>
</ul>
</td>
</tr>
<tr>
<td style="text-align:center;">
量化
</td>
<td>
<ul>
<li>
Quantization Aware Training:&nbsp;<a href="https://arxiv.org/abs/1806.08342" target="_blank"><span style="font-family:&quot;font-size:14px;background-color:#FFFFFF;">Krishnamoorthi R . Quantizing deep convolutional networks for efficient inference: A whitepaper[J]. 2018.</span></a>
</li>
<li>
Post Training&nbsp;<span>Quantization&nbsp;</span><a href="http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf" target="_blank">原理</a>
</li>
<li>
Embedding&nbsp;<span>Quantization:&nbsp;<a href="https://arxiv.org/pdf/1603.01025.pdf" target="_blank"><span style="font-family:&quot;font-size:14px;background-color:#FFFFFF;">Miyashita D , Lee E H , Murmann B . Convolutional Neural Networks using Logarithmic Data Representation[J]. 2016.</span></a></span>
</li>
<li>
DSQ: <a href="https://arxiv.org/abs/1908.05033" target="_blank"><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">Gong, Ruihao, et al. "Differentiable soft quantization: Bridging full-precision and low-bit neural networks."&nbsp;</span><i>Proceedings of the IEEE International Conference on Computer Vision</i><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">. 2019.</span></a>
</li>
<li>
PACT:&nbsp; <a href="https://arxiv.org/abs/1805.06085" target="_blank"><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">Choi, Jungwook, et al. "Pact: Parameterized clipping activation for quantized neural networks."&nbsp;</span><i>arXiv preprint arXiv:1805.06085</i><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">&nbsp;(2018).</span></a>
</li>
</ul>
</td>
<td>
<ul>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/api_cn/quantization_api.rst" target="_blank">量化API文档</a>
</li>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/quick_start/quant_aware_tutorial.md" target="_blank">量化训练快速开始示例</a>
</li>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/quick_start/quant_post_static_tutorial.md" target="_blank">静态离线量化快速开始示例</a>
</li>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/tutorials/paddledetection_slim_quantization_tutorial.md" target="_blank">检测模型量化教程</a>
</li>
</ul>
</td>
</tr>
<tr>
<td style="text-align:center;">
蒸馏
</td>
<td>
<ul>
<li>
<span>Knowledge Distillation</span>:&nbsp;<a href="https://arxiv.org/abs/1503.02531" target="_blank"><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network."&nbsp;</span><i>arXiv preprint arXiv:1503.02531</i><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">&nbsp;(2015).</span></a>
</li>
<li>
FSP <span>Knowledge Distillation</span>:&nbsp;&nbsp;<a href="http://openaccess.thecvf.com/content_cvpr_2017/papers/Yim_A_Gift_From_CVPR_2017_paper.pdf" target="_blank"><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">Yim, Junho, et al. "A gift from knowledge distillation: Fast optimization, network minimization and transfer learning."&nbsp;</span><i>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</i><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">. 2017.</span></a>
</li>
<li>
YOLO Knowledge Distillation:&nbsp;&nbsp;<a href="http://openaccess.thecvf.com/content_ECCVW_2018/papers/11133/Mehta_Object_detection_at_200_Frames_Per_Second_ECCVW_2018_paper.pdf" target="_blank"><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">Mehta, Rakesh, and Cemalettin Ozturk. "Object detection at 200 frames per second."&nbsp;</span><i>Proceedings of the European Conference on Computer Vision (ECCV)</i><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">. 2018.</span></a>
</li>
<li>
DML:&nbsp;<a href="https://arxiv.org/abs/1706.00384" target="_blank"><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">Zhang, Ying, et al. "Deep mutual learning."&nbsp;</span><i>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</i><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">. 2018.</span></a>
</li>
</ul>
</td>
<td>
<ul>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/api_cn/single_distiller_api.rst" target="_blank">蒸馏API文档</a>
</li>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/quick_start/distillation_tutorial.md" target="_blank">蒸馏快速开始示例</a>
</li>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/tutorials/paddledetection_slim_distillation_tutorial.md" target="_blank">检测模型蒸馏教程</a>
</li>
</ul>
</td>
</tr>
<tr>
<td style="text-align:center;">
模型结构搜索(NAS)
</td>
<td>
<ul>
<li>
Simulate Anneal NAS:&nbsp;<a href="https://arxiv.org/pdf/2005.04117.pdf" target="_blank"><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">Abdelhamed, Abdelrahman, et al. "Ntire 2020 challenge on real image denoising: Dataset, methods and results."&nbsp;</span><i>The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops</i><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">. Vol. 2. 2020.</span></a>
</li>
<li>
DARTS <a href="https://arxiv.org/abs/1806.09055" target="_blank"><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">Liu, Hanxiao, Karen Simonyan, and Yiming Yang. "Darts: Differentiable architecture search."&nbsp;</span><i>arXiv preprint arXiv:1806.09055</i><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">&nbsp;(2018).</span></a>
</li>
<li>
PC-DARTS <a href="https://arxiv.org/abs/1907.05737" target="_blank"><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">Xu, Yuhui, et al. "Pc-darts: Partial channel connections for memory-efficient differentiable architecture search."&nbsp;</span><i>arXiv preprint arXiv:1907.05737</i><span style="color:#222222;font-family:Arial, sans-serif;font-size:13px;background-color:#FFFFFF;">&nbsp;(2019).</span></a>
</li>
<li>
OneShot&nbsp;
</li>
</ul>
</td>
<td>
<ul>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/api_cn/nas_api.rst" target="_blank">NAS API文档</a>
</li>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/api_cn/darts.rst" target="_blank">DARTS API文档</a>
</li>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/quick_start/nas_tutorial.md" target="_blank">NAS快速开始示例</a>
</li>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/tutorials/paddledetection_slim_nas_tutorial.md" target="_blank">检测模型NAS教程</a>
</li>
<li>
<a href="https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/tutorials/sanas_darts_space.md" target="_blank">SANAS进阶版实验教程-压缩DARTS产出模型</a>
</li>
</ul>
</td>
</tr>
</tbody>
</table>
- Quantization
  - Quantization-aware training (training aware)
  - Post-training quantization (post training)
- Knowledge distillation
  - Single-process knowledge distillation
  - Multi-process distributed knowledge distillation
- Neural architecture search (NAS)
  - Lightweight architecture search based on evolutionary algorithms
  - One-Shot architecture search
  - FLOPS / hardware-latency constraints
  - Latency estimation on multiple platforms
  - User-defined search algorithms and search spaces

## Installation

```bash
pip install paddleslim -i https://pypi.tuna.tsinghua.edu.cn/simple
```

### Matching quantization to your Paddle version

For inference on ARM and GPU, any version pairing below works; for inference on CPU, use Paddle 2.0 with the matching PaddleSlim 1.1.0.

- For Paddle 1.7.x, install PaddleSlim 1.0.1:
```bash
pip install paddleslim==1.0.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
```
- For Paddle 1.8.x, install PaddleSlim 1.1.1:
```bash
pip install paddleslim==1.1.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
```
- For Paddle 2.0.x, install PaddleSlim 1.1.0:
```bash
pip install paddleslim==1.1.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
```
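As a quick sanity check after installing, the small helper below (hypothetical, not part of PaddleSlim) maps the installed Paddle version to the PaddleSlim pin listed above:

```python
# Hypothetical helper, not part of PaddleSlim: print the pip command for the
# PaddleSlim release matching the installed Paddle version (see list above).
import paddle

PIN_BY_PADDLE = {"1.7": "1.0.1", "1.8": "1.1.1", "2.0": "1.1.0"}

def matching_pin():
    major_minor = ".".join(paddle.__version__.split(".")[:2])
    return PIN_BY_PADDLE.get(major_minor)

pin = matching_pin()
if pin is None:
    print("No known PaddleSlim pin for Paddle " + paddle.__version__)
else:
    print("pip install paddleslim==" + pin)
```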
## Usage
- [Quick Start](docs/zh_cn/quick_start): simple examples that show how to get started with PaddleSlim.
......@@ -97,3 +276,10 @@ pip install paddleslim -i https://pypi.org/simple
| RK3288 | [-23%]() | +0.07% |
| Android cellphone | [-20%]() | +0.16% |
| iPhone 6s | [-17%]() | +0.32% |
## License
This project is released under the [Apache 2.0 license](LICENSE).
## Contributing
Contributions to PaddleSlim are very welcome, and we greatly appreciate your feedback.
Chinese | [English](README.md)
Documentation: https://paddlepaddle.github.io/PaddleSlim
# PaddleSlim
PaddleSlim is a model-compression toolkit that bundles a range of compression strategies: model pruning, fixed-point quantization, knowledge distillation, hyperparameter search, and neural architecture search.
For business users, PaddleSlim provides complete compression solutions for vision scenarios such as image classification, detection, and segmentation, and keeps exploring compression for NLP models. It also provides, and keeps improving, benchmarks of each compression strategy on classic open-source tasks for reference.
For researchers and developers of compression algorithms, PaddleSlim offers low-level helper interfaces for each strategy, making it convenient to reproduce, investigate, and apply methods from recent papers. PaddleSlim supports developers' innovations on model-compression strategies through low-level capabilities, technical consulting, and business scenarios.
## Features
- Model pruning
  - Uniform pruning of convolution channels
  - Sensitivity-based pruning of convolution channels
  - Automatic pruning based on evolutionary algorithms
- Quantization
  - Quantization-aware training (training aware)
  - Post-training quantization (post training)
- Knowledge distillation
  - Single-process knowledge distillation
  - Multi-process distributed knowledge distillation
- Neural architecture search (NAS)
  - Lightweight architecture search based on evolutionary algorithms
  - One-Shot architecture search
  - FLOPS / hardware-latency constraints
  - Latency estimation on multiple platforms
  - User-defined search algorithms and search spaces
## Installation
Dependencies:
Paddle >= 1.7.0
```bash
pip install paddleslim -i https://pypi.org/simple
```
## 使用
- [快速开始](docs/zh_cn/quick_start):通过简单示例介绍如何快速使用PaddleSlim。
- [进阶教程](docs/zh_cn/tutorials):PaddleSlim高阶教程。
- [模型库](docs/zh_cn/model_zoo.md):各个压缩策略在图像分类、目标检测和图像语义分割模型上的实验结论,包括模型精度、预测速度和可供下载的预训练模型。
- [API文档](https://paddlepaddle.github.io/PaddleSlim/api_cn/index.html)
- [算法原理](https://paddlepaddle.github.io/PaddleSlim/algo/algo.html): 介绍量化、剪枝、蒸馏、NAS的基本知识背景。
- [Paddle检测库](https://github.com/PaddlePaddle/PaddleDetection/tree/master/slim):介绍如何在检测库中使用PaddleSlim。
- [Paddle分割库](https://github.com/PaddlePaddle/PaddleSeg/tree/develop/slim):介绍如何在分割库中使用PaddleSlim。
- [PaddleLite](https://paddlepaddle.github.io/Paddle-Lite/):介绍如何使用预测库PaddleLite部署PaddleSlim产出的模型。
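As a taste of what the quick-start covers, the sketch below mirrors the `Pruner.prune(...)` calls used by the demo scripts later in this commit. It assumes a network with a parameter named `conv1_weights` has already been built in the default program; the parameter name and ratio are illustrative only:

```python
import paddle.fluid as fluid
from paddleslim.analysis import flops
from paddleslim.prune import Pruner

# Prune 30% of the channels of one convolution in the default program,
# mirroring the Pruner.prune(...) usage in the demo code further below.
place = fluid.CPUPlace()
pruner = Pruner()
pruned_program = pruner.prune(
    fluid.default_main_program(),
    fluid.global_scope(),
    params=["conv1_weights"],  # illustrative parameter name
    ratios=[0.3],
    place=place)
print("FLOPs after pruning:", flops(pruned_program))
```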
## Results of Selected Compression Strategies
### Classification models
Data: ImageNet2012; model: MobileNetV1.
| Strategy | Accuracy gain (baseline: 70.91%) | Model size (baseline: 17.0M) |
|:---:|:---:|:---:|
| Knowledge distillation (ResNet50) | [+1.06%]() | - |
| Knowledge distillation (ResNet50) + int8 quantization-aware training | [+1.10%]() | [-71.76%]() |
| Pruning (FLOPs -50%) + int8 quantization-aware training | [-1.71%]() | [-86.47%]() |
### Object detection models
#### Data: Pascal VOC; model: MobileNet-V1-YOLOv3
| Method | mAP (baseline: 76.2%) | Model size (baseline: 94MB) |
| :---------------------: | :------------: | :------------:|
| Knowledge distillation (ResNet34-YOLOv3) | [+2.8%](#) | - |
| Pruning (FLOPs -52.88%) | [+1.4%]() | [-67.76%]() |
| Knowledge distillation (ResNet34-YOLOv3) + pruning (FLOPs -69.57%) | [+2.6%]() | [-67.00%]() |
#### Data: COCO; model: MobileNet-V1-YOLOv3
| Method | mAP (baseline: 29.3%) | Model size |
| :---------------------: | :------------: | :------:|
| Knowledge distillation (ResNet34-YOLOv3) | [+2.1%]() | - |
| Knowledge distillation (ResNet34-YOLOv3) + pruning (FLOPs -67.56%) | [-0.3%]() | [-66.90%]() |
### NAS
Data: ImageNet2012; model: MobileNetV2.
| Hardware | Inference latency | Top-1 accuracy (baseline: 71.90%) |
|:---------------:|:---------:|:--------------------:|
| RK3288 | [-23%]() | +0.07% |
| Android cellphone | [-20%]() | +0.16% |
| iPhone 6s | [-17%]() | +0.32% |
......@@ -56,6 +56,30 @@ Paddle >= 1.7.0
pip install paddleslim -i https://pypi.org/simple
```
### Quantization
To use quantization in PaddleSlim, install the PaddleSlim version that matches your Paddle release, as listed below.
Quantized models run on ARM and GPU with any PaddleSlim version; for CPU inference, install PaddleSlim 1.1.0.
- For Paddle 1.7, install PaddleSlim 1.0.1:
```bash
pip install paddleslim==1.0.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
```
- For Paddle 1.8, install PaddleSlim 1.1.1:
```bash
pip install paddleslim==1.1.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
```
- For Paddle 2.0, install PaddleSlim 1.1.0:
```bash
pip install paddleslim==1.1.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
```
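Once a matching pair is installed, quantization-aware training wraps existing programs roughly as follows. This is a minimal sketch: `train_program` and `test_program` are assumed to be fluid Programs you have already built, and `quant_aware`/`convert` are used with their default configs:

```python
import paddle.fluid as fluid
from paddleslim.quant import quant_aware, convert

# Sketch only: train_program / test_program are assumed to exist already.
place = fluid.CUDAPlace(0)
quant_train_program = quant_aware(train_program, place, for_test=False)
quant_test_program = quant_aware(test_program, place, for_test=True)
# ... run quantization-aware training on quant_train_program, then freeze
# the evaluation program for inference:
inference_program = convert(quant_test_program, place)
```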
## Usage
- [QuickStart](https://paddlepaddle.github.io/PaddleSlim/quick_start/index_en.html): Introduce how to use PaddleSlim by simple examples.
......
# Deep Mutual Learning (DML)
This example shows how to train models with PaddleSlim's Deep Mutual Learning (DML) method; for the algorithm itself, see the paper [Deep Mutual Learning](https://arxiv.org/abs/1706.00384). A minimal sketch of the two-model DML loss follows.
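For intuition, the sketch below writes out the two-model DML loss from the paper using plain fluid ops; PaddleSlim's `DML` class (used in `dml_train.py` below) generalizes this to any number of models:

```python
import paddle.fluid as fluid

def dml_losses(logits1, logits2, label):
    """Two-model DML loss sketch: each model minimizes its own cross-entropy
    plus a KL term that distills from its peer's predictions."""
    p1 = fluid.layers.softmax(logits1)
    p2 = fluid.layers.softmax(logits2)
    ce1 = fluid.layers.mean(fluid.layers.cross_entropy(p1, label))
    ce2 = fluid.layers.mean(fluid.layers.cross_entropy(p2, label))
    # kldiv_loss expects log-probabilities as its first argument.
    kl12 = fluid.layers.kldiv_loss(fluid.layers.log(p1), p2, reduction='mean')
    kl21 = fluid.layers.kldiv_loss(fluid.layers.log(p2), p1, reduction='mean')
    return ce1 + kl12, ce2 + kl21
```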
## Dataset
The example trains on the CIFAR-100 dataset. You can let the script download it automatically at startup,
or download the [dataset](https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz) yourself and place it under `./dataset/cifar100` in the current directory.
## Launch commands
Single-card training, using GPU 0 as an example:
```bash
CUDA_VISIBLE_DEVICES=0 python dml_train.py
```
Multi-card training, using GPUs 0-3 as an example:
```bash
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog dml_train.py --use_parallel=True
```
## Results
The results below were obtained with the default configuration (learning rate, optimizer, and so on); only the combination of models trained with DML was varied.
To improve the results further, try [more training tricks](https://arxiv.org/abs/1812.01187) or train more models together in a single DML run.
| Dataset | Models | Accuracy trained alone | Accuracy with DML |
| ------ | ------ | ------ | ------ |
| CIFAR100 | MobileNet X 2 | 73.65% | 76.34% (+2.69%) |
| CIFAR100 | MobileNet X 4 | 73.65% | 76.56% (+2.91%) |
| CIFAR100 | MobileNet + ResNet50 | 73.65%/76.52% | 76.00%/77.80% (+2.35%/+1.28%) |
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from PIL import Image
from PIL import ImageOps
import os
import math
import random
import tarfile
import functools
import numpy as np
from PIL import ImageEnhance
import paddle
# for python2/python3 compatibility
try:
    import cPickle
except ImportError:
    import _pickle as cPickle
IMAGE_SIZE = 32
IMAGE_DEPTH = 3
CIFAR_MEAN = [0.5070751592371323, 0.48654887331495095, 0.4409178433670343]
CIFAR_STD = [0.2673342858792401, 0.2564384629170883, 0.27615047132568404]
URL_PREFIX = 'https://www.cs.toronto.edu/~kriz/'
CIFAR100_URL = URL_PREFIX + 'cifar-100-python.tar.gz'
CIFAR100_MD5 = 'eb9058c3a382ffc7106e4002c42a8d85'
paddle.dataset.common.DATA_HOME = "dataset/"
def preprocess(sample, is_training):
image_array = sample.reshape(IMAGE_DEPTH, IMAGE_SIZE, IMAGE_SIZE)
rgb_array = np.transpose(image_array, (1, 2, 0))
img = Image.fromarray(rgb_array, 'RGB')
if is_training:
        # pad, random crop, random_flip_left_right, random_rotation
img = ImageOps.expand(img, (4, 4, 4, 4), fill=0)
left_top = np.random.randint(8, size=2)
img = img.crop((left_top[1], left_top[0], left_top[1] + IMAGE_SIZE,
left_top[0] + IMAGE_SIZE))
if np.random.randint(2):
img = img.transpose(Image.FLIP_LEFT_RIGHT)
random_angle = np.random.randint(-15, 15)
img = img.rotate(random_angle, Image.NEAREST)
img = np.array(img).astype(np.float32)
img_float = img / 255.0
img = (img_float - CIFAR_MEAN) / CIFAR_STD
img = np.transpose(img, (2, 0, 1))
return img
def reader_generator(datasets, batch_size, is_training, is_shuffle):
def read_batch(datasets):
if is_shuffle:
random.shuffle(datasets)
for im, label in datasets:
im = preprocess(im, is_training)
yield im, [int(label)]
def reader():
batch_data = []
batch_label = []
for data in read_batch(datasets):
batch_data.append(data[0])
batch_label.append(data[1])
if len(batch_data) == batch_size:
batch_data = np.array(batch_data, dtype='float32')
batch_label = np.array(batch_label, dtype='int64')
batch_out = [batch_data, batch_label]
yield batch_out
batch_data = []
batch_label = []
return reader
def cifar100_reader(file_name, data_name, is_shuffle):
with tarfile.open(file_name, mode='r') as f:
names = [
each_item.name for each_item in f if data_name in each_item.name
]
names.sort()
datasets = []
for name in names:
print("Reading file " + name)
            try:
                batch = cPickle.load(
                    f.extractfile(name), encoding='iso-8859-1')
            except TypeError:
                # Python 2's cPickle.load does not accept an `encoding` argument.
                batch = cPickle.load(f.extractfile(name))
data = batch['data']
labels = batch.get('labels', batch.get('fine_labels', None))
assert labels is not None
dataset = zip(data, labels)
datasets.extend(dataset)
if is_shuffle:
random.shuffle(datasets)
return datasets
def train_valid(batch_size, is_train, is_shuffle):
name = 'train' if is_train else 'test'
datasets = cifar100_reader(
paddle.dataset.common.download(CIFAR100_URL, 'cifar', CIFAR100_MD5),
name, is_shuffle)
reader = reader_generator(datasets, batch_size, is_train, is_shuffle)
return reader
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import sys
import argparse
import functools
import logging
import paddle.fluid as fluid
from paddle.fluid.dygraph.base import to_variable
from paddleslim.common import AvgrageMeter, get_logger
from paddleslim.dist import DML
from paddleslim.models.dygraph import MobileNetV1
import cifar100_reader as reader
sys.path[0] = os.path.join(os.path.dirname(__file__), os.path.pardir)
from utility import add_arguments, print_arguments
logger = get_logger(__name__, level=logging.INFO)
parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
# yapf: disable
add_arg('log_freq', int, 100, "Log frequency.")
add_arg('batch_size', int, 256, "Minibatch size.")
add_arg('init_lr', float, 0.1, "The start learning rate.")
add_arg('use_gpu', bool, True, "Whether use GPU.")
add_arg('epochs', int, 200, "Epoch number.")
add_arg('class_num', int, 100, "Class number of dataset.")
add_arg('trainset_num', int, 50000, "Images number of trainset.")
add_arg('model_save_dir', str, 'saved_models', "The path to save model.")
add_arg('use_multiprocess', bool, True, "Whether use multiprocess reader.")
add_arg('use_parallel', bool, False, "Whether to use data parallel mode to train the model.")
# yapf: enable
def create_optimizer(models, args):
device_num = fluid.dygraph.parallel.Env().nranks
step = int(args.trainset_num / (args.batch_size * device_num))
epochs = [60, 120, 180]
bd = [step * e for e in epochs]
lr = [args.init_lr * (0.1**i) for i in range(len(bd) + 1)]
optimizers = []
for cur_model in models:
learning_rate = fluid.dygraph.PiecewiseDecay(bd, lr, 0)
opt = fluid.optimizer.MomentumOptimizer(
learning_rate,
0.9,
parameter_list=cur_model.parameters(),
use_nesterov=True,
regularization=fluid.regularizer.L2DecayRegularizer(5e-4))
optimizers.append(opt)
return optimizers
def create_reader(place, args):
train_reader = reader.train_valid(
batch_size=args.batch_size, is_train=True, is_shuffle=True)
valid_reader = reader.train_valid(
batch_size=args.batch_size, is_train=False, is_shuffle=False)
if args.use_parallel:
train_reader = fluid.contrib.reader.distributed_batch_reader(
train_reader)
train_loader = fluid.io.DataLoader.from_generator(
capacity=1024,
return_list=True,
use_multiprocess=args.use_multiprocess)
valid_loader = fluid.io.DataLoader.from_generator(
capacity=1024,
return_list=True,
use_multiprocess=args.use_multiprocess)
train_loader.set_batch_generator(train_reader, places=place)
valid_loader.set_batch_generator(valid_reader, places=place)
return train_loader, valid_loader
def train(train_loader, dml_model, dml_optimizer, args):
dml_model.train()
costs = [AvgrageMeter() for i in range(dml_model.model_num)]
accs = [AvgrageMeter() for i in range(dml_model.model_num)]
for step_id, (images, labels) in enumerate(train_loader):
images, labels = to_variable(images), to_variable(labels)
batch_size = images.shape[0]
logits = dml_model.forward(images)
precs = [
fluid.layers.accuracy(
input=l, label=labels, k=1) for l in logits
]
losses = dml_model.loss(logits, labels)
dml_optimizer.minimize(losses)
for i in range(dml_model.model_num):
accs[i].update(precs[i].numpy(), batch_size)
costs[i].update(losses[i].numpy(), batch_size)
model_names = dml_model.full_name()
if step_id % args.log_freq == 0:
log_msg = "Train Step {}".format(step_id)
for model_id, (cost, acc) in enumerate(zip(costs, accs)):
log_msg += ", {} loss: {:.6f} acc: {:.6f}".format(
model_names[model_id], cost.avg[0], acc.avg[0])
logger.info(log_msg)
return costs, accs
def valid(valid_loader, dml_model, args):
dml_model.eval()
costs = [AvgrageMeter() for i in range(dml_model.model_num)]
accs = [AvgrageMeter() for i in range(dml_model.model_num)]
for step_id, (images, labels) in enumerate(valid_loader):
images, labels = to_variable(images), to_variable(labels)
batch_size = images.shape[0]
logits = dml_model.forward(images)
precs = [
fluid.layers.accuracy(
input=l, label=labels, k=1) for l in logits
]
losses = dml_model.loss(logits, labels)
for i in range(dml_model.model_num):
accs[i].update(precs[i].numpy(), batch_size)
costs[i].update(losses[i].numpy(), batch_size)
model_names = dml_model.full_name()
if step_id % args.log_freq == 0:
log_msg = "Valid Step{} ".format(step_id)
for model_id, (cost, acc) in enumerate(zip(costs, accs)):
log_msg += ", {} loss: {:.6f} acc: {:.6f}".format(
model_names[model_id], cost.avg[0], acc.avg[0])
logger.info(log_msg)
return costs, accs
def main(args):
if not args.use_gpu:
place = fluid.CPUPlace()
elif not args.use_parallel:
place = fluid.CUDAPlace(0)
else:
place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id)
with fluid.dygraph.guard(place):
# 1. Define data reader
train_loader, valid_loader = create_reader(place, args)
# 2. Define neural network
models = [
MobileNetV1(class_dim=args.class_num),
MobileNetV1(class_dim=args.class_num)
]
optimizers = create_optimizer(models, args)
# 3. Use PaddleSlim DML strategy
dml_model = DML(models, args.use_parallel)
dml_optimizer = dml_model.opt(optimizers)
# 4. Train your network
save_parameters = (not args.use_parallel) or (
args.use_parallel and fluid.dygraph.parallel.Env().local_rank == 0)
best_valid_acc = [0] * dml_model.model_num
for epoch_id in range(args.epochs):
current_step_lr = dml_optimizer.get_lr()
lr_msg = "Epoch {}".format(epoch_id)
for model_id, lr in enumerate(current_step_lr):
lr_msg += ", {} lr: {:.6f}".format(
dml_model.full_name()[model_id], lr)
logger.info(lr_msg)
train_losses, train_accs = train(train_loader, dml_model,
dml_optimizer, args)
valid_losses, valid_accs = valid(valid_loader, dml_model, args)
for i in range(dml_model.model_num):
if valid_accs[i].avg[0] > best_valid_acc[i]:
best_valid_acc[i] = valid_accs[i].avg[0]
if save_parameters:
fluid.save_dygraph(
models[i].state_dict(),
os.path.join(args.model_save_dir,
dml_model.full_name()[i],
"best_model"))
                summary_msg = "Epoch {} {}: valid_loss {:.6f}, valid_acc {:.6f}, best_valid_acc {:.6f}"
                logger.info(
                    summary_msg.format(epoch_id,
                                       dml_model.full_name()[i], valid_losses[
                                           i].avg[0], valid_accs[i].avg[0],
                                       best_valid_acc[i]))
if __name__ == '__main__':
args = parser.parse_args()
print_arguments(args)
main(args)
......@@ -116,8 +116,8 @@ def compress(args):
fluid.io.load_vars(exe, args.pretrained_model, predicate=if_exist)
    val_reader = paddle.fluid.io.batch(val_reader, batch_size=args.batch_size)
    train_reader = paddle.fluid.io.batch(
train_reader, batch_size=args.batch_size, drop_last=True)
train_feeder = feeder = fluid.DataFeeder([image, label], place)
......
......@@ -34,12 +34,12 @@ add_arg('config_file', str, None, "The config file for comp
model_list = [m for m in dir(models) if "__" not in m]
ratiolist = [
    # [0.06, 0.0, 0.09, 0.03, 0.09, 0.02, 0.05, 0.03, 0.0, 0.07, 0.07, 0.05, 0.08],
    # [0.08, 0.02, 0.03, 0.13, 0.1, 0.06, 0.03, 0.04, 0.14, 0.02, 0.03, 0.02, 0.01],
]


def save_model(args, exe, train_prog, eval_prog, info):
model_path = os.path.join(args.model_save_dir, args.model, str(info))
if not os.path.isdir(model_path):
os.makedirs(model_path)
......@@ -58,29 +58,31 @@ def piecewise_decay(args):
regularization=fluid.regularizer.L2Decay(args.l2_decay))
return optimizer
def cosine_decay(args):
step = int(math.ceil(float(args.total_images) / args.batch_size))
learning_rate = fluid.layers.cosine_decay(
        learning_rate=args.lr, step_each_epoch=step, epochs=args.num_epochs)
optimizer = fluid.optimizer.Momentum(
learning_rate=learning_rate,
momentum=args.momentum_rate,
regularization=fluid.regularizer.L2Decay(args.l2_decay))
return optimizer
def create_optimizer(args):
if args.lr_strategy == "piecewise_decay":
return piecewise_decay(args)
elif args.lr_strategy == "cosine_decay":
return cosine_decay(args)
def compress(args):
    class_dim = 1000
    image_shape = "3,224,224"
image_shape = [int(m) for m in image_shape.split(",")]
    assert args.model in model_list, "{} is not in lists: {}".format(
        args.model, model_list)
image = fluid.layers.data(name='image', shape=image_shape, dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
# model definition
......@@ -98,18 +100,22 @@ def compress(args):
exe.run(fluid.default_startup_program())
if args.pretrained_model:
def if_exist(var):
            exist = os.path.exists(
                os.path.join(args.pretrained_model, var.name))
            print("exist", exist)
return exist
#fluid.io.load_vars(exe, args.pretrained_model, predicate=if_exist)
    val_reader = paddle.fluid.io.batch(reader.val(), batch_size=args.batch_size)
    train_reader = paddle.fluid.io.batch(
reader.train(), batch_size=args.batch_size, drop_last=True)
train_feeder = feeder = fluid.DataFeeder([image, label], place)
    val_feeder = feeder = fluid.DataFeeder(
        [image, label], place, program=val_program)
def test(epoch, program):
batch_id = 0
......@@ -117,80 +123,99 @@ def compress(args):
acc_top5_ns = []
for data in val_reader():
start_time = time.time()
            acc_top1_n, acc_top5_n = exe.run(
                program,
                feed=train_feeder.feed(data),
                fetch_list=[acc_top1.name, acc_top5.name])
end_time = time.time()
print("Eval epoch[{}] batch[{}] - acc_top1: {}; acc_top5: {}; time: {}".format(epoch, batch_id, np.mean(acc_top1_n), np.mean(acc_top5_n), end_time-start_time))
print(
"Eval epoch[{}] batch[{}] - acc_top1: {}; acc_top5: {}; time: {}".
format(epoch, batch_id,
np.mean(acc_top1_n),
np.mean(acc_top5_n), end_time - start_time))
acc_top1_ns.append(np.mean(acc_top1_n))
acc_top5_ns.append(np.mean(acc_top5_n))
batch_id += 1
print("Final eval epoch[{}] - acc_top1: {}; acc_top5: {}".format(epoch, np.mean(np.array(acc_top1_ns)), np.mean(np.array(acc_top5_ns))))
print("Final eval epoch[{}] - acc_top1: {}; acc_top5: {}".format(
epoch,
np.mean(np.array(acc_top1_ns)), np.mean(np.array(acc_top5_ns))))
def train(epoch, program):
build_strategy = fluid.BuildStrategy()
exec_strategy = fluid.ExecutionStrategy()
train_program = fluid.compiler.CompiledProgram(
            program).with_data_parallel(
                loss_name=avg_cost.name,
                build_strategy=build_strategy,
                exec_strategy=exec_strategy)
batch_id = 0
for data in train_reader():
start_time = time.time()
            loss_n, acc_top1_n, acc_top5_n, lr_n = exe.run(
                train_program,
                feed=train_feeder.feed(data),
                fetch_list=[
                    avg_cost.name, acc_top1.name, acc_top5.name,
                    "learning_rate"
                ])
end_time = time.time()
loss_n = np.mean(loss_n)
acc_top1_n = np.mean(acc_top1_n)
acc_top5_n = np.mean(acc_top5_n)
lr_n = np.mean(lr_n)
print("epoch[{}]-batch[{}] - loss: {}; acc_top1: {}; acc_top5: {};lrn: {}; time: {}".format(epoch, batch_id, loss_n, acc_top1_n, acc_top5_n, lr_n,end_time-start_time))
print(
"epoch[{}]-batch[{}] - loss: {}; acc_top1: {}; acc_top5: {};lrn: {}; time: {}".
format(epoch, batch_id, loss_n, acc_top1_n, acc_top5_n, lr_n,
end_time - start_time))
batch_id += 1
params = []
for param in fluid.default_main_program().global_block().all_parameters():
#if "_weights" in param.name and "conv1_weights" not in param.name:
if "_sep_weights" in param.name:
if "_sep_weights" in param.name:
params.append(param.name)
print("fops before pruning: {}".format(flops(fluid.default_main_program())))
print("fops before pruning: {}".format(
flops(fluid.default_main_program())))
pruned_program_iter = fluid.default_main_program()
pruned_val_program_iter = val_program
for ratios in ratiolist:
pruner = Pruner()
        pruned_val_program_iter = pruner.prune(
            pruned_val_program_iter,
            fluid.global_scope(),
            params=params,
            ratios=ratios,
            place=place,
            only_graph=True)
        pruned_program_iter = pruner.prune(
            pruned_program_iter,
            fluid.global_scope(),
            params=params,
            ratios=ratios,
            place=place)
print("fops after pruning: {}".format(flops(pruned_program_iter)))
""" do not inherit learning rate """
if(os.path.exists(args.pretrained_model + "/learning_rate")):
os.remove( args.pretrained_model + "/learning_rate")
if(os.path.exists(args.pretrained_model + "/@LR_DECAY_COUNTER@")):
os.remove( args.pretrained_model + "/@LR_DECAY_COUNTER@")
fluid.io.load_vars(exe, args.pretrained_model , main_program = pruned_program_iter, predicate=if_exist)
if (os.path.exists(args.pretrained_model + "/learning_rate")):
os.remove(args.pretrained_model + "/learning_rate")
if (os.path.exists(args.pretrained_model + "/@LR_DECAY_COUNTER@")):
os.remove(args.pretrained_model + "/@LR_DECAY_COUNTER@")
fluid.io.load_vars(
exe,
args.pretrained_model,
main_program=pruned_program_iter,
predicate=if_exist)
pruned_program = pruned_program_iter
pruned_val_program = pruned_val_program_iter
for i in range(args.num_epochs):
train(i, pruned_program)
test(i, pruned_val_program)
        save_model(args, exe, pruned_program, pruned_val_program, i)
def main():
args = parser.parse_args()
......
......@@ -41,9 +41,10 @@ add_arg('test_period', int, 10, "Test period in epoches.")
model_list = [m for m in dir(models) if "__" not in m]
ratiolist = [
    # [0.06, 0.0, 0.09, 0.03, 0.09, 0.02, 0.05, 0.03, 0.0, 0.07, 0.07, 0.05, 0.08],
    # [0.08, 0.02, 0.03, 0.13, 0.1, 0.06, 0.03, 0.04, 0.14, 0.02, 0.03, 0.02, 0.01],
]
def piecewise_decay(args):
step = int(math.ceil(float(args.total_images) / args.batch_size))
......@@ -121,8 +122,8 @@ def compress(args):
# fluid.io.load_vars(exe, args.pretrained_model, predicate=if_exist)
    val_reader = paddle.fluid.io.batch(val_reader, batch_size=args.batch_size)
    train_reader = paddle.fluid.io.batch(
train_reader, batch_size=args.batch_size, drop_last=True)
train_feeder = feeder = fluid.DataFeeder([image, label], place)
......@@ -194,21 +195,26 @@ def compress(args):
for ratios in ratiolist:
pruner = Pruner()
        pruned_val_program_iter = pruner.prune(
            pruned_val_program_iter,
            fluid.global_scope(),
            params=params,
            ratios=ratios,
            place=place,
            only_graph=True)
        pruned_program_iter = pruner.prune(
            pruned_program_iter,
            fluid.global_scope(),
            params=params,
            ratios=ratios,
            place=place)
print("fops after pruning: {}".format(flops(pruned_program_iter)))
fluid.io.load_vars(exe, args.pretrained_model , main_program = pruned_program_iter, predicate=if_exist)
fluid.io.load_vars(
exe,
args.pretrained_model,
main_program=pruned_program_iter,
predicate=if_exist)
pruner = AutoPruner(
pruned_val_program_iter,
......@@ -238,8 +244,6 @@ def compress(args):
pruner.reward(score)
def main():
args = parser.parse_args()
print_arguments(args)
......
CUDA_VISIBLE_DEVICES=0 python2 -u train_cell_base.py
import paddle.fluid as fluid
from paddleslim.teachers.bert.reader.cls import *
from paddleslim.nas.darts.search_space import AdaBERTClassifier
from paddleslim.nas.darts import DARTSearch
def main():
place = fluid.CUDAPlace(0)
BERT_BASE_PATH = "./data/pretrained_models/uncased_L-12_H-768_A-12/"
bert_config_path = BERT_BASE_PATH + "/bert_config.json"
vocab_path = BERT_BASE_PATH + "/vocab.txt"
data_dir = "./data/glue_data/MNLI/"
max_seq_len = 512
do_lower_case = True
batch_size = 32
epoch = 30
processor = MnliProcessor(
data_dir=data_dir,
vocab_path=vocab_path,
max_seq_len=max_seq_len,
do_lower_case=do_lower_case,
in_tokens=False)
train_reader = processor.data_generator(
batch_size=batch_size,
phase='train',
epoch=epoch,
dev_count=1,
shuffle=True)
val_reader = processor.data_generator(
batch_size=batch_size,
phase='train',
epoch=epoch,
dev_count=1,
shuffle=True)
with fluid.dygraph.guard(place):
model = AdaBERTClassifier(
3,
teacher_model="/work/PaddleSlim/demo/bert_1/checkpoints/steps_23000"
)
searcher = DARTSearch(
model,
train_reader,
val_reader,
batchsize=batch_size,
num_epochs=epoch,
log_freq=10)
searcher.train()
if __name__ == '__main__':
main()
import numpy as np
# itertools.izip exists only on Python 2; fall back to built-in zip on Python 3
try:
    from itertools import izip
except ImportError:
    izip = zip
import paddle.fluid as fluid
from paddleslim.teachers.bert.reader.cls import *
from paddleslim.nas.darts.search_space import AdaBERTClassifier
from paddleslim.nas.darts.architect_for_bert import Architect
import logging
from paddleslim.common import AvgrageMeter, get_logger
logger = get_logger(__name__, level=logging.INFO)
def count_parameters_in_MB(all_params):
parameters_number = 0
for param in all_params:
if param.trainable:
parameters_number += np.prod(param.shape)
return parameters_number / 1e6
def model_loss(model, data_ids):
# src_ids = data_ids[0]
# position_ids = data_ids[1]
# sentence_ids = data_ids[2]
# input_mask = data_ids[3]
labels = data_ids[4]
labels.stop_gradient = True
enc_output = model(data_ids)
ce_loss, probs = fluid.layers.softmax_with_cross_entropy(
logits=enc_output, label=labels, return_softmax=True)
loss = fluid.layers.mean(x=ce_loss)
num_seqs = fluid.layers.create_tensor(dtype='int64')
accuracy = fluid.layers.accuracy(input=probs, label=labels, total=num_seqs)
return loss, accuracy
def train_one_epoch(model, architect, train_loader, valid_loader, optimizer,
epoch, use_data_parallel, log_freq):
ce_losses = AvgrageMeter()
accs = AvgrageMeter()
model.train()
step_id = 0
    for train_data, valid_data in izip(train_loader(), valid_loader()):
architect.step(train_data, valid_data)
loss, acc = model_loss(model, train_data)
if use_data_parallel:
loss = model.scale_loss(loss)
loss.backward()
model.apply_collective_grads()
else:
loss.backward()
optimizer.minimize(loss)
model.clear_gradients()
batch_size = train_data[0].shape[0]
ce_losses.update(loss.numpy(), batch_size)
accs.update(acc.numpy(), batch_size)
if step_id % log_freq == 0:
logger.info(
"Train Epoch {}, Step {}, Lr {:.6f} loss {:.6f}; acc: {:.6f};".
format(epoch, step_id,
optimizer.current_step_lr(), ce_losses.avg[0], accs.avg[
0]))
step_id += 1
def valid_one_epoch(model, valid_loader, epoch, log_freq):
ce_losses = AvgrageMeter()
accs = AvgrageMeter()
model.eval()
step_id = 0
for valid_data in valid_loader():
loss, acc = model_loss(model, valid_data)
batch_size = valid_data[0].shape[0]
ce_losses.update(loss.numpy(), batch_size)
accs.update(acc.numpy(), batch_size)
if step_id % log_freq == 0:
logger.info("Valid Epoch {}, Step {}, loss {:.6f}; acc: {:.6f};".
format(epoch, step_id, ce_losses.avg[0], accs.avg[0]))
step_id += 1
def main():
use_data_parallel = False
place = fluid.CUDAPlace(fluid.dygraph.parallel.Env(
).dev_id) if use_data_parallel else fluid.CUDAPlace(0)
BERT_BASE_PATH = "./data/pretrained_models/uncased_L-12_H-768_A-12"
bert_config_path = BERT_BASE_PATH + "/bert_config.json"
vocab_path = BERT_BASE_PATH + "/vocab.txt"
data_dir = "./data/glue_data/MNLI/"
teacher_model_dir = "./teacher_model/steps_23000"
num_samples = 392702
max_seq_len = 128
do_lower_case = True
batch_size = 128
hidden_size = 768
emb_size = 768
max_layer = 8
epoch = 80
log_freq = 10
use_fixed_gumbel = True
processor = MnliProcessor(
data_dir=data_dir,
vocab_path=vocab_path,
max_seq_len=max_seq_len,
do_lower_case=do_lower_case,
in_tokens=False)
train_reader = processor.data_generator(
batch_size=batch_size,
phase='search_train',
epoch=1,
dev_count=1,
shuffle=True)
val_reader = processor.data_generator(
batch_size=batch_size,
phase='search_valid',
epoch=1,
dev_count=1,
shuffle=True)
    if use_data_parallel:
        train_reader = fluid.contrib.reader.distributed_batch_reader(
            train_reader)
        val_reader = fluid.contrib.reader.distributed_batch_reader(
            val_reader)
with fluid.dygraph.guard(place):
model = AdaBERTClassifier(
3,
n_layer=max_layer,
hidden_size=hidden_size,
emb_size=emb_size,
teacher_model=teacher_model_dir,
data_dir=data_dir,
use_fixed_gumbel=use_fixed_gumbel)
if use_data_parallel:
strategy = fluid.dygraph.parallel.prepare_context()
model = fluid.dygraph.parallel.DataParallel(model, strategy)
device_num = fluid.dygraph.parallel.Env().nranks
step_per_epoch = int(num_samples / (batch_size * device_num))
learning_rate = fluid.dygraph.CosineDecay(2e-2, step_per_epoch, epoch)
model_parameters = [
p for p in model.parameters()
if p.name not in [a.name for a in model.arch_parameters()]
]
clip = fluid.clip.GradientClipByGlobalNorm(clip_norm=5.0)
optimizer = fluid.optimizer.MomentumOptimizer(
learning_rate,
0.9,
regularization=fluid.regularizer.L2DecayRegularizer(3e-4),
parameter_list=model_parameters,
grad_clip=clip)
train_loader = fluid.io.DataLoader.from_generator(
capacity=1024,
use_double_buffer=True,
iterable=True,
return_list=True)
valid_loader = fluid.io.DataLoader.from_generator(
capacity=1024,
use_double_buffer=True,
iterable=True,
return_list=True)
train_loader.set_batch_generator(train_reader, places=place)
valid_loader.set_batch_generator(val_reader, places=place)
architect = Architect(model, learning_rate, 3e-4, place, False)
for epoch_id in range(epoch):
train_one_epoch(model, architect, train_loader, valid_loader,
optimizer, epoch_id, use_data_parallel, log_freq)
valid_one_epoch(model, valid_loader, epoch_id, log_freq)
print(model.student._encoder.alphas.numpy())
print("=" * 100)
if __name__ == '__main__':
main()
import paddle.fluid as fluid
from paddleslim.teachers.bert import BERTClassifier
place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id)
with fluid.dygraph.guard(place):
bert = BERTClassifier(3)
bert.fit("./data/glue_data/MNLI/",
5,
batch_size=32,
use_data_parallel=True,
learning_rate=0.00005,
save_steps=1000)
......@@ -2,9 +2,31 @@
This example shows how to run differentiable architecture search with PaddlePaddle. The [DARTS](https://arxiv.org/abs/1806.09055) and [PC-DARTS](https://arxiv.org/abs/1907.05737) methods work out of the box, and the code can be adapted to other differentiable architecture search algorithms. A minimal sketch of one search step follows the file list below.
The directory layout of this example:
```
├── genotypes.py       Genotypes of the architectures found during search
├── model.py           Builds the searched sub-network
├── model_search.py    Builds the super-network used during search
├── operations.py      Candidate operations used in the search
├── reader.py          Data reading and augmentation
├── search.py          Entry point for architecture search
├── train.py           Entry point for evaluation training on CIFAR10
├── train_imagenet.py  Entry point for evaluation training on ImageNet
├── visualize.py       Entry point for architecture visualization
```
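For orientation, one search step alternates two gradient updates, roughly as sketched below. The method names (`model.loss`, the two optimizers) are assumptions for illustration, not the repo's exact API; the real loop lives in PaddleSlim's `DARTSearch`:

```python
def darts_search_step(model, arch_optimizer, weight_optimizer,
                      train_batch, valid_batch):
    # Sketch of one first-order DARTS step; `model.loss` and the two
    # optimizers are assumed, not the repo's exact API.
    # 1) update the architecture parameters (alpha) on validation data
    arch_loss = model.loss(valid_batch)
    arch_loss.backward()
    arch_optimizer.minimize(arch_loss)
    model.clear_gradients()
    # 2) update the network weights (w) on training data
    train_loss = model.loss(train_batch)
    train_loss.backward()
    weight_optimizer.minimize(train_loss)
    model.clear_gradients()
```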
## Dependencies
> PaddlePaddle >= 1.8.0, PaddleSlim >= 1.1.0, graphviz >= 0.11.1
## Dataset
......@@ -20,6 +42,15 @@ python search.py # DARTS first-order approximation search
python search.py --unrolled=True # DARTS second-order approximation search
python search.py --method='PC-DARTS' --batch_size=256 --learning_rate=0.1 --arch_learning_rate=6e-4 --epochs_no_archopt=15 # PC-DARTS search
```
If you run in a Docker environment, make sure shared memory is large enough for the multi-process dataloader; if you hit shared-memory problems, set `--use_multiprocess=False`.
Architecture search can also run on multiple cards. Using 4 cards (GPU ids 0-3) as an example, launch with:
```bash
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog search.py --use_data_parallel 1
```
Because multi-card training multiplies the total batch size by n (the number of cards), scale the initial learning rate by the same n to match single-card accuracy (e.g., with 4 cards: 0.025 × 4 = 0.1).
Figure 1 shows how the architecture changes over the search epochs. Note that the accuracy (Acc) in the figure is not the final accuracy of that architecture; to obtain the best accuracy for a given structure, run evaluation training on the resulting genotype.
......@@ -40,6 +71,15 @@ python train.py --arch='PC_DARTS' # evaluation training of the searched structure on CIFAR10
python train_imagenet.py --arch='PC_DARTS' # evaluation training of the searched structure on ImageNet
```
Evaluation training also supports multiple cards. Using 4 cards (GPU ids 0-3) as an example, launch with:
```bash
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train.py --use_data_parallel 1 --arch='DARTS_V2'
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train_imagenet.py --use_data_parallel 1 --arch='DARTS_V2'
```
Likewise, multi-card training multiplies the total batch size by n (the number of cards); scale the initial learning rate by n to match single-card accuracy.
Results of evaluation training for the searched `DARTS_V1`, `DARTS_V2`, and `PC-DARTS` architectures:
| Architecture | Dataset | Accuracy |
......
......@@ -37,8 +37,7 @@ IMAGE_DEPTH = 3
CIFAR_MEAN = [0.49139968, 0.48215827, 0.44653124]
CIFAR_STD = [0.24703233, 0.24348505, 0.26158768]
URL_PREFIX = 'https://www.cs.toronto.edu/~kriz/'
CIFAR10_URL = 'https://dataset.bj.bcebos.com/cifar%2Fcifar-10-python.tar.gz'
CIFAR10_MD5 = 'c58f30108f718f92721af3b95e74349a'
paddle.dataset.common.DATA_HOME = "dataset/"
......@@ -140,32 +139,10 @@ def train_search(batch_size, train_portion, is_shuffle, args):
split_point = int(np.floor(train_portion * len(datasets)))
train_datasets = datasets[:split_point]
val_datasets = datasets[split_point:]
train_readers = []
val_readers = []
n = int(math.ceil(len(train_datasets) // args.num_workers)
) if args.use_multiprocess else len(train_datasets)
train_datasets_lists = [
train_datasets[i:i + n] for i in range(0, len(train_datasets), n)
reader = [
reader_generator(train_datasets, batch_size, True, True, args),
reader_generator(val_datasets, batch_size, True, True, args)
]
return reader
......@@ -174,18 +151,8 @@ def train_valid(batch_size, is_train, is_shuffle, args):
datasets = cifar10_reader(
paddle.dataset.common.download(CIFAR10_URL, 'cifar', CIFAR10_MD5),
name, is_shuffle, args)
reader = reader_generator(datasets, batch_size, is_train, is_shuffle, args)
return reader
......
......@@ -35,9 +35,7 @@ add_arg = functools.partial(add_arguments, argparser=parser)
# yapf: disable
add_arg('log_freq', int, 50, "Log frequency.")
add_arg('data', str, 'dataset/cifar10',"The dir of dataset.")
add_arg('use_multiprocess', bool, True, "Whether use multiprocess reader.")
add_arg('batch_size', int, 64, "Minibatch size.")
add_arg('learning_rate', float, 0.025, "The start learning rate.")
add_arg('momentum', float, 0.9, "Momentum.")
......@@ -80,6 +78,7 @@ def main(args):
model,
train_reader,
valid_reader,
place,
learning_rate=args.learning_rate,
batchsize=args.batch_size,
num_imgs=args.trainset_num,
......@@ -87,8 +86,9 @@ def main(args):
unrolled=args.unrolled,
num_epochs=args.epochs,
epochs_no_archopt=args.epochs_no_archopt,
use_gpu=args.use_gpu,
use_multiprocess=args.use_multiprocess,
use_data_parallel=args.use_data_parallel,
save_dir=args.model_save_dir,
log_freq=args.log_freq)
searcher.train()
......
......@@ -19,13 +19,14 @@ from __future__ import print_function
import os
import sys
import ast
import logging
import argparse
import functools
import paddle.fluid as fluid
from paddle.fluid.dygraph.base import to_variable
from paddleslim.common import AvgrageMeter, get_logger
from paddleslim.nas.darts import count_parameters_in_MB
import genotypes
import reader
......@@ -38,8 +39,7 @@ parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
# yapf: disable
add_arg('use_multiprocess', bool, True, "Whether use multiprocess reader.")
add_arg('data', str, 'dataset/cifar10',"The dir of dataset.")
add_arg('batch_size', int, 96, "Minibatch size.")
add_arg('learning_rate', float, 0.025, "The start learning rate.")
......@@ -140,9 +140,6 @@ def main(args):
if args.use_data_parallel else fluid.CUDAPlace(0)
with fluid.dygraph.guard(place):
genotype = eval("genotypes.%s" % args.arch)
model = Network(
C=args.init_channels,
......@@ -151,7 +148,12 @@ def main(args):
auxiliary=args.auxiliary,
genotype=genotype)
logger.info("param size = {:.6f}MB".format(
count_parameters_in_MB(model.parameters())))
device_num = fluid.dygraph.parallel.Env().nranks
step_per_epoch = int(args.trainset_num /
(args.batch_size * device_num))
learning_rate = fluid.dygraph.CosineDecay(args.learning_rate,
step_per_epoch, args.epochs)
clip = fluid.clip.GradientClipByGlobalNorm(clip_norm=args.grad_clip)
......@@ -163,18 +165,21 @@ def main(args):
grad_clip=clip)
if args.use_data_parallel:
strategy = fluid.dygraph.parallel.prepare_context()
model = fluid.dygraph.parallel.DataParallel(model, strategy)
train_loader = fluid.io.DataLoader.from_generator(
capacity=64,
use_double_buffer=True,
iterable=True,
return_list=True,
use_multiprocess=args.use_multiprocess)
valid_loader = fluid.io.DataLoader.from_generator(
capacity=64,
use_double_buffer=True,
iterable=True,
return_list=True,
use_multiprocess=args.use_multiprocess)
train_reader = reader.train_valid(
batch_size=args.batch_size,
......@@ -186,13 +191,13 @@ def main(args):
is_train=False,
is_shuffle=False,
args=args)
if args.use_data_parallel:
train_reader = fluid.contrib.reader.distributed_batch_reader(
train_reader)
train_loader.set_batch_generator(train_reader, places=place)
valid_loader.set_batch_generator(valid_reader, places=place)
save_parameters = (not args.use_data_parallel) or (
args.use_data_parallel and
fluid.dygraph.parallel.Env().local_rank == 0)
......
......@@ -19,13 +19,15 @@ from __future__ import print_function
import os
import sys
import ast
import logging
import argparse
import functools
import paddle.fluid as fluid
from paddle.fluid.dygraph.base import to_variable
from paddleslim.common import AvgrageMeter, get_logger
from paddleslim.nas.darts import count_parameters_in_MB
import genotypes
import reader
from model import NetworkImageNet as Network
......@@ -66,7 +68,7 @@ add_arg('use_data_parallel', ast.literal_eval, False, "The flag indicating whet
def cross_entropy_label_smooth(preds, targets, epsilon):
preds = fluid.layers.softmax(preds)
targets_one_hot = fluid.one_hot(input=targets, depth=args.class_num)
targets_smooth = fluid.layers.label_smooth(
targets_one_hot, epsilon=epsilon, dtype="float32")
loss = fluid.layers.cross_entropy(
......@@ -152,9 +154,6 @@ def main(args):
if args.use_data_parallel else fluid.CUDAPlace(0)
with fluid.dygraph.guard(place):
genotype = eval("genotypes.%s" % args.arch)
model = Network(
C=args.init_channels,
......@@ -163,7 +162,12 @@ def main(args):
auxiliary=args.auxiliary,
genotype=genotype)
logger.info("param size = {:.6f}MB".format(
count_parameters_in_MB(model.parameters())))
device_num = fluid.dygraph.parallel.Env().nranks
step_per_epoch = int(args.trainset_num /
(args.batch_size * device_num))
learning_rate = fluid.dygraph.ExponentialDecay(
args.learning_rate,
step_per_epoch,
......@@ -179,6 +183,7 @@ def main(args):
grad_clip=clip)
if args.use_data_parallel:
strategy = fluid.dygraph.parallel.prepare_context()
model = fluid.dygraph.parallel.DataParallel(model, strategy)
train_loader = fluid.io.DataLoader.from_generator(
......@@ -199,20 +204,19 @@ def main(args):
valid_reader = fluid.io.batch(
reader.imagenet_reader(args.data_dir, 'val'),
batch_size=args.batch_size)
if args.use_data_parallel:
train_reader = fluid.contrib.reader.distributed_batch_reader(
train_reader)
train_loader.set_sample_list_generator(train_reader, places=place)
valid_loader.set_sample_list_generator(valid_reader, places=place)
save_parameters = (not args.use_data_parallel) or (
args.use_data_parallel and
fluid.dygraph.parallel.Env().local_rank == 0)
best_top1 = 0
for epoch in range(args.epochs):
logger.info('Epoch {}, lr {:.6f}'.format(
epoch, optimizer.current_step_lr()))
train_top1, train_top5 = train(model, train_loader, optimizer,
epoch, args)
......
......@@ -133,9 +133,9 @@ def compress(args):
place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace()
exe = fluid.Executor(place)
train_reader = paddle.fluid.io.batch(
train_reader, batch_size=args.batch_size, drop_last=True)
val_reader = paddle.fluid.io.batch(
val_reader, batch_size=args.batch_size, drop_last=True)
val_program = student_program.clone(for_test=True)
......
......@@ -165,7 +165,7 @@
"metadata": {},
"outputs": [],
"source": [
"train_reader = paddle.batch(\n",
"train_reader = paddle.fluid.io.batch(\n",
" paddle.dataset.mnist.train(), batch_size=128, drop_last=True)\n",
"train_feeder = fluid.DataFeeder(['image', 'label'], fluid.CPUPlace(), student_program)"
]
......
CMAKE_MINIMUM_REQUIRED(VERSION 3.2)
project(mkldnn_quantaware_demo CXX C)
set(DEMO_SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR})
set(DEMO_BINARY_DIR ${CMAKE_CURRENT_BINARY_DIR})
option(USE_GPU "Compile the inference code with CUDA GPU support" OFF)
option(USE_PROFILER "Whether enable Paddle's profiler." OFF)
set(USE_SHARED OFF)
set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_CURRENT_SOURCE_DIR}/cmake")
if(NOT PADDLE_ROOT)
set(PADDLE_ROOT ${DEMO_SOURCE_DIR}/fluid_inference)
endif()
find_package(Fluid)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -O3")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -std=c++11")
if(USE_PROFILER)
find_package(Gperftools REQUIRED)
include_directories(${GPERFTOOLS_INCLUDE_DIR})
add_definitions(-DWITH_GPERFTOOLS)
endif()
include_directories(${CMAKE_CURRENT_SOURCE_DIR})
if(PADDLE_FOUND)
add_executable(inference sample_tester.cc)
target_link_libraries(inference
${PADDLE_LIBRARIES}
${PADDLE_THIRD_PARTY_LIBRARIES}
rt dl pthread)
if (mklml_FOUND)
target_link_libraries(inference "-L${THIRD_PARTY_ROOT}/install/mklml/lib -liomp5 -Wl,--as-needed")
endif()
else()
message(FATAL_ERROR "Cannot find PaddlePaddle Fluid under ${PADDLE_ROOT}")
endif()
# Optimized CPU deployment and inference of INT8 image classification models
## Overview
This document describes how to convert, deploy, and run quantized models produced by PaddleSlim on CPU. On an Intel(R) Xeon(R) Gold 6271 machine, the converted INT8 model runs 3-4 times faster than the optimized FP32 model, with only a marginal loss of accuracy.
The workflow is as follows:
- Produce the quantized model: train a quantized model with PaddleSlim. Note that the weight values should fall within the INT8 range, although they are still stored as floats.
- Convert the quantized model on CPU: use DNNL on CPU to convert the quantized model into a true INT8 model.
- Deploy and predict on CPU: deploy the demo application on CPU and run inference.
## 1. Preparation
#### Install and build PaddleSlim
To install PaddleSlim, follow the [official installation guide](https://paddlepaddle.github.io/PaddleSlim/install.html):
```
git clone https://github.com/PaddlePaddle/PaddleSlim.git
cd PaddleSlim
python setup.py install
```
#### Use in your code
In your own test code, import Paddle and PaddleSlim as follows:
```
import paddle
import paddle.fluid as fluid
import paddleslim as slim
import numpy as np
```
## 2. Produce a quantized model with PaddleSlim
Use PaddleSlim to produce either a quantization-aware-trained model or a post-training (offline) quantized model.
#### 2.1 Quantization-aware training
The quantization-aware training workflow is described in [Quantization-aware training for classification models](https://paddlepaddle.github.io/PaddleSlim/tutorials/quant_aware_demo/).
**Note the config parameters used during quantization-aware training:**
- **quantize_op_types:** The CPU currently supports quantizing `depthwise_conv2d`, `mul`, `conv2d`, `matmul`, `transpose2`, `reshape2`, `pool2d`, and `scale`. During training, however, fake quantize/dequantize ops only need to be inserted around the first four op types: the remaining ops (`transpose2`, `reshape2`, `pool2d`, `scale`) do not change their input/output scales and obtain them from the surrounding ops. So `quantize_op_types` only needs `depthwise_conv2d`, `mul`, `conv2d`, and `matmul`; see the sketch after this list.
- **Other parameters:** see the [PaddleSlim quant_aware API](https://paddlepaddle.github.io/PaddleSlim/api/quantization_api/#quant_aware).
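As a minimal sketch (the training program below is a placeholder, not part of this demo), these notes translate into a `quant_aware` call along these lines:
```
import paddle.fluid as fluid
import paddleslim as slim

place = fluid.CPUPlace()
main_prog = fluid.default_main_program()  # placeholder: your training program

config = {
    # Only these four op types need fake quantize/dequantize ops inserted;
    # the scale-preserving ops pick their scales up from surrounding ops.
    'quantize_op_types': ['depthwise_conv2d', 'mul', 'conv2d', 'matmul'],
    'weight_bits': 8,
    'activation_bits': 8,
}
quant_program = slim.quant.quant_aware(main_prog, place, config, for_test=False)
```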
#### 2.2 Static post-training quantization
Producing a statically post-training-quantized model is described in [Static post-training quantization for classification models](https://paddlepaddle.github.io/PaddleSlim/tutorials/quant_post_demo/#_1), and sketched below.
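A minimal sketch of the call (the paths and the calibration reader are placeholders; see the linked tutorial for the full flow):
```
import numpy as np
import paddle.fluid as fluid
import paddleslim as slim

def calib_reader():  # stand-in calibration reader; use real samples instead
    for _ in range(160):
        yield [np.random.random((3, 224, 224)).astype('float32')]

exe = fluid.Executor(fluid.CPUPlace())
slim.quant.quant_post(
    executor=exe,
    model_dir='./float_model',           # saved FP32 inference model
    quantize_model_path='./quant_post',  # where the quantized model is written
    sample_generator=calib_reader,
    batch_size=16,
    batch_nums=10)
```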
## 3. Convert the quantized model into a DNNL-optimized INT8 model
For CPU deployment, the saved quant model is run through a conversion script that removes the fake quantize/dequantize ops, fuses certain ops, and converts the model fully to INT8. The script must be run from your Paddle directory; upstream it lives at [save_qat_model.py](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/contrib/slim/tests/save_qat_model.py). Copy the script into the demo directory (`/PATH_TO_PaddleSlim/demo/mkldnn_quant/quant_aware/`) and run:
```
python save_qat_model.py --qat_model_path=/PATH/TO/SAVE/FLOAT32/QAT/MODEL --int8_model_save_path=/PATH/TO/SAVE/INT8/MODEL --ops_to_quantize="conv2d,pool2d"
```
**Parameters:**
- **qat_model_path:** input path, required. The quant model produced by quantization-aware training.
- **int8_model_save_path:** path where the final INT8 model, DNNL-optimized and quantized, is saved. Note: qat_model_path must point to a quant model that still contains the fake quantize/dequantize ops produced by quantization-aware training.
- **ops_to_quantize:** required, must not be left unset. The list of ops to quantize in the final INT8 model. For image classification models, set `--ops_to_quantize="conv2d,pool2d"`. For NLP models such as Ernie, set `--ops_to_quantize="fc,reshape2,transpose2,matmul"`. The list must be chosen manually, because quantizing every quantizable op does not necessarily give the best speed.
Notes:
- The ops currently supported for DNNL quantization are `conv2d`, `depthwise_conv2d`, `mul`, `fc`, `matmul`, `pool2d`, `reshape2`, `transpose2`, and `concat`; only ops from this list may be chosen.
- Quantizing all quantizable ops is not necessarily fastest, which is why the list is user-specified. For example, if an op becomes an isolated INT8 op that cannot fuse with the ops before or after it, quantizing it requires a quantize op before it and a dequantize op after it, which may end up slower than keeping the op in FP32. Since the user's model is unknown, no default is provided; suggested settings for image classification and NLP tasks are given above.
- An effective way to find the best configuration is to check which quantizable ops the model actually uses, then try several different `ops_to_quantize` combinations and measure, as sketched below.
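A minimal sketch of that survey (the model path is a placeholder; this assumes an inference model loadable with `fluid.io.load_inference_model`):
```
import paddle.fluid as fluid

DNNL_QUANTIZABLE = {'conv2d', 'depthwise_conv2d', 'mul', 'fc', 'matmul',
                    'pool2d', 'reshape2', 'transpose2', 'concat'}

exe = fluid.Executor(fluid.CPUPlace())
program, _, _ = fluid.io.load_inference_model('/PATH/TO/MODEL', exe)
used = {op.type for block in program.blocks for op in block.ops}
print(sorted(used & DNNL_QUANTIZABLE))  # candidate values for --ops_to_quantize
```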
## 4. Inference
### 4.1 Data preprocessing and conversion
For the accuracy and performance tests, the data must first be converted into a binary file. Running the script below converts the complete ILSVRC2012 val dataset; with `--local` it converts your own data instead. Run it from your Paddle directory; upstream it lives at [full_ILSVRC2012_val_preprocess.py](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py).
```
python Paddle/paddle/fluid/inference/tests/api/full_ILSVRC2012_val_preprocess.py --local --data_dir=/PATH/TO/USER/DATASET/ --output_file=/PATH/TO/SAVE/BINARY/FILE
```
Optional arguments:
- With no arguments set, the script downloads the ILSVRC2012_img_val dataset and converts it into a binary file.
- **local:** set it to true to indicate that you will provide your own data.
- **data_dir:** directory of your own data.
- **label_list:** a file listing image path / image class pairs, similar to `val_list.txt`.
- **output_file:** path of the generated binary file.
- **data_dim:** height and width of the preprocessed images. Default: 224.
A user-provided dataset should have the following directory layout:
```
imagenet_user
├── val
│   ├── ILSVRC2012_val_00000001.jpg
│   ├── ILSVRC2012_val_00000002.jpg
│   ├── ...
└── val_list.txt
```
where val_list.txt should look like:
```
val/ILSVRC2012_val_00000001.jpg 0
val/ILSVRC2012_val_00000002.jpg 0
```
Note:
- Why convert the dataset into a binary file? Data preprocessing in Paddle (resize, crop, etc.) is done with the Python Image (PIL) module, and the trained models assume Python-preprocessed images, but we found that the Python test overhead is large and degrades inference performance. To get good performance when benchmarking quantized models, we therefore use a C++ test; C++ supports libraries such as OpenCV, but Paddle discourages depending on external libraries, so the images are preprocessed in Python, written to a binary file, and read back in the C++ test. You can, if you prefer, modify the C++ test to read and preprocess data directly with OpenCV; accuracy will not drop much. A Python test, `sample_tester.py`, is also provided for reference; compared with the C++ test `sample_tester.cc` it shows the larger overhead of the Python path. The binary layout itself is sketched below.
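For reference, the generated binary file holds an int64 image count, then all images as float32 3x224x224 arrays, then all labels as int64, exactly as read by `sample_tester.py` and `sample_tester.cc`. A few lines of Python can probe it (the path is a placeholder):
```
import struct

BIN_FILE = '/PATH/TO/SAVE/BINARY/FILE'  # placeholder path
with open(BIN_FILE, 'rb') as f:
    num = struct.unpack('q', f.read(8))[0]  # int64 image count
    img_bytes = 3 * 224 * 224 * 4           # one float32 CHW image
    f.seek(8 + num * img_bytes)             # labels follow all the images
    first_label = struct.unpack('q', f.read(8))[0]
print('images:', num, 'first label:', first_label)
```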
### 4.2 Deployment and inference
#### Prerequisites
- The speedup is only obtained on servers with AVX512-family CPUs. Run `lscpu` on the command line to check which instruction sets your machine supports.
- On CPU servers supporting `avx512_vnni`, INT8 accuracy is highest and the speedup is largest.
#### Prepare the inference library
You can either build the Paddle inference library from source or download a prebuilt one.
- To build the Paddle inference library from source, see [Build from source](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/advanced_guide/inference_deployment/inference/build_and_install_lib_cn.html#id12) and use release/2.0 or later.
- Alternatively, download a released [inference library](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/advanced_guide/inference_deployment/inference/build_and_install_lib_cn.html) from the Paddle website. Choose the latest release or develop build of `ubuntu14.04_cpu_avx_mkl`.
Unpack the prepared inference library, rename it to fluid_inference, and place it in the current directory (`/PATH_TO_PaddleSlim/demo/mkldnn_quant/quant_aware/`); alternatively, point cmake at the Paddle inference library by setting PADDLE_ROOT.
#### Build the application
The sample lives in `demo/mkldnn_quant/quant_aware/` under PaddleSlim; the sample `sample_tester.cc` and the `cmake` folder needed for the build are both in this directory.
```
cd /PATH/TO/PaddleSlim
cd demo/mkldnn_quant/quant_aware
mkdir build
cd build
cmake ..  # or: cmake -DPADDLE_ROOT=/PATH/TO/PADDLE/INFERENCE/LIB ..
make -j
```
If you downloaded and unpacked the [inference library](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/advanced_guide/inference_deployment/inference/build_and_install_lib_cn.html) into the current directory, `-DPADDLE_ROOT` can be left unset here, because `PADDLE_ROOT` defaults to `demo/mkldnn_quant/quant_aware/fluid_inference`.
#### Run the test
```
# Bind threads to cores
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1
# Turbo Boost could be set to OFF using the command
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
# In the file run.sh, set `MODEL_DIR` to `/PATH/TO/FLOAT32/MODEL` or `/PATH/TO/SAVE/INT8/MODEL`
# In the file run.sh, set `DATA_FILE` to `/PATH/TO/SAVE/BINARY/FILE`
# For 1 thread performance:
./run.sh
# For 20 thread performance:
./run.sh -1 20
```
The following flags need to be configured at run time:
- **infer_model:** directory of the model; note that the model parameters currently must be saved as separate files. Set it to `PATH/TO/SAVE/INT8/MODEL` or `PATH/TO/SAVE/FLOAT32/MODEL`. No default.
- **infer_data:** path to the test data file. It must be a binary file produced by `full_ILSVRC2012_val_preprocess`.
- **batch_size:** inference batch size. Default: 50.
- **iterations:** how many batches to run. Default 0, meaning all batches in infer_data (image count / batch size).
- **num_threads:** number of CPU threads used for inference. Default: a single thread on one core.
- **with_accuracy_layer:** this is a generic image classification test that can run both FP32 and INT8 models; set this flag according to whether the model does or does not contain a label/accuracy layer.
- **optimize_fp32_model:** whether to optimize the FP32 model under test. The sample can test a saved INT8 model, or optimize (fuse, etc.) and then test an FP32 model. Default false, meaning a converted INT8 model is tested and no further optimization is needed.
- **use_profile:** provided by the Paddle inference library; set it to enable profiling. Default: false.
You can simply edit MODEL_DIR and DATA_FILE in `run.sh` under `/PATH_TO_PaddleSlim/demo/mkldnn_quant/quant_aware/` and then execute `./run.sh` to run CPU inference.
### 4.3 Writing your own test
If you write your own test:
1. Testing an INT8 model
To test a converted INT8 model, paddle::NativeConfig is sufficient; in the demo, set `optimize_fp32_model` to false.
2. Testing an FP32 model
To test an FP32 model, use AnalysisConfig to optimize the original FP32 model (fuses, etc.) before testing. Configure AnalysisConfig as follows:
```
static void SetConfig(paddle::AnalysisConfig *cfg) {
  cfg->SetModel(FLAGS_infer_model);  // Required. The model to be tested.
  cfg->DisableGpu();                 // Required. Inference runs on CPU, so GPU must be disabled.
  cfg->EnableMKLDNN();               // Required. Use MKL-DNN kernels, which are faster than the native ones.
  cfg->SwitchIrOptim();              // If an original FP32 model is passed in, setting this to true optimizes and accelerates it (fuses, etc.).
  cfg->SetCpuMathLibraryNumThreads(FLAGS_num_threads);  // Defaults to 1. Enables multi-threaded execution.
  if (FLAGS_use_profile) {
    cfg->EnableProfile();            // Optional. If use_profile is set, per-op timings are reported at the end of the run.
  }
}
```
In the provided sample, setting `optimize_fp32_model` to true and passing the original FP32 model as `infer_model` applies the AnalysisConfig settings above, and the FP32 model is optimized and accelerated by DNNL (including fuses).
If infer_model points to an INT8 model, optimize_fp32_model has no effect, because the INT8 model has already been optimized and quantized.
If infer_model points to a model produced by PaddleSlim, optimize_fp32_model likewise has no effect: a quant model contains fake quantize/dequantize ops, which prevent fusing and optimization.
## 5. Accuracy and performance
For INT8 model accuracy and performance results, see [Accuracy and performance of INT8 models deployed for CPU inference](https://github.com/PaddlePaddle/PaddleSlim/tree/develop/docs/zh_cn/tutorials/image_classification_mkldnn_quant_aware_tutorial.md).
## FAQ
- For deploying and running NLP models on CPU, see the sample [ERNIE QAT INT8 accuracy and performance reproduction](https://github.com/PaddlePaddle/benchmark/tree/master/Inference/c%2B%2B/ernie/mkldnn).
- Details of the DNNL optimizations are described in [SLIM QAT for INT8 DNNL](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/contrib/slim/tests/QAT_mkldnn_int8_readme.md).
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License
set(PADDLE_FOUND OFF)
if(NOT PADDLE_ROOT)
set(PADDLE_ROOT $ENV{PADDLE_ROOT} CACHE PATH "Paddle Path")
endif()
if(NOT PADDLE_ROOT)
message(FATAL_ERROR "Set PADDLE_ROOT as your root directory installed PaddlePaddle")
endif()
set(THIRD_PARTY_ROOT ${PADDLE_ROOT}/third_party)
if(USE_GPU)
set(CUDA_ROOT $ENV{CUDA_ROOT} CACHE PATH "CUDA root Path")
set(CUDNN_ROOT $ENV{CUDNN_ROOT} CACHE PATH "CUDNN root Path")
endif()
# Supported directory organizations
find_path(PADDLE_INC_DIR NAMES paddle_inference_api.h PATHS ${PADDLE_ROOT}/paddle/include)
if(PADDLE_INC_DIR)
set(LIB_PATH "paddle/lib")
else()
find_path(PADDLE_INC_DIR NAMES paddle/fluid/inference/paddle_inference_api.h PATHS ${PADDLE_ROOT})
if(PADDLE_INC_DIR)
include_directories(${PADDLE_ROOT}/paddle/fluid/inference)
endif()
set(LIB_PATH "paddle/fluid/inference")
endif()
include_directories(${PADDLE_INC_DIR})
find_library(PADDLE_FLUID_SHARED_LIB NAMES "libpaddle_fluid.so" PATHS
${PADDLE_ROOT}/${LIB_PATH})
find_library(PADDLE_FLUID_STATIC_LIB NAMES "libpaddle_fluid.a" PATHS
${PADDLE_ROOT}/${LIB_PATH})
if(USE_SHARED AND PADDLE_INC_DIR AND PADDLE_FLUID_SHARED_LIB)
set(PADDLE_FOUND ON)
add_library(paddle_fluid_shared SHARED IMPORTED)
set_target_properties(paddle_fluid_shared PROPERTIES IMPORTED_LOCATION
${PADDLE_FLUID_SHARED_LIB})
set(PADDLE_LIBRARIES paddle_fluid_shared)
message(STATUS "Found PaddlePaddle Fluid (include: ${PADDLE_INC_DIR}; "
"library: ${PADDLE_FLUID_SHARED_LIB}")
elseif(PADDLE_INC_DIR AND PADDLE_FLUID_STATIC_LIB)
set(PADDLE_FOUND ON)
add_library(paddle_fluid_static STATIC IMPORTED)
set_target_properties(paddle_fluid_static PROPERTIES IMPORTED_LOCATION
${PADDLE_FLUID_STATIC_LIB})
set(PADDLE_LIBRARIES paddle_fluid_static)
message(STATUS "Found PaddlePaddle Fluid (include: ${PADDLE_INC_DIR}; "
"library: ${PADDLE_FLUID_STATIC_LIB}")
else()
set(PADDLE_FOUND OFF)
message(WARNING "Cannot find PaddlePaddle Fluid under ${PADDLE_ROOT}")
return()
endif()
# including directory of third_party libraries
set(PADDLE_THIRD_PARTY_INC_DIRS)
function(third_party_include TARGET_NAME HEADER_NAME TARGET_DIRNAME)
find_path(PADDLE_${TARGET_NAME}_INC_DIR NAMES ${HEADER_NAME} PATHS
${TARGET_DIRNAME}
NO_DEFAULT_PATH)
if(PADDLE_${TARGET_NAME}_INC_DIR)
message(STATUS "Found PaddlePaddle third_party including directory: " ${PADDLE_${TARGET_NAME}_INC_DIR})
set(PADDLE_THIRD_PARTY_INC_DIRS ${PADDLE_THIRD_PARTY_INC_DIRS} ${PADDLE_${TARGET_NAME}_INC_DIR} PARENT_SCOPE)
endif()
endfunction()
third_party_include(glog glog/logging.h ${THIRD_PARTY_ROOT}/install/glog/include)
third_party_include(protobuf google/protobuf/message.h ${THIRD_PARTY_ROOT}/install/protobuf/include)
third_party_include(gflags gflags/gflags.h ${THIRD_PARTY_ROOT}/install/gflags/include)
third_party_include(eigen unsupported/Eigen/CXX11/Tensor ${THIRD_PARTY_ROOT}/eigen3)
third_party_include(boost boost/config.hpp ${THIRD_PARTY_ROOT}/boost)
if(USE_GPU)
third_party_include(cuda cuda.h ${CUDA_ROOT}/include)
third_party_include(cudnn cudnn.h ${CUDNN_ROOT}/include)
endif()
message(STATUS "PaddlePaddle need to include these third party directories: ${PADDLE_THIRD_PARTY_INC_DIRS}")
include_directories(${PADDLE_THIRD_PARTY_INC_DIRS})
set(PADDLE_THIRD_PARTY_LIBRARIES)
function(third_party_library TARGET_NAME TARGET_DIRNAME)
set(library_names ${ARGN})
set(local_third_party_libraries)
foreach(lib ${library_names})
string(REGEX REPLACE "^lib" "" lib_noprefix ${lib})
if(${lib} MATCHES "${CMAKE_STATIC_LIBRARY_SUFFIX}$")
set(libtype STATIC)
string(REGEX REPLACE "${CMAKE_STATIC_LIBRARY_SUFFIX}$" "" libname ${lib_noprefix})
elseif(${lib} MATCHES "${CMAKE_SHARED_LIBRARY_SUFFIX}(\\.[0-9]+)?$")
set(libtype SHARED)
string(REGEX REPLACE "${CMAKE_SHARED_LIBRARY_SUFFIX}(\\.[0-9]+)?$" "" libname ${lib_noprefix})
else()
message(FATAL_ERROR "Unknown library type: ${lib}")
endif()
#message(STATUS "libname: ${libname}")
find_library(${libname}_LIBRARY NAMES "${lib}" PATHS
${TARGET_DIRNAME}
NO_DEFAULT_PATH)
if(${libname}_LIBRARY)
set(${TARGET_NAME}_FOUND ON PARENT_SCOPE)
add_library(${libname} ${libtype} IMPORTED)
set_target_properties(${libname} PROPERTIES IMPORTED_LOCATION ${${libname}_LIBRARY})
set(local_third_party_libraries ${local_third_party_libraries} ${libname})
message(STATUS "Found PaddlePaddle third_party library: " ${${libname}_LIBRARY})
else()
set(${TARGET_NAME}_FOUND OFF PARENT_SCOPE)
message(WARNING "Cannot find ${lib} under ${THIRD_PARTY_ROOT}")
endif()
endforeach()
set(PADDLE_THIRD_PARTY_LIBRARIES ${PADDLE_THIRD_PARTY_LIBRARIES} ${local_third_party_libraries} PARENT_SCOPE)
endfunction()
third_party_library(mklml ${THIRD_PARTY_ROOT}/install/mklml/lib libiomp5.so libmklml_intel.so)
third_party_library(mkldnn ${THIRD_PARTY_ROOT}/install/mkldnn/lib libmkldnn.so)
if(NOT mkldnn_FOUND)
third_party_library(mkldnn ${THIRD_PARTY_ROOT}/install/mkldnn/lib libmkldnn.so.0)
endif()
if(NOT USE_SHARED)
third_party_library(glog ${THIRD_PARTY_ROOT}/install/glog/lib libglog.a)
third_party_library(protobuf ${THIRD_PARTY_ROOT}/install/protobuf/lib libprotobuf.a)
third_party_library(gflags ${THIRD_PARTY_ROOT}/install/gflags/lib libgflags.a)
if(NOT mklml_FOUND)
third_party_library(openblas ${THIRD_PARTY_ROOT}/install/openblas/lib libopenblas.a)
endif()
third_party_library(zlib ${THIRD_PARTY_ROOT}/install/zlib/lib libz.a)
third_party_library(snappystream ${THIRD_PARTY_ROOT}/install/snappystream/lib libsnappystream.a)
third_party_library(snappy ${THIRD_PARTY_ROOT}/install/snappy/lib libsnappy.a)
third_party_library(xxhash ${THIRD_PARTY_ROOT}/install/xxhash/lib libxxhash.a)
if(USE_GPU)
third_party_library(cudart ${CUDA_ROOT}/lib64 libcudart.so)
endif()
endif()
\ No newline at end of file
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License
# Tries to find Gperftools.
#
# Usage of this module as follows:
#
# find_package(Gperftools)
#
# Variables used by this module, they can change the default behaviour and need
# to be set before calling find_package:
#
# Gperftools_ROOT_DIR Set this variable to the root installation of
# Gperftools if the module has problems finding
# the proper installation path.
#
# Variables defined by this module:
#
# GPERFTOOLS_FOUND System has Gperftools libs/headers
# GPERFTOOLS_LIBRARIES The Gperftools libraries (tcmalloc & profiler)
# GPERFTOOLS_INCLUDE_DIR The location of Gperftools headers
find_library(GPERFTOOLS_TCMALLOC
NAMES tcmalloc
HINTS ${Gperftools_ROOT_DIR}/lib)
find_library(GPERFTOOLS_PROFILER
NAMES profiler
HINTS ${Gperftools_ROOT_DIR}/lib)
find_library(GPERFTOOLS_TCMALLOC_AND_PROFILER
NAMES tcmalloc_and_profiler
HINTS ${Gperftools_ROOT_DIR}/lib)
find_path(GPERFTOOLS_INCLUDE_DIR
NAMES gperftools/heap-profiler.h
HINTS ${Gperftools_ROOT_DIR}/include)
set(GPERFTOOLS_LIBRARIES ${GPERFTOOLS_TCMALLOC_AND_PROFILER})
include(FindPackageHandleStandardArgs)
find_package_handle_standard_args(
Gperftools
DEFAULT_MSG
GPERFTOOLS_LIBRARIES
GPERFTOOLS_INCLUDE_DIR)
mark_as_advanced(
Gperftools_ROOT_DIR
GPERFTOOLS_TCMALLOC
GPERFTOOLS_PROFILER
GPERFTOOLS_TCMALLOC_AND_PROFILER
GPERFTOOLS_LIBRARIES
GPERFTOOLS_INCLUDE_DIR)
# create IMPORTED targets
if (Gperftools_FOUND AND NOT TARGET gperftools::tcmalloc)
add_library(gperftools::tcmalloc UNKNOWN IMPORTED)
set_target_properties(gperftools::tcmalloc PROPERTIES
IMPORTED_LOCATION ${GPERFTOOLS_TCMALLOC}
INTERFACE_INCLUDE_DIRECTORIES "${GPERFTOOLS_INCLUDE_DIR}")
add_library(gperftools::profiler UNKNOWN IMPORTED)
set_target_properties(gperftools::profiler PROPERTIES
IMPORTED_LOCATION ${GPERFTOOLS_PROFILER}
INTERFACE_INCLUDE_DIRECTORIES "${GPERFTOOLS_INCLUDE_DIR}")
endif()
#!/bin/bash
MODEL_DIR=$HOME/repo/Paddle/resnet50_quant_int8
DATA_FILE=$HOME/.cache/paddle/dataset/int8/download/int8_full_val.bin
num_threads=10
with_accuracy_layer=false
use_profile=true
ITERATIONS=0
./build/inference --logtostderr=1 \
--infer_model=${MODEL_DIR} \
--infer_data=${DATA_FILE} \
--batch_size=1 \
--num_threads=${num_threads} \
--iterations=${ITERATIONS} \
--with_accuracy_layer=${with_accuracy_layer} \
--use_profile=${use_profile} \
--optimize_fp32_model=false
/* Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <gflags/gflags.h>
#include <glog/logging.h>
#include <paddle_inference_api.h>
#include <algorithm>
#include <chrono>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <numeric>
#include <sstream>
#include <string>
#include <vector>
#ifdef WITH_GPERFTOOLS
#include <gperftools/profiler.h>
#include <paddle/fluid/platform/profiler.h>
#endif
DEFINE_string(infer_model, "", "path to the model");
DEFINE_string(infer_data, "", "path to the input data");
DEFINE_int32(batch_size, 50, "inference batch size");
DEFINE_int32(iterations,
0,
"number of batches to process. 0 means testing whole dataset");
DEFINE_int32(num_threads, 1, "num of threads to run in parallel");
DEFINE_bool(with_accuracy_layer,
true,
"Set with_accuracy_layer to true if provided model has accuracy layer and requires label input");
DEFINE_bool(use_profile, false, "Set use_profile to true to get profile information");
DEFINE_bool(optimize_fp32_model, false, "If optimize_fp32_model is set to true, fp32 model will be optimized");
struct Timer {
std::chrono::high_resolution_clock::time_point start;
std::chrono::high_resolution_clock::time_point startu;
void tic() { start = std::chrono::high_resolution_clock::now(); }
double toc() {
startu = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> time_span =
std::chrono::duration_cast<std::chrono::duration<double>>(startu -
start);
double used_time_ms = static_cast<double>(time_span.count()) * 1000.0;
return used_time_ms;
}
};
template <typename T>
constexpr paddle::PaddleDType GetPaddleDType();
template <>
constexpr paddle::PaddleDType GetPaddleDType<int64_t>() {
return paddle::PaddleDType::INT64;
}
template <>
constexpr paddle::PaddleDType GetPaddleDType<float>() {
return paddle::PaddleDType::FLOAT32;
}
template <typename T>
class TensorReader {
public:
TensorReader(std::ifstream &file,
size_t beginning_offset,
std::vector<int> shape,
std::string name)
: file_(file), position_(beginning_offset), shape_(shape), name_(name) {
numel_ = std::accumulate(
shape_.begin(), shape_.end(), size_t{1}, std::multiplies<size_t>());
}
paddle::PaddleTensor NextBatch() {
paddle::PaddleTensor tensor;
tensor.name = name_;
tensor.shape = shape_;
tensor.dtype = GetPaddleDType<T>();
tensor.data.Resize(numel_ * sizeof(T));
file_.seekg(position_);
file_.read(static_cast<char *>(tensor.data.data()), numel_ * sizeof(T));
position_ = file_.tellg();
if (file_.eof()) LOG(ERROR) << name_ << ": reached end of stream";
if (file_.bad()) LOG(ERROR) << name_ << "ERROR: badbit is true";
if (file_.fail())
throw std::runtime_error(name_ + ": failed reading file.");
return tensor;
}
protected:
std::ifstream &file_;
size_t position_;
std::vector<int> shape_;
std::string name_;
size_t numel_;
};
void SetInput(std::vector<std::vector<paddle::PaddleTensor>> *inputs,
std::vector<paddle::PaddleTensor> *labels_gt,
bool with_accuracy_layer = FLAGS_with_accuracy_layer,
int32_t batch_size = FLAGS_batch_size) {
std::ifstream file(FLAGS_infer_data, std::ios::binary);
if (!file) {
throw std::runtime_error("Couldn't open file: " + FLAGS_infer_data);
}
int64_t total_images{0};
file.seekg(0, std::ios::beg);
file.read(reinterpret_cast<char *>(&total_images), sizeof(total_images));
LOG(INFO) << "Total images in file: " << total_images;
std::vector<int> image_batch_shape{batch_size, 3, 224, 224};
std::vector<int> label_batch_shape{batch_size, 1};
auto images_offset_in_file = static_cast<size_t>(file.tellg());
TensorReader<float> image_reader(
file, images_offset_in_file, image_batch_shape, "image");
auto iterations_max = total_images / batch_size;
auto iterations = iterations_max;
if (FLAGS_iterations > 0 && FLAGS_iterations < iterations_max) {
iterations = FLAGS_iterations;
}
auto labels_offset_in_file =
images_offset_in_file + sizeof(float) * total_images * 3 * 224 * 224;
TensorReader<int64_t> label_reader(
file, labels_offset_in_file, label_batch_shape, "label");
for (auto i = 0; i < iterations; i++) {
auto images = image_reader.NextBatch();
std::vector<paddle::PaddleTensor> tmp_vec;
tmp_vec.push_back(std::move(images));
auto labels = label_reader.NextBatch();
if (with_accuracy_layer) {
tmp_vec.push_back(std::move(labels));
} else {
labels_gt->push_back(std::move(labels));
}
inputs->push_back(std::move(tmp_vec));
}
}
static void PrintTime(int batch_size,
int num_threads,
double batch_latency,
int epoch = 1) {
double sample_latency = batch_latency / batch_size;
LOG(INFO) <<"Model: "<<FLAGS_infer_model;
LOG(INFO) << "====== num of threads: " << num_threads << " ======";
LOG(INFO) << "====== batch size: " << batch_size << ", iterations: " << epoch;
LOG(INFO) << "====== batch latency: " << batch_latency
<< "ms, number of samples: " << batch_size * epoch;
LOG(INFO) << ", sample latency: " << sample_latency
<< "ms, fps: " << 1000.f / sample_latency << " ======";
}
void PredictionRun(paddle::PaddlePredictor *predictor,
const std::vector<std::vector<paddle::PaddleTensor>> &inputs,
std::vector<std::vector<paddle::PaddleTensor>> *outputs,
int num_threads,
float *sample_latency = nullptr) {
int iterations = inputs.size(); // process the whole dataset ...
if (FLAGS_iterations > 0 &&
FLAGS_iterations < static_cast<int64_t>(inputs.size()))
iterations =
FLAGS_iterations; // ... unless the number of iterations is set
outputs->resize(iterations);
Timer run_timer;
double elapsed_time = 0;
#ifdef WITH_GPERFTOOLS
ResetProfiler();
ProfilerStart("paddle_inference.prof");
#endif
int predicted_num = 0;
for (int i = 0; i < iterations; i++) {
run_timer.tic();
predictor->Run(inputs[i], &(*outputs)[i], FLAGS_batch_size);
elapsed_time += run_timer.toc();
predicted_num += FLAGS_batch_size;
if (predicted_num % 100 == 0) {
LOG(INFO) << "Infer " << predicted_num << " samples";
}
}
#ifdef WITH_GPERFTOOLS
ProfilerStop();
#endif
auto batch_latency = elapsed_time / iterations;
PrintTime(FLAGS_batch_size, num_threads, batch_latency, iterations);
if (sample_latency != nullptr)
*sample_latency = batch_latency / FLAGS_batch_size;
}
std::pair<float, float> CalculateAccuracy(
const std::vector<std::vector<paddle::PaddleTensor>> &outputs,
const std::vector<paddle::PaddleTensor> &labels_gt,
bool with_accuracy = FLAGS_with_accuracy_layer) {
LOG_IF(ERROR, !with_accuracy && labels_gt.size() == 0)
<< "if with_accuracy set to false, labels_gt must be not empty";
std::vector<float> acc1_ss;
std::vector<float> acc5_ss;
if (!with_accuracy) { // model with_accuracy_layer = false
float *result_array; // for one batch 50*1000
int64_t *batch_labels; // 50*1
LOG_IF(ERROR, outputs.size() != labels_gt.size())
<< "outputs first dimension must be equal to labels_gt first dimension";
for (auto i = 0; i < outputs.size();
++i) { // same as labels first dimension
result_array = static_cast<float *>(outputs[i][0].data.data());
batch_labels = static_cast<int64_t *>(labels_gt[i].data.data());
int correct_1 = 0, correct_5 = 0, total = FLAGS_batch_size;
for (auto j = 0; j < FLAGS_batch_size; j++) { // batch_size
std::vector<float> v(result_array + j * 1000,
result_array + (j + 1) * 1000);
std::vector<std::pair<float, int>> vx;
for (int k = 0; k < 1000; k++) {
vx.push_back(std::make_pair(v[k], k));
}
std::partial_sort(vx.begin(),
vx.begin() + 5,
vx.end(),
[](std::pair<float, int> a, std::pair<float, int> b) {
return a.first > b.first;
});
if (static_cast<int>(batch_labels[j]) == vx[0].second) correct_1 += 1;
if (std::find_if(vx.begin(),
vx.begin() + 5,
[batch_labels, j](std::pair<float, int> a) {
return static_cast<int>(batch_labels[j]) == a.second;
}) != vx.begin() + 5)
correct_5 += 1;
}
acc1_ss.push_back(static_cast<float>(correct_1) /
static_cast<float>(total));
acc5_ss.push_back(static_cast<float>(correct_5) /
static_cast<float>(total));
}
} else { // model with_accuracy_layer = true
for (auto i = 0; i < outputs.size(); ++i) {
LOG_IF(ERROR, outputs[i].size() < 3UL) << "To get top1 and top5 "
"accuracy, output[i] size must "
"be bigger than or equal to 3";
acc1_ss.push_back(
*static_cast<float *>(outputs[i][1].data.data())); // 1 is top1 acc
acc5_ss.push_back(*static_cast<float *>(
outputs[i][2].data.data())); // 2 is top5 acc or mAP
}
}
auto acc1_ss_avg =
std::accumulate(acc1_ss.begin(), acc1_ss.end(), 0.0) / acc1_ss.size();
auto acc5_ss_avg =
std::accumulate(acc5_ss.begin(), acc5_ss.end(), 0.0) / acc5_ss.size();
return std::make_pair(acc1_ss_avg, acc5_ss_avg);
}
static void SetIrOptimConfig(paddle::AnalysisConfig *cfg) {
cfg->DisableGpu();
cfg->SwitchIrOptim();
cfg->EnableMKLDNN();
if(FLAGS_use_profile){
cfg->EnableProfile();
}
}
std::unique_ptr<paddle::PaddlePredictor> CreatePredictor(
const paddle::PaddlePredictor::Config *config, bool use_analysis = true) {
const auto *analysis_config =
reinterpret_cast<const paddle::AnalysisConfig *>(config);
if (use_analysis) {
return paddle::CreatePaddlePredictor<paddle::AnalysisConfig>(
*analysis_config);
}
auto native_config = analysis_config->ToNativeConfig();
return paddle::CreatePaddlePredictor<paddle::NativeConfig>(native_config);
}
int main(int argc, char *argv[]) {
// InitFLAGS(argc, argv);
google::InitGoogleLogging(*argv);
gflags::ParseCommandLineFlags(&argc, &argv, true);
paddle::AnalysisConfig cfg;
cfg.SetModel(FLAGS_infer_model);
cfg.SetCpuMathLibraryNumThreads(FLAGS_num_threads);
if (FLAGS_optimize_fp32_model){
SetIrOptimConfig(&cfg);
}
std::vector<std::vector<paddle::PaddleTensor>> input_slots_all;
std::vector<std::vector<paddle::PaddleTensor>> outputs;
std::vector<paddle::PaddleTensor> labels_gt; // optional
SetInput(&input_slots_all, &labels_gt); // iterations*batch_size
auto predictor = CreatePredictor(reinterpret_cast<paddle::PaddlePredictor::Config *>(&cfg), FLAGS_optimize_fp32_model);
PredictionRun(predictor.get(), input_slots_all, &outputs, FLAGS_num_threads);
auto acc_pair = CalculateAccuracy(outputs, labels_gt);
LOG(INFO) <<"Top1 accuracy: " << std::fixed << std::setw(6)
<<std::setprecision(4) << acc_pair.first;
LOG(INFO) <<"Top5 accuracy: " << std::fixed << std::setw(6)
<<std::setprecision(4) << acc_pair.second;
}
# copyright (c) 2020 paddlepaddle authors. all rights reserved.
#
# licensed under the apache license, version 2.0 (the "license");
# you may not use this file except in compliance with the license.
# you may obtain a copy of the license at
#
# http://www.apache.org/licenses/license-2.0
#
# unless required by applicable law or agreed to in writing, software
# distributed under the license is distributed on an "as is" basis,
# without warranties or conditions of any kind, either express or implied.
# see the license for the specific language governing permissions and
# limitations under the license.
import unittest
import os
import sys
import argparse
import logging
import struct
import six
import numpy as np
import time
import paddle
import paddle.fluid as fluid
from paddle.fluid.framework import IrGraph
from paddle.fluid import core
logging.basicConfig(format='%(asctime)s-%(levelname)s: %(message)s')
_logger = logging.getLogger(__name__)
_logger.setLevel(logging.INFO)
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument(
'--batch_size', type=int, default=1, help='Batch size.')
parser.add_argument(
'--skip_batch_num',
type=int,
default=0,
help='Number of the first minibatches to skip in performance statistics.'
)
parser.add_argument(
'--infer_model',
type=str,
default='',
help='A path to an Inference model.')
parser.add_argument(
'--infer_data', type=str, default='', help='Data file.')
parser.add_argument(
'--batch_num',
type=int,
default=0,
help='Number of batches to process. 0 or less means whole dataset. Default: 0.'
)
parser.add_argument(
'--with_accuracy_layer',
type=bool,
default=False,
help='The model is with accuracy or without accuracy layer')
test_args, args = parser.parse_known_args(namespace=unittest)
return test_args, sys.argv[:1] + args
class SampleTester(unittest.TestCase):
def _reader_creator(self, data_file='data.bin'):
def reader():
with open(data_file, 'rb') as fp:
num = fp.read(8)
num = struct.unpack('q', num)[0]
imgs_offset = 8
img_ch = 3
img_w = 224
img_h = 224
img_pixel_size = 4
img_size = img_ch * img_h * img_w * img_pixel_size
label_size = 8
labels_offset = imgs_offset + num * img_size
step = 0
while step < num:
fp.seek(imgs_offset + img_size * step)
img = fp.read(img_size)
img = struct.unpack_from(
'{}f'.format(img_ch * img_w * img_h), img)
img = np.array(img)
img.shape = (img_ch, img_w, img_h)
fp.seek(labels_offset + label_size * step)
label = fp.read(label_size)
label = struct.unpack('q', label)[0]
yield img, int(label)
step += 1
return reader
def _get_batch_accuracy(self, batch_output=None, labels=None):
total = 0
correct = 0
correct_5 = 0
for n, result in enumerate(batch_output):
index = result.argsort()
top_1_index = index[-1]
top_5_index = index[-5:]
total += 1
if top_1_index == labels[n]:
correct += 1
if labels[n] in top_5_index:
correct_5 += 1
acc1 = float(correct) / float(total)
acc5 = float(correct_5) / float(total)
return acc1, acc5
def _prepare_for_fp32_mkldnn(self, graph):
ops = graph.all_op_nodes()
for op_node in ops:
name = op_node.name()
if name in ['depthwise_conv2d']:
input_var_node = graph._find_node_by_name(
op_node.inputs, op_node.input("Input")[0])
weight_var_node = graph._find_node_by_name(
op_node.inputs, op_node.input("Filter")[0])
output_var_node = graph._find_node_by_name(
graph.all_var_nodes(), op_node.output("Output")[0])
attrs = {
name: op_node.op().attr(name)
for name in op_node.op().attr_names()
}
conv_op_node = graph.create_op_node(
op_type='conv2d',
attrs=attrs,
inputs={
'Input': input_var_node,
'Filter': weight_var_node
},
outputs={'Output': output_var_node})
graph.link_to(input_var_node, conv_op_node)
graph.link_to(weight_var_node, conv_op_node)
graph.link_to(conv_op_node, output_var_node)
graph.safe_remove_nodes(op_node)
return graph
def _predict(self,
test_reader=None,
model_path=None,
with_accuracy_layer=False,
batch_size=1,
batch_num=1,
skip_batch_num=0):
place = fluid.CPUPlace()
exe = fluid.Executor(place)
inference_scope = fluid.executor.global_scope()
with fluid.scope_guard(inference_scope):
if os.path.exists(os.path.join(model_path, '__model__')):
[inference_program, feed_target_names, fetch_targets
] = fluid.io.load_inference_model(model_path, exe)
else:
[inference_program, feed_target_names,
fetch_targets] = fluid.io.load_inference_model(
model_path, exe, 'model', 'params')
graph = IrGraph(core.Graph(inference_program.desc), for_test=True)
graph = self._prepare_for_fp32_mkldnn(graph)
inference_program = graph.to_program()
dshape = [3, 224, 224]
outputs = []
infer_accs1 = []
infer_accs5 = []
batch_acc1 = 0.0
batch_acc5 = 0.0
fpses = []
batch_times = []
batch_time = 0.0
total_samples = 0
iters = 0
infer_start_time = time.time()
for data in test_reader():
if batch_num > 0 and iters >= batch_num:
break
if iters == skip_batch_num:
total_samples = 0
infer_start_time = time.time()
if six.PY2:
images = map(lambda x: x[0].reshape(dshape), data)
if six.PY3:
images = list(map(lambda x: x[0].reshape(dshape), data))
images = np.array(images).astype('float32')
labels = np.array([x[1] for x in data]).astype('int64')
if not with_accuracy_layer:
# models that do not have accuracy measuring layers
start = time.time()
out = exe.run(inference_program,
feed={feed_target_names[0]: images},
fetch_list=fetch_targets)
batch_time = (time.time() - start) * 1000  # in milliseconds
outputs.append(out[0])
# Calculate accuracy result
batch_acc1, batch_acc5 = self._get_batch_accuracy(out[0],
labels)
else:
# models have accuracy measuring layers
labels = labels.reshape([-1, 1])
start = time.time()
out = exe.run(inference_program,
feed={
feed_target_names[0]: images,
feed_target_names[1]: labels
},
fetch_list=fetch_targets)
batch_time = (time.time() - start) * 1000  # in milliseconds
batch_acc1, batch_acc5 = out[1][0], out[2][0]
outputs.append(batch_acc1)
infer_accs1.append(batch_acc1)
infer_accs5.append(batch_acc5)
samples = len(data)
total_samples += samples
batch_times.append(batch_time)
fps = samples / batch_time * 1000
fpses.append(fps)
iters += 1
appx = ' (warm-up)' if iters <= skip_batch_num else ''
_logger.info('batch {0}{5}, acc1: {1:.4f}, acc5: {2:.4f}, '
'latency: {3:.4f} ms, fps: {4:.2f}'.format(
iters, batch_acc1, batch_acc5, batch_time /
batch_size, fps, appx))
# Postprocess benchmark data
batch_latencies = batch_times[skip_batch_num:]
batch_latency_avg = np.average(batch_latencies)
latency_avg = batch_latency_avg / batch_size
fpses = fpses[skip_batch_num:]
fps_avg = np.average(fpses)
infer_total_time = time.time() - infer_start_time
acc1_avg = np.mean(infer_accs1)
acc5_avg = np.mean(infer_accs5)
_logger.info('Total inference run time: {:.2f} s'.format(
infer_total_time))
return outputs, acc1_avg, acc5_avg, fps_avg, latency_avg
def test_graph_transformation(self):
if not fluid.core.is_compiled_with_mkldnn():
return
infer_model_path = test_case_args.infer_model
assert infer_model_path, 'The model path cannot be empty. Please, use the --infer_model option.'
data_path = test_case_args.infer_data
assert data_path, 'The dataset path cannot be empty. Please, use the --infer_data option.'
batch_size = test_case_args.batch_size
batch_num = test_case_args.batch_num
skip_batch_num = test_case_args.skip_batch_num
with_accuracy_layer = test_case_args.with_accuracy_layer
_logger.info('Inference model: {0}'.format(infer_model_path))
_logger.info('Dataset: {0}'.format(data_path))
_logger.info('Batch size: {0}'.format(batch_size))
_logger.info('Batch number: {0}'.format(batch_num))
_logger.info('--- Inference prediction start ---')
val_reader = paddle.batch(
self._reader_creator(data_path), batch_size=batch_size)
fp32_output, fp32_acc1, fp32_acc5, fp32_fps, fp32_lat = self._predict(
val_reader, infer_model_path, with_accuracy_layer, batch_size,
batch_num, skip_batch_num)
_logger.info(
'Inference: avg top1 accuracy: {0:.4f}, avg top5 accuracy: {1:.4f}'.
format(fp32_acc1, fp32_acc5))
_logger.info('Inference: avg fps: {0:.2f}, avg latency: {1:.4f} ms'.
format(fp32_fps, fp32_lat))
if __name__ == '__main__':
global test_case_args
test_case_args, remaining_args = parse_args()
unittest.main(argv=remaining_args)
from __future__ import absolute_import
from .mobilenet import MobileNet
from .resnet import ResNet34, ResNet50
from .resnet_vd import ResNet50_vd, ResNet101_vd
from .mobilenet_v2 import MobileNetV2_x0_25, MobileNetV2
from .pvanet import PVANet
from .slimfacenet import SlimFaceNet_A_x0_60, SlimFaceNet_B_x0_75, SlimFaceNet_C_x0_75
from .mobilenet_v3 import *
__all__ = [
"model_list", "MobileNet", "ResNet34", "ResNet50", "MobileNetV2", "PVANet",
"ResNet50_vd"
"ResNet50_vd", "ResNet101_vd", "MobileNetV2_x0_25"
]
model_list = [
'MobileNet', 'ResNet34', 'ResNet50', 'MobileNetV2', 'PVANet',
'ResNet50_vd', "ResNet101_vd", "MobileNetV2_x0_25"
]
__all__ += mobilenet_v3.__all__
model_list += mobilenet_v3.__all__
import paddle.fluid as fluid
from paddle.fluid.initializer import MSRA
from paddle.fluid.param_attr import ParamAttr
import math
__all__ = [
'MobileNetV3', 'MobileNetV3_small_x0_25', 'MobileNetV3_small_x0_5',
'MobileNetV3_small_x0_75', 'MobileNetV3_small_x1_0',
'MobileNetV3_small_x1_25', 'MobileNetV3_large_x0_25',
'MobileNetV3_large_x0_5', 'MobileNetV3_large_x0_75',
'MobileNetV3_large_x1_0', 'MobileNetV3_large_x1_25',
'MobileNetV3_large_x2_0'
]
class MobileNetV3():
def __init__(self, scale=1.0, model_name='small'):
self.scale = scale
self.inplanes = 16
if model_name == "large":
self.cfg = [
# k, exp, c, se, nl, s,
[3, 16, 16, False, 'relu', 1],
[3, 64, 24, False, 'relu', 2],
[3, 72, 24, False, 'relu', 1],
[5, 72, 40, True, 'relu', 2],
[5, 120, 40, True, 'relu', 1],
[5, 120, 40, True, 'relu', 1],
[3, 240, 80, False, 'hard_swish', 2],
[3, 200, 80, False, 'hard_swish', 1],
[3, 184, 80, False, 'hard_swish', 1],
[3, 184, 80, False, 'hard_swish', 1],
[3, 480, 112, True, 'hard_swish', 1],
[3, 672, 112, True, 'hard_swish', 1],
[5, 672, 160, True, 'hard_swish', 2],
[5, 960, 160, True, 'hard_swish', 1],
[5, 960, 160, True, 'hard_swish', 1],
]
self.cls_ch_squeeze = 960
self.cls_ch_expand = 1280
elif model_name == "small":
self.cfg = [
# k, exp, c, se, nl, s,
[3, 16, 16, True, 'relu', 2],
[3, 72, 24, False, 'relu', 2],
[3, 88, 24, False, 'relu', 1],
[5, 96, 40, True, 'hard_swish', 2],
[5, 240, 40, True, 'hard_swish', 1],
[5, 240, 40, True, 'hard_swish', 1],
[5, 120, 48, True, 'hard_swish', 1],
[5, 144, 48, True, 'hard_swish', 1],
[5, 288, 96, True, 'hard_swish', 2],
[5, 576, 96, True, 'hard_swish', 1],
[5, 576, 96, True, 'hard_swish', 1],
]
self.cls_ch_squeeze = 576
self.cls_ch_expand = 1280
else:
raise NotImplementedError
def net(self, input, class_dim=1000):
scale = self.scale
inplanes = self.inplanes
cfg = self.cfg
cls_ch_squeeze = self.cls_ch_squeeze
cls_ch_expand = self.cls_ch_expand
#conv1
conv = self.conv_bn_layer(
input,
filter_size=3,
#num_filters=int(scale*inplanes),
num_filters=inplanes if scale <= 1.0 else int(inplanes * scale),
stride=2,
padding=1,
num_groups=1,
if_act=True,
act='hard_swish',
name='conv1')
print(conv.shape)
i = 0
for layer_cfg in cfg:
conv = self.residual_unit(
input=conv,
num_in_filter=inplanes,
num_mid_filter=int(scale * layer_cfg[1]),
num_out_filter=int(scale * layer_cfg[2]),
act=layer_cfg[4],
stride=layer_cfg[5],
filter_size=layer_cfg[0],
use_se=layer_cfg[3],
name='conv' + str(i + 2))
inplanes = int(scale * layer_cfg[2])
i += 1
conv = self.conv_bn_layer(
input=conv,
filter_size=1,
num_filters=int(scale * cls_ch_squeeze),
stride=1,
padding=0,
num_groups=1,
if_act=True,
act='hard_swish',
name='conv_last')
conv = fluid.layers.pool2d(
input=conv, pool_type='avg', global_pooling=True, use_cudnn=False)
conv = fluid.layers.conv2d(
input=conv,
num_filters=cls_ch_expand,
filter_size=1,
stride=1,
padding=0,
act=None,
param_attr=ParamAttr(name='last_1x1_conv_weights'),
bias_attr=False)
#conv = fluid.layers.hard_swish(conv)
conv = self.hard_swish(conv)
out = fluid.layers.fc(input=conv,
size=class_dim,
act='softmax',
param_attr=ParamAttr(name='fc_weights'),
bias_attr=ParamAttr(name='fc_offset'))
return out
def conv_bn_layer(self,
input,
filter_size,
num_filters,
stride,
padding,
num_groups=1,
if_act=True,
act=None,
name=None,
use_cudnn=True):
conv = fluid.layers.conv2d(
input=input,
num_filters=num_filters,
filter_size=filter_size,
stride=stride,
padding=padding,
groups=num_groups,
act=None,
use_cudnn=use_cudnn,
param_attr=ParamAttr(name=name + '_weights'),
bias_attr=False)
bn_name = name + '_bn'
bn = fluid.layers.batch_norm(
input=conv,
param_attr=ParamAttr(
name=bn_name + "_scale",
regularizer=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=0.0)),
bias_attr=ParamAttr(
name=bn_name + "_offset",
regularizer=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=0.0)),
moving_mean_name=bn_name + '_mean',
moving_variance_name=bn_name + '_variance')
if if_act:
if act == 'relu':
bn = fluid.layers.relu(bn)
elif act == 'hard_swish':
#bn = fluid.layers.hard_swish(bn)
bn = self.hard_swish(bn)
return bn
def hard_swish(self, x):
return x * fluid.layers.relu6(x + 3) / 6.
def se_block(self, input, num_out_filter, ratio=4, name=None):
num_mid_filter = int(num_out_filter // ratio)
pool = fluid.layers.pool2d(
input=input, pool_type='avg', global_pooling=True, use_cudnn=False)
conv1 = fluid.layers.conv2d(
input=pool,
filter_size=1,
num_filters=num_mid_filter,
act='relu',
param_attr=ParamAttr(name=name + '_1_weights'),
bias_attr=ParamAttr(name=name + '_1_offset'))
conv2 = fluid.layers.conv2d(
input=conv1,
filter_size=1,
num_filters=num_out_filter,
act='hard_sigmoid',
param_attr=ParamAttr(name=name + '_2_weights'),
bias_attr=ParamAttr(name=name + '_2_offset'))
scale = fluid.layers.elementwise_mul(x=input, y=conv2, axis=0)
return scale
def residual_unit(self,
input,
num_in_filter,
num_mid_filter,
num_out_filter,
stride,
filter_size,
act=None,
use_se=False,
name=None):
input_data = input
conv0 = self.conv_bn_layer(
input=input,
filter_size=1,
num_filters=num_mid_filter,
stride=1,
padding=0,
if_act=True,
act=act,
name=name + '_expand')
conv1 = self.conv_bn_layer(
input=conv0,
filter_size=filter_size,
num_filters=num_mid_filter,
stride=stride,
padding=int((filter_size - 1) // 2),
if_act=True,
act=act,
num_groups=num_mid_filter,
use_cudnn=False,
name=name + '_depthwise')
if use_se:
with fluid.name_scope('se_block_skip'):
conv1 = self.se_block(
input=conv1,
num_out_filter=num_mid_filter,
name=name + '_se')
conv2 = self.conv_bn_layer(
input=conv1,
filter_size=1,
num_filters=num_out_filter,
stride=1,
padding=0,
if_act=False,
name=name + '_linear')
if num_in_filter != num_out_filter or stride != 1:
return conv2
else:
return fluid.layers.elementwise_add(
x=input_data, y=conv2, act=None)
def MobileNetV3_small_x0_25():
model = MobileNetV3(model_name='small', scale=0.25)
return model
def MobileNetV3_small_x0_5():
model = MobileNetV3(model_name='small', scale=0.5)
return model
def MobileNetV3_small_x0_75():
model = MobileNetV3(model_name='small', scale=0.75)
return model
def MobileNetV3_small_x1_0():
model = MobileNetV3(model_name='small', scale=1.0)
return model
def MobileNetV3_small_x1_25():
model = MobileNetV3(model_name='small', scale=1.25)
return model
def MobileNetV3_large_x0_25():
model = MobileNetV3(model_name='large', scale=0.25)
return model
def MobileNetV3_large_x0_5():
model = MobileNetV3(model_name='large', scale=0.5)
return model
def MobileNetV3_large_x0_75():
model = MobileNetV3(model_name='large', scale=0.75)
return model
def MobileNetV3_large_x1_0():
model = MobileNetV3(model_name='large', scale=1.0)
return model
def MobileNetV3_large_x1_25():
model = MobileNetV3(model_name='large', scale=1.25)
return model
def MobileNetV3_large_x2_0():
model = MobileNetV3(model_name='large', scale=2.0)
return model
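# Usage sketch (an illustrative addition, not part of the original file):
# builds a MobileNetV3-Large 1.0x classifier head on a 224x224 input with
# the Paddle 1.x static-graph API used by the layers above.
if __name__ == '__main__':
    image = fluid.data(name='image', shape=[None, 3, 224, 224], dtype='float32')
    logits = MobileNetV3_large_x1_0().net(image, class_dim=1000)
    print(logits.shape)  # [-1, 1000] softmax probabilities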
# ================================================================
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import datetime
import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.fluid.initializer import MSRA
from paddle.fluid.param_attr import ParamAttr
class SlimFaceNet():
def __init__(self, class_dim, scale=0.6, arch=None):
assert arch is not None
self.arch = arch
self.class_dim = class_dim
kernels = [3]
expansions = [2, 4, 6]
SE = [0, 1]
self.table = []
for k in kernels:
for e in expansions:
for se in SE:
self.table.append((k, e, se))
if scale == 1.0:
# 100% - channel
self.Slimfacenet_bottleneck_setting = [
# t, c , n ,s
[2, 64, 5, 2],
[4, 128, 1, 2],
[2, 128, 6, 1],
[4, 128, 1, 2],
[2, 128, 2, 1]
]
elif scale == 0.9:
# 90% - channel
self.Slimfacenet_bottleneck_setting = [
# t, c , n ,s
[2, 56, 5, 2],
[4, 116, 1, 2],
[2, 116, 6, 1],
[4, 116, 1, 2],
[2, 116, 2, 1]
]
elif scale == 0.75:
# 75% - channel
self.Slimfacenet_bottleneck_setting = [
# t, c , n ,s
[2, 48, 5, 2],
[4, 96, 1, 2],
[2, 96, 6, 1],
[4, 96, 1, 2],
[2, 96, 2, 1]
]
elif scale == 0.6:
# 60% - channel
self.Slimfacenet_bottleneck_setting = [
# t, c , n ,s
[2, 40, 5, 2],
[4, 76, 1, 2],
[2, 76, 6, 1],
[4, 76, 1, 2],
[2, 76, 2, 1]
]
else:
raise ValueError('Unsupported scale: {}'.format(scale))
self.extract_feature = True
def set_extract_feature_flag(self, flag):
self.extract_feature = flag
def net(self, input, label=None):
x = self.conv_bn_layer(
input,
filter_size=3,
num_filters=64,
stride=2,
padding=1,
num_groups=1,
if_act=True,
name='conv3x3')
x = self.conv_bn_layer(
x,
filter_size=3,
num_filters=64,
stride=1,
padding=1,
num_groups=64,
if_act=True,
name='dw_conv3x3')
in_c = 64
cnt = 0
for _exp, out_c, times, _stride in self.Slimfacenet_bottleneck_setting:
for i in range(times):
stride = _stride if i == 0 else 1
filter_size, exp, se = self.table[self.arch[cnt]]
se = False if se == 0 else True
x = self.residual_unit(
x,
num_in_filter=in_c,
num_out_filter=out_c,
stride=stride,
filter_size=filter_size,
expansion_factor=exp,
use_se=se,
name='residual_unit' + str(cnt + 1))
cnt += 1
in_c = out_c
out_c = 512
x = self.conv_bn_layer(
x,
filter_size=1,
num_filters=out_c,
stride=1,
padding=0,
num_groups=1,
if_act=True,
name='conv1x1')
x = self.conv_bn_layer(
x,
filter_size=(7, 6),
num_filters=out_c,
stride=1,
padding=0,
num_groups=out_c,
if_act=False,
name='global_dw_conv7x7')
x = fluid.layers.conv2d(
x,
num_filters=128,
filter_size=1,
stride=1,
padding=0,
groups=1,
act=None,
use_cudnn=True,
param_attr=ParamAttr(
name='linear_conv1x1_weights',
initializer=MSRA(),
regularizer=fluid.regularizer.L2Decay(4e-4)),
bias_attr=False)
bn_name = 'linear_conv1x1_bn'
x = fluid.layers.batch_norm(
x,
param_attr=ParamAttr(name=bn_name + "_scale"),
bias_attr=ParamAttr(name=bn_name + "_offset"),
moving_mean_name=bn_name + '_mean',
moving_variance_name=bn_name + '_variance')
x = fluid.layers.reshape(x, shape=[x.shape[0], x.shape[1]])
if self.extract_feature:
return x
out = self.arc_margin_product(
x, label, self.class_dim, s=32.0, m=0.50, mode=2)
softmax = fluid.layers.softmax(input=out)
cost = fluid.layers.cross_entropy(input=softmax, label=label)
loss = fluid.layers.mean(x=cost)
acc = fluid.layers.accuracy(input=out, label=label, k=1)
return loss, acc
def residual_unit(self,
input,
num_in_filter,
num_out_filter,
stride,
filter_size,
expansion_factor,
use_se=False,
name=None):
num_expfilter = int(round(num_in_filter * expansion_factor))
input_data = input
expand_conv = self.conv_bn_layer(
input=input,
filter_size=1,
num_filters=num_expfilter,
stride=1,
padding=0,
if_act=True,
name=name + '_expand')
depthwise_conv = self.conv_bn_layer(
input=expand_conv,
filter_size=filter_size,
num_filters=num_expfilter,
stride=stride,
padding=int((filter_size - 1) // 2),
if_act=True,
num_groups=num_expfilter,
use_cudnn=True,
name=name + '_depthwise')
if use_se:
depthwise_conv = self.se_block(
input=depthwise_conv,
num_out_filter=num_expfilter,
name=name + '_se')
linear_conv = self.conv_bn_layer(
input=depthwise_conv,
filter_size=1,
num_filters=num_out_filter,
stride=1,
padding=0,
if_act=False,
name=name + '_linear')
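# Use the residual shortcut only when the input and output shapes match
# (equal channel counts and stride 1); otherwise return the linear
# bottleneck output without a skip connection.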
if num_in_filter != num_out_filter or stride != 1:
return linear_conv
else:
return fluid.layers.elementwise_add(
x=input_data, y=linear_conv, act=None)
def se_block(self, input, num_out_filter, ratio=4, name=None):
num_mid_filter = int(num_out_filter // ratio)
pool = fluid.layers.pool2d(
input=input, pool_type='avg', global_pooling=True, use_cudnn=False)
conv1 = fluid.layers.conv2d(
input=pool,
filter_size=1,
num_filters=num_mid_filter,
act=None,
param_attr=ParamAttr(name=name + '_1_weights'),
bias_attr=ParamAttr(name=name + '_1_offset'))
conv1 = fluid.layers.prelu(
conv1,
mode='channel',
param_attr=ParamAttr(
name=name + '_prelu',
regularizer=fluid.regularizer.L2Decay(0.0)))
conv2 = fluid.layers.conv2d(
input=conv1,
filter_size=1,
num_filters=num_out_filter,
act='hard_sigmoid',
param_attr=ParamAttr(name=name + '_2_weights'),
bias_attr=ParamAttr(name=name + '_2_offset'))
scale = fluid.layers.elementwise_mul(x=input, y=conv2, axis=0)
return scale
def conv_bn_layer(self,
input,
filter_size,
num_filters,
stride,
padding,
num_groups=1,
if_act=True,
name=None,
use_cudnn=True):
conv = fluid.layers.conv2d(
input=input,
num_filters=num_filters,
filter_size=filter_size,
stride=stride,
padding=padding,
groups=num_groups,
act=None,
use_cudnn=use_cudnn,
param_attr=ParamAttr(
name=name + '_weights', initializer=MSRA()),
bias_attr=False)
bn_name = name + '_bn'
bn = fluid.layers.batch_norm(
input=conv,
param_attr=ParamAttr(name=bn_name + "_scale"),
bias_attr=ParamAttr(name=bn_name + "_offset"),
moving_mean_name=bn_name + '_mean',
moving_variance_name=bn_name + '_variance')
if if_act:
return fluid.layers.prelu(
bn,
mode='channel',
param_attr=ParamAttr(
name=name + '_prelu',
regularizer=fluid.regularizer.L2Decay(0.0)))
else:
return bn
def arc_margin_product(self, input, label, out_dim, s=32.0, m=0.50,
mode=2):
input_norm = fluid.layers.sqrt(
fluid.layers.reduce_sum(
fluid.layers.square(input), dim=1))
input = fluid.layers.elementwise_div(input, input_norm, axis=0)
weight = fluid.layers.create_parameter(
shape=[out_dim, input.shape[1]],
dtype='float32',
name='weight_norm',
attr=fluid.param_attr.ParamAttr(
initializer=fluid.initializer.Xavier(),
regularizer=fluid.regularizer.L2Decay(4e-4)))
weight_norm = fluid.layers.sqrt(
fluid.layers.reduce_sum(
fluid.layers.square(weight), dim=1))
weight = fluid.layers.elementwise_div(weight, weight_norm, axis=0)
weight = fluid.layers.transpose(weight, perm=[1, 0])
cosine = fluid.layers.mul(input, weight)
sine = fluid.layers.sqrt(1.0 - fluid.layers.square(cosine))
cos_m = math.cos(m)
sin_m = math.sin(m)
phi = cosine * cos_m - sine * sin_m
th = math.cos(math.pi - m)
mm = math.sin(math.pi - m) * m
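# ArcFace margin: phi = cos(theta + m) = cos(theta)*cos(m) - sin(theta)*sin(m).
# mode=1 (easy margin) applies the margin only where cos(theta) > 0; mode=2
# applies it where cos(theta) > cos(pi - m) and falls back to
# cos(theta) - m*sin(pi - m) elsewhere, keeping the target logit monotonic.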
if mode == 1:
phi = self.paddle_where_more_than(cosine, 0, phi, cosine)
elif mode == 2:
phi = self.paddle_where_more_than(cosine, th, phi, cosine - mm)
else:
pass
one_hot = fluid.one_hot(input=label, depth=out_dim)
output = fluid.layers.elementwise_mul(
one_hot, phi) + fluid.layers.elementwise_mul(
(1.0 - one_hot), cosine)
output = output * s
return output
def paddle_where_more_than(self, target, limit, x, y):
mask = fluid.layers.cast(x=(target > limit), dtype='float32')
output = fluid.layers.elementwise_mul(
mask, x) + fluid.layers.elementwise_mul((1.0 - mask), y)
return output
def SlimFaceNet_A_x0_60(class_dim=None, scale=0.6, arch=None):
scale = 0.6
arch = [0, 1, 5, 1, 0, 2, 1, 2, 0, 1, 2, 1, 1, 0, 1]
return SlimFaceNet(class_dim=class_dim, scale=scale, arch=arch)
def SlimFaceNet_B_x0_75(class_dim=None, scale=0.6, arch=None):
scale = 0.75
arch = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 3, 2, 2, 3]
return SlimFaceNet(class_dim=class_dim, scale=scale, arch=arch)
def SlimFaceNet_C_x0_75(class_dim=None, scale=0.6, arch=None):
scale = 0.75
arch = [1, 1, 2, 1, 0, 2, 1, 0, 1, 0, 1, 1, 2, 2, 3]
return SlimFaceNet(class_dim=class_dim, scale=scale, arch=arch)
if __name__ == "__main__":
x = fluid.data(name='x', shape=[-1, 3, 112, 112], dtype='float32')
print(x.shape)
model = SlimFaceNet(10000, arch=[1, 3, 3, 1, 1, 0, 0, 1, 0, 1, 1, 0, 5, 5, 3])
y = model.net(x)
# SANAS Network Architecture Search Example
This example shows how to use the network architecture search (NAS) API to find a smaller or more accurate model. It covers how to use SANAS in PaddleSlim and how to obtain a model architecture with it; for the complete example code, see sa_nas_mobilenetv2.py or block_sa_nas_mobilenetv2.py.
## Data Preparation
This example uses the cifar10 dataset by default; the paddle API downloads it automatically, so no extra preparation is needed.
## API Introduction
Please refer to the <a href='../../docs/zh_cn/api_cn/nas_api.rst'>NAS API documentation</a>.
This example uses SANAS to search the MobileNetV2 search space for a model with fewer FLOPs.
## 1 Search Space Configuration
The default search space is `MobileNetV2`; for details, see the <a href='../../docs/zh_cn/api_cn/search_space.md'>search space configuration documentation</a>.
## 2 Start the Search
```shell
CUDA_VISIBLE_DEVICES=0 python sa_nas_mobilenetv2.py
```
To run the block-level variant:
```shell
CUDA_VISIBLE_DEVICES=0 python block_sa_nas_mobilenetv2.py
```
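For orientation, the search loop inside both scripts follows the SANAS pattern sketched below; `train_and_eval` is a placeholder for building and briefly training the candidate network, and the real logic lives in sa_nas_mobilenetv2.py:
```python
from paddleslim.nas import SANAS

def train_and_eval(archs):
    """Placeholder: build programs around `archs`, train briefly,
    and return a validation score (e.g. top-1 accuracy)."""
    raise NotImplementedError

# Launch SANAS over the MobileNetV2 search space; an empty address
# starts the controller server locally on the given port.
sanas = SANAS(configs=[('MobileNetV2Space')], server_addr=("", 8881))

for step in range(300):
    archs = sanas.next_archs()[0]   # sample one candidate architecture function
    score = train_and_eval(archs)   # train the candidate briefly and evaluate it
    sanas.reward(score)             # report the candidate's reward to SANAS
```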
# RLNAS Network Architecture Search Example
This example shows how to use the RLNAS API for network architecture search. It covers how to use RLNAS in PaddleSlim; for the complete example code, see rl_nas_mobilenetv2.py or parl_nas_mobilenetv2.py.
## Data Preparation
This example uses the cifar10 dataset by default; the paddle API downloads it automatically, so no extra preparation is needed.
## API Introduction
Please refer to the <a href='../../docs/zh_cn/api_cn/nas_api.rst'>NAS API documentation</a>.
This example uses RLNAS to search the MobileNetV2 search space for a more accurate model.
## 1 Search Space Configuration
The default search space is `MobileNetV2`; for details, see the <a href='../../docs/zh_cn/api_cn/search_space.md'>search space configuration documentation</a>.
## 2 Start the Search
### 2.1 Start a search experiment with an LSTM controller over a search space built from the MobileNetV2 base architecture
```shell
CUDA_VISIBLE_DEVICES=0 python rl_nas_mobilenetv2.py
```
### 2.2 Start a search experiment with a DDPG controller over a search space built from the MobileNetV2 base architecture
```shell
CUDA_VISIBLE_DEVICES=0 python parl_nas_mobilenetv2.py
```
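For orientation, a minimal RLNAS loop with the LSTM controller looks roughly like this sketch; `train_and_eval` is again a placeholder, and the full script is rl_nas_mobilenetv2.py:
```python
import numpy as np
from paddleslim.nas import RLNAS

def train_and_eval(archs):
    """Placeholder: train the candidate briefly and return its accuracy."""
    raise NotImplementedError

# LSTM-based controller over the MobileNetV2 search space.
rl_nas = RLNAS(key='lstm', configs=[('MobileNetV2Space')],
               server_addr=("", 8881))

for step in range(100):
    archs = rl_nas.next_archs(1)[0][0]   # sample one candidate architecture
    reward = train_and_eval(archs)
    rl_nas.reward(np.float32(reward))    # feed the reward back to the controller
```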
# ... inside search_mobilenetv2_block(config, args, image_size):
exe.run(startup_program)
if args.data == 'cifar10':
train_reader = paddle.fluid.io.batch(
paddle.reader.shuffle(
paddle.dataset.cifar.train10(cycle=False), buf_size=1024),
batch_size=args.batch_size,
drop_last=True)
test_reader = paddle.fluid.io.batch(
paddle.dataset.cifar.test10(cycle=False),
batch_size=args.batch_size,
drop_last=False)
elif args.data == 'imagenet':
train_reader = paddle.fluid.io.batch(
imagenet_reader.train(),
batch_size=args.batch_size,
drop_last=True)
test_reader = paddle.fluid.io.batch(
imagenet_reader.val(),
batch_size=args.batch_size,
drop_last=False)
" if current_flops > 321208544:\n",
" continue\n",
" \n",
" train_reader = paddle.batch(paddle.reader.shuffle(paddle.dataset.cifar.train10(cycle=False), buf_size=1024),batch_size=256)\n",
" train_reader = paddle.fluid.io.batch(paddle.reader.shuffle(paddle.dataset.cifar.train10(cycle=False), buf_size=1024),batch_size=256)\n",
" train_feeder = fluid.DataFeeder(inputs, fluid.CPUPlace())\n",
" test_reader = paddle.batch(paddle.dataset.cifar.test10(cycle=False),\n",
" test_reader = paddle.fluid.io.batch(paddle.dataset.cifar.test10(cycle=False),\n",
" batch_size=256)\n",
" test_feeder = fluid.DataFeeder(inputs, fluid.CPUPlace())\n",
"\n",
},
"nbformat": 4,
"nbformat_minor": 2
}
import sys
sys.path.append('..')
import numpy as np
import argparse
import ast
import time
import logging
import paddle
import paddle.fluid as fluid
from paddle.fluid.param_attr import ParamAttr
from paddleslim.nas import RLNAS
from paddleslim.common import get_logger
from optimizer import create_optimizer
import imagenet_reader
_logger = get_logger(__name__, level=logging.INFO)
def create_data_loader(image_shape):
data_shape = [None] + image_shape
data = fluid.data(name='data', shape=data_shape, dtype='float32')
label = fluid.data(name='label', shape=[None, 1], dtype='int64')
data_loader = fluid.io.DataLoader.from_generator(
feed_list=[data, label],
capacity=1024,
use_double_buffer=True,
iterable=True)
return data_loader, data, label
def build_program(main_program,
startup_program,
image_shape,
archs,
args,
is_test=False):
with fluid.program_guard(main_program, startup_program):
with fluid.unique_name.guard():
data_loader, data, label = create_data_loader(image_shape)
output = archs(data)
output = fluid.layers.fc(input=output, size=args.class_dim)
softmax_out = fluid.layers.softmax(input=output, use_cudnn=False)
cost = fluid.layers.cross_entropy(input=softmax_out, label=label)
avg_cost = fluid.layers.mean(cost)
acc_top1 = fluid.layers.accuracy(
input=softmax_out, label=label, k=1)
acc_top5 = fluid.layers.accuracy(
input=softmax_out, label=label, k=5)
if not is_test:
optimizer = create_optimizer(args)
optimizer.minimize(avg_cost)
return data_loader, avg_cost, acc_top1, acc_top5
def search_mobilenetv2(config, args, image_size, is_server=True):
if is_server:
### start a server and a client
rl_nas = RLNAS(
key='ddpg',
configs=config,
is_sync=False,
obs_dim=26, ### step + length_of_token
server_addr=(args.server_address, args.port))
else:
### start a client
rl_nas = RLNAS(
key='ddpg',
configs=config,
is_sync=False,
obs_dim=26,
server_addr=(args.server_address, args.port),
is_server=False)
image_shape = [3, image_size, image_size]
for step in range(args.search_steps):
if step == 0:
action_prev = [1. for _ in rl_nas.range_tables]
else:
action_prev = rl_nas.tokens[0]
obs = [step]
obs.extend(action_prev)
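# The DDPG observation is the current step index followed by the previous
# action tokens, so obs_dim above (26) must equal 1 + len(rl_nas.range_tables).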
archs = rl_nas.next_archs(obs=obs)[0][0]
train_program = fluid.Program()
test_program = fluid.Program()
startup_program = fluid.Program()
train_loader, avg_cost, acc_top1, acc_top5 = build_program(
train_program, startup_program, image_shape, archs, args)
test_loader, test_avg_cost, test_acc_top1, test_acc_top5 = build_program(
test_program,
startup_program,
image_shape,
archs,
args,
is_test=True)
test_program = test_program.clone(for_test=True)
place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(startup_program)
if args.data == 'cifar10':
train_reader = paddle.fluid.io.batch(
paddle.reader.shuffle(
paddle.dataset.cifar.train10(cycle=False), buf_size=1024),
batch_size=args.batch_size,
drop_last=True)
test_reader = paddle.fluid.io.batch(
paddle.dataset.cifar.test10(cycle=False),
batch_size=args.batch_size,
drop_last=False)
elif args.data == 'imagenet':
train_reader = paddle.fluid.io.batch(
imagenet_reader.train(),
batch_size=args.batch_size,
drop_last=True)
test_reader = paddle.fluid.io.batch(
imagenet_reader.val(),
batch_size=args.batch_size,
drop_last=False)
train_loader.set_sample_list_generator(
train_reader,
places=fluid.cuda_places() if args.use_gpu else fluid.cpu_places())
test_loader.set_sample_list_generator(test_reader, places=place)
build_strategy = fluid.BuildStrategy()
train_compiled_program = fluid.CompiledProgram(
train_program).with_data_parallel(
loss_name=avg_cost.name, build_strategy=build_strategy)
for epoch_id in range(args.retain_epoch):
for batch_id, data in enumerate(train_loader()):
fetches = [avg_cost.name]
s_time = time.time()
outs = exe.run(train_compiled_program,
feed=data,
fetch_list=fetches)[0]
batch_time = time.time() - s_time
if batch_id % 10 == 0:
_logger.info(
'TRAIN: steps: {}, epoch: {}, batch: {}, cost: {}, batch_time: {:.4f}s'.
format(step, epoch_id, batch_id, outs[0], batch_time))
reward = []
for batch_id, data in enumerate(test_loader()):
test_fetches = [
test_avg_cost.name, test_acc_top1.name, test_acc_top5.name
]
batch_reward = exe.run(test_program,
feed=data,
fetch_list=test_fetches)
reward_avg = np.mean(np.array(batch_reward), axis=1)
reward.append(reward_avg)
_logger.info(
'TEST: step: {}, batch: {}, avg_cost: {}, acc_top1: {}, acc_top5: {}'.
format(step, batch_id, batch_reward[0], batch_reward[1],
batch_reward[2]))
finally_reward = np.mean(np.array(reward), axis=0)
_logger.info(
'FINAL TEST: avg_cost: {}, acc_top1: {}, acc_top5: {}'.format(
finally_reward[0], finally_reward[1], finally_reward[2]))
obs = np.expand_dims(obs, axis=0).astype('float32')
actions = rl_nas.tokens
obs_next = [step + 1]
obs_next.extend(actions[0])
obs_next = np.expand_dims(obs_next, axis=0).astype('float32')
if step == args.search_steps - 1:
terminal = np.expand_dims([True], axis=0).astype(np.bool)
else:
terminal = np.expand_dims([False], axis=0).astype(np.bool)
rl_nas.reward(
np.expand_dims(
np.float32(finally_reward[1]), axis=0),
obs=obs,
actions=actions.astype('float32'),
obs_next=obs_next,
terminal=terminal)
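# Each search step hands the DDPG controller one complete transition:
# (obs, actions, reward, obs_next, terminal).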
if step == 2:
sys.exit(0)
if __name__ == '__main__':
parser = argparse.ArgumentParser(
description='RL NAS MobileNetV2 cifar10 argparse')
parser.add_argument(
'--use_gpu',
type=ast.literal_eval,
default=True,
help='Whether to use GPU in train/test model.')
parser.add_argument(
'--batch_size', type=int, default=256, help='batch size.')
parser.add_argument(
'--class_dim', type=int, default=10, help='classify number.')
parser.add_argument(
'--data',
type=str,
default='cifar10',
choices=['cifar10', 'imagenet'],
help='dataset to use.')
parser.add_argument(
'--is_server',
type=ast.literal_eval,
default=True,
help='Whether to start a server.')
parser.add_argument(
'--search_steps',
type=int,
default=100,
help='number of search steps.')
parser.add_argument(
'--server_address', type=str, default="", help='server ip.')
parser.add_argument('--port', type=int, default=8881, help='server port')
parser.add_argument(
'--retain_epoch', type=int, default=5, help='epoch for each token.')
parser.add_argument('--lr', type=float, default=0.1, help='learning rate.')
args = parser.parse_args()
print(args)
if args.data == 'cifar10':
image_size = 32
block_num = 3
elif args.data == 'imagenet':
image_size = 224
block_num = 6
else:
raise NotImplementedError(
'data must in [cifar10, imagenet], but received: {}'.format(
args.data))
config = [('MobileNetV2Space')]
search_mobilenetv2(config, args, image_size, is_server=args.is_server)
# ... inside search_mobilenetv2(config, args, image_size, is_server=True):
is_sync=False,
server_addr=(args.server_address, args.port),
controller_batch_size=1,
controller_decay_steps=1000,
controller_decay_rate=0.8,
lstm_num_layers=1,
hidden_size=10,
temperature=1.0)
else:
### start a client
# ...
lstm_num_layers=1,
hidden_size=10,
temperature=1.0,
controller_batch_size=1,
controller_decay_steps=1000,
controller_decay_rate=0.8,
is_server=False)
image_shape = [3, image_size, image_size]
# ...
exe.run(startup_program)
if args.data == 'cifar10':
train_reader = paddle.fluid.io.batch(
paddle.reader.shuffle(
paddle.dataset.cifar.train10(cycle=False), buf_size=1024),
batch_size=args.batch_size,
drop_last=True)
test_reader = paddle.fluid.io.batch(
paddle.dataset.cifar.test10(cycle=False),
batch_size=args.batch_size,
drop_last=False)
elif args.data == 'imagenet':
train_reader = paddle.fluid.io.batch(
imagenet_reader.train(),
batch_size=args.batch_size,
drop_last=True)
test_reader = paddle.fluid.io.batch(
imagenet_reader.val(),
batch_size=args.batch_size,
drop_last=False)
# ... inside the __main__ argument parser:
parser.add_argument(
'--batch_size', type=int, default=256, help='batch size.')
parser.add_argument(
'--class_dim', type=int, default=10, help='classify number.')
parser.add_argument(
'--data',
type=str,
# ... inside search_mobilenetv2(config, args, image_size, is_server=True):
exe.run(startup_program)
if args.data == 'cifar10':
train_reader = paddle.fluid.io.batch(
paddle.reader.shuffle(
paddle.dataset.cifar.train10(cycle=False), buf_size=1024),
batch_size=args.batch_size,
drop_last=True)
test_reader = paddle.fluid.io.batch(
paddle.dataset.cifar.test10(cycle=False),
batch_size=args.batch_size,
drop_last=False)
elif args.data == 'imagenet':
train_reader = paddle.fluid.io.batch(
imagenet_reader.train(),
batch_size=args.batch_size,
drop_last=True)
test_reader = paddle.fluid.io.batch(
imagenet_reader.val(),
batch_size=args.batch_size,
drop_last=False)
# ... inside test_search_result(tokens, image_size, args, config):
image_shape = [3, image_size, image_size]
archs = sa_nas.tokens2arch(tokens)[0]
train_program = fluid.Program()
test_program = fluid.Program()
# ...
exe.run(startup_program)
if args.data == 'cifar10':
train_reader = paddle.fluid.io.batch(
paddle.reader.shuffle(
paddle.dataset.cifar.train10(cycle=False), buf_size=1024),
batch_size=args.batch_size,
drop_last=True)
test_reader = paddle.fluid.io.batch(
paddle.dataset.cifar.test10(cycle=False),
batch_size=args.batch_size,
drop_last=False)
elif args.data == 'imagenet':
train_reader = paddle.fluid.io.batch(
imagenet_reader.train(),
batch_size=args.batch_size,
drop_last=True)
test_reader = paddle.fluid.io.batch(
imagenet_reader.val(), batch_size=args.batch_size, drop_last=False)
train_loader.set_sample_list_generator(
# ... inside the __main__ argument parser:
parser.add_argument(
'--batch_size', type=int, default=256, help='batch size.')
parser.add_argument(
'--class_dim', type=int, default=10, help='classify number.')
parser.add_argument(
'--data',
type=str,
# ... inside test_mnist(model, tokens=None):
acc_set = []
avg_loss_set = []
batch_size = 64
test_reader = paddle.fluid.io.batch(
paddle.dataset.mnist.test(), batch_size=batch_size, drop_last=True)
for batch_id, data in enumerate(test_reader()):
dy_x_data = np.array([x[0].reshape(1, 28, 28)
# ... inside train_mnist(args, model, tokens=None):
adam = AdamOptimizer(
learning_rate=0.001, parameter_list=model.parameters())
train_reader = paddle.fluid.io.batch(
paddle.dataset.mnist.train(), batch_size=BATCH_SIZE, drop_last=True)
if args.use_data_parallel:
train_reader = fluid.contrib.reader.distributed_batch_reader(
# Distillation example: Chinese lexical analysis
This example demonstrates how to use the Pantheon framework for online distillation of a Chinese lexical analysis model on a sample dataset. Results of large-scale online distillation are shown below:
| model | Precision | Recall | F1-score|
| ------ | ------ | ------ | ------ |
| BiGRU | 89.2 | 89.4 | 89.3 |
| BERT fine-tuned | 90.2 | 90.4 | 90.3 |
| ERNIE fine-tuned | 91.7 | 91.7 | 91.7 |
| DistillBiGRU | 90.20 | 90.52 | 90.36 |
BiGRU trains a BiGRU-based LAC model from scratch; BERT fine-tuned fine-tunes the LAC task on the BERT base model; ERNIE fine-tuned fine-tunes the LAC task on the ERNIE base model; DistillBiGRU is trained through large-scale online distillation with ERNIE fine-tuned as the teacher model.
## Introduction
Lexical Analysis of Chinese, or LAC for short, is a lexical analysis model that completes the tasks of Chinese word segmentation, part-of-speech tagging, and named entity recognition in a single model. We evaluate word segmentation, part-of-speech tagging, and named entity recognition jointly on a self-built dataset. We use the fine-tuned [ERNIE](https://github.com/PaddlePaddle/LARK/tree/develop/ERNIE) model as the teacher model and a GRU network as the student model, connecting them through the Pantheon framework for online distillation.
#### 1. Download the training dataset
Download the data set file, and after decompression, a `./data/` folder will be created.
```bash
python downloads.py dataset
```
#### 2. Download the Teacher model
```bash
# download ERNIE finetuned model
python downloads.py finetuned
python downloads.py conf
```
#### 3. Distill the Student model
```bash
# start teacher service
bash run_teacher.sh
# start student service
bash run_student.sh
```
> If you want to learn more about LAC, you can refer to this repo: https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/lexical_analysis
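Under the hood, run_teacher.sh and run_student.sh wrap PaddleSlim's Pantheon API. A rough sketch of the two sides follows; names such as `feed_vars`, `crf_decode`, `test_program`, `batch_generator` and `exe` stand in for objects that teacher_ernie.py and train_student.py build around the models:
```python
from paddleslim.pantheon import Teacher, Student

# Teacher side (inside teacher_ernie.py, roughly): serve knowledge over TCP.
#
# teacher = Teacher(out_path=None, out_port=5002)
# teacher.start()
# teacher.start_knowledge_service(
#     feed_list=feed_vars,                    # the ERNIE model's input variables
#     schema={"crf_decode": crf_decode},      # knowledge to transfer
#     program=test_program,
#     reader_config={"batch_generator": batch_generator},
#     exe=exe)

# Student side (inside train_student.py, roughly): consume that knowledge.
student = Student()
student.register_teacher(in_address="127.0.0.1:5002")
student.start()
knowledge_gen = student.get_knowledge_generator(batch_size=32, drop_last=False)
for knowledge in knowledge_gen():
    pass  # mix the teacher's CRF outputs into the student's loss here
```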
from . import teacher
from . import student
from .teacher import Teacher
from .student import Student
__all__ = teacher.__all__ + student.__all__
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Define functions that create the lexical analysis model and its data readers
"""
import sys
import os
import math
import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.fluid.initializer import NormalInitializer
from reader import Dataset
from ernie_reader import SequenceLabelReader
from models.sequence_labeling import nets
from models.representation.ernie import ernie_encoder, ernie_pyreader
def create_model(args, vocab_size, num_labels, mode='train'):
"""create lac model"""
# model's input data
words = fluid.data(name='words', shape=[-1, 1], dtype='int64', lod_level=1)
targets = fluid.data(
name='targets', shape=[-1, 1], dtype='int64', lod_level=1)
if mode == "train":
print("create model mode: ", mode)
teacher_crf_decode = fluid.data(
name='teacher_crf_decode', shape=[-1, 1], dtype='float32', lod_level=1)
else:
print("create model mode: ", mode)
teacher_crf_decode = None
feed_list = [words, targets]
if teacher_crf_decode is not None:
feed_list.append(teacher_crf_decode)
pyreader = fluid.io.DataLoader.from_generator(
feed_list=feed_list,
capacity=200,
use_double_buffer=True,
iterable=False)
# for test or train process
avg_cost, crf_avg_cost, teacher_cost, crf_decode = nets.lex_net(
words, args, vocab_size, num_labels, teacher_crf_decode,
for_infer=False, target=targets)
(precision, recall, f1_score, num_infer_chunks, num_label_chunks,
num_correct_chunks) = fluid.layers.chunk_eval(
input=crf_decode,
label=targets,
chunk_scheme="IOB",
num_chunk_types=int(math.ceil((num_labels - 1) / 2.0)))
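# Labels come in B-/I- pairs plus a single "O" label, so chunk_eval sees
# (num_labels - 1) / 2 chunk types.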
chunk_evaluator = fluid.metrics.ChunkEvaluator()
chunk_evaluator.reset()
ret = {
"pyreader": pyreader,
"words": words,
"targets": targets,
"avg_cost": avg_cost,
"crf_avg_cost": crf_avg_cost,
"teacher_cost": teacher_cost,
"crf_decode": crf_decode,
"precision": precision,
"recall": recall,
"f1_score": f1_score,
"chunk_evaluator": chunk_evaluator,
"num_infer_chunks": num_infer_chunks,
"num_label_chunks": num_label_chunks,
"num_correct_chunks": num_correct_chunks
}
return ret
def create_lexnet_data_generator(args,
reader,
file_name,
place,
mode='train'):
if mode == 'train':
def wrapper():
batch_words, batch_labels, batch_emissions, seq_lens = [], [], None, []
emi_lens = []
for epoch in range(args.epoch):
print("data epoch: {}".format(epoch))
for instance in reader.file_reader(file_name, mode="train")():
words, labels, emission = instance
if len(seq_lens) < args.batch_size:
batch_words.append(words)
batch_labels.append(labels)
if batch_emissions is not None:
batch_emissions = np.concatenate((batch_emissions, emission))
else:
batch_emissions = emission
seq_lens.append(len(words))
emi_lens.append(emission.shape[0])
if len(seq_lens) == args.batch_size:
#print("batch words len", [len(seq) for seq in batch_words])
#print("batch labels len", [len(seq) for seq in batch_labels])
#print("emi lens:", emi_lens)
#print("emission first dim:", batch_emissions.shape[0])
#print("reduced seq_lens:", sum(seq_lens))
t_words = fluid.create_lod_tensor(batch_words, [seq_lens], place)
t_labels = fluid.create_lod_tensor(batch_labels, [seq_lens], place)
t_emissions = fluid.create_lod_tensor(batch_emissions, [seq_lens], place)
yield t_words, t_labels, t_emissions
batch_words, batch_labels, batch_emissions, seq_lens = [], [], None, []
emi_lens = []
if len(seq_lens) > 0:
t_words = fluid.create_lod_tensor(batch_words, [seq_lens], place)
t_labels = fluid.create_lod_tensor(batch_labels, [seq_lens], place)
t_emissions = fluid.create_lod_tensor(batch_emissions, [seq_lens], place)
yield t_words, t_labels, t_emissions
batch_words, batch_labels, batch_emissions, seq_lens = [], [], None, []
else:
def wrapper():
batch_words, batch_labels, seq_lens = [], [], []
for instance in reader.file_reader(file_name, mode="test")():
words, labels = instance
if len(seq_lens) < args.batch_size:
batch_words.append(words)
batch_labels.append(labels)
seq_lens.append(len(words))
if len(seq_lens) == args.batch_size:
t_words = fluid.create_lod_tensor(batch_words, [seq_lens], place)
t_labels = fluid.create_lod_tensor(batch_labels, [seq_lens], place)
yield t_words, t_labels
batch_words, batch_labels, seq_lens = [], [], []
if len(seq_lens) > 0:
t_words = fluid.create_lod_tensor(batch_words, [seq_lens], place)
t_labels = fluid.create_lod_tensor(batch_labels, [seq_lens], place)
yield t_words, t_labels
batch_words, batch_labels, seq_lens = [], [], []
return wrapper
def create_pyreader(args,
file_name,
feed_list,
place,
model='lac',
reader=None,
return_reader=False,
mode='train'):
reader = SequenceLabelReader(
vocab_path=args.vocab_path,
label_map_config=args.label_map_config,
max_seq_len=args.max_seq_len,
do_lower_case=args.do_lower_case,
random_seed=args.random_seed)
return reader.data_generator(
file_name, args.batch_size, args.epoch, shuffle=False, phase=mode)
def create_ernie_model(args, ernie_config):
"""
Create Model for LAC based on ERNIE encoder
"""
# ERNIE's input data
src_ids = fluid.data(
name='src_ids', shape=[-1, args.max_seq_len, 1], dtype='int64')
sent_ids = fluid.data(
name='sent_ids', shape=[-1, args.max_seq_len, 1], dtype='int64')
pos_ids = fluid.data(
name='pos_ids', shape=[-1, args.max_seq_len, 1], dtype='int64')
input_mask = fluid.data(
name='input_mask', shape=[-1, args.max_seq_len, 1], dtype='float32')
padded_labels = fluid.data(
name='padded_labels', shape=[-1, args.max_seq_len, 1], dtype='int64')
seq_lens = fluid.data(
name='seq_lens', shape=[-1], dtype='int64', lod_level=0)
squeeze_labels = fluid.layers.squeeze(padded_labels, axes=[-1])
# ernie_pyreader
ernie_inputs = {
"src_ids": src_ids,
"sent_ids": sent_ids,
"pos_ids": pos_ids,
"input_mask": input_mask,
"seq_lens": seq_lens
}
embeddings = ernie_encoder(ernie_inputs, ernie_config=ernie_config)
padded_token_embeddings = embeddings["padded_token_embeddings"]
emission = fluid.layers.fc(
size=args.num_labels,
input=padded_token_embeddings,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-args.init_bound, high=args.init_bound),
regularizer=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=1e-4)),
num_flatten_dims=2)
crf_cost = fluid.layers.linear_chain_crf(
input=emission,
label=padded_labels,
param_attr=fluid.ParamAttr(
name='crfw', learning_rate=args.crf_learning_rate),
length=seq_lens)
avg_cost = fluid.layers.mean(x=crf_cost)
crf_decode = fluid.layers.crf_decoding(
input=emission,
param_attr=fluid.ParamAttr(name='crfw'),
length=seq_lens)
(precision, recall, f1_score, num_infer_chunks, num_label_chunks,
num_correct_chunks) = fluid.layers.chunk_eval(
input=crf_decode,
label=squeeze_labels,
chunk_scheme="IOB",
num_chunk_types=int(math.ceil((args.num_labels - 1) / 2.0)),
seq_length=seq_lens)
chunk_evaluator = fluid.metrics.ChunkEvaluator()
chunk_evaluator.reset()
ret = {
"feed_list":
[src_ids, sent_ids, pos_ids, input_mask, padded_labels, seq_lens],
"words": src_ids,
"pos_ids":pos_ids,
"sent_ids":sent_ids,
"input_mask":input_mask,
"labels": padded_labels,
"seq_lens": seq_lens,
"avg_cost": avg_cost,
"crf_decode": crf_decode,
"precision": precision,
"recall": recall,
"f1_score": f1_score,
"chunk_evaluator": chunk_evaluator,
"num_infer_chunks": num_infer_chunks,
"num_label_chunks": num_label_chunks,
"num_correct_chunks": num_correct_chunks,
"emission":emission,
"alpha": None
}
return ret
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Download script, download dataset and pretrain models.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import io
import os
import sys
import time
import hashlib
import tarfile
import requests
FILE_INFO = {
'BASE_URL': 'https://baidu-nlp.bj.bcebos.com/',
'DATA': {
'name': 'lexical_analysis-dataset-2.0.0.tar.gz',
'md5': '71e4a9a36d0f0177929a1bccedca7dba'
},
'FINETURN_MODEL': {
'name': 'lexical_analysis_finetuned-1.0.0.tar.gz',
'md5': "ee2c7614b06dcfd89561fbbdaac34342"
},
'CONF': {
'name': 'conf.tar.gz',
'md5': "7a0fe28db46db496fff4361eebaa6515",
'url': 'https://paddlemodels.bj.bcebos.com/PaddleSlim/pantheon/lexical_analysis/',
}
}
def usage():
desc = ("\nDownload datasets and pretrained models for LAC.\n"
"Usage:\n"
" 1. python download.py all\n"
" 2. python download.py dataset\n"
" 3. python download.py finetuned\n"
" 4. python download.py conf\n")
print(desc)
def md5file(fname):
hash_md5 = hashlib.md5()
with io.open(fname, "rb") as fin:
for chunk in iter(lambda: fin.read(4096), b""):
hash_md5.update(chunk)
return hash_md5.hexdigest()
def extract(fname, dir_path):
"""
Extract tar.gz file
"""
try:
tar = tarfile.open(fname, "r")
file_names = tar.getnames()
for file_name in file_names:
tar.extract(file_name, dir_path)
print(file_name)
tar.close()
except Exception as e:
raise e
def _download(url, filename, md5sum):
"""
Download file and check md5
"""
retry = 0
retry_limit = 3
chunk_size = 4096
while not (os.path.exists(filename) and md5file(filename) == md5sum):
if retry < retry_limit:
retry += 1
else:
raise RuntimeError(
"Cannot download dataset ({0}) with retry {1} times.".format(
url, retry_limit))
try:
start = time.time()
size = 0
res = requests.get(url, stream=True)
filesize = int(res.headers['content-length'])
if res.status_code == 200:
print("[Filesize]: %0.2f MB" % (filesize / 1024 / 1024))
# save by chunk
with io.open(filename, "wb") as fout:
for chunk in res.iter_content(chunk_size=chunk_size):
if chunk:
fout.write(chunk)
size += len(chunk)
pr = '>' * int(size * 50 / filesize)
print(
'\r[Process ]: %s%.2f%%' %
(pr, float(size / filesize * 100)),
end='')
end = time.time()
print("\n[CostTime]: %.2f s" % (end - start))
except Exception as e:
print(e)
def download(name, dir_path):
if name == 'CONF':
url = FILE_INFO[name]['url'] + FILE_INFO[name]['name']
else:
url = FILE_INFO['BASE_URL'] + FILE_INFO[name]['name']
file_path = os.path.join(dir_path, FILE_INFO[name]['name'])
if not os.path.exists(dir_path):
os.makedirs(dir_path)
# download data
print("Downloading : %s" % name)
_download(url, file_path, FILE_INFO[name]['md5'])
# extract data
print("Extracting : %s" % file_path)
extract(file_path, dir_path)
os.remove(file_path)
if __name__ == '__main__':
if len(sys.argv) != 2:
usage()
sys.exit(1)
pwd = os.path.join(os.path.dirname(__file__), './')
ernie_dir = os.path.join(os.path.dirname(__file__), './pretrained')
if sys.argv[1] == 'all':
download('DATA', pwd)
download('FINETURN_MODEL', pwd)
download('CONF', pwd)
if sys.argv[1] == "dataset":
download('DATA', pwd)
elif sys.argv[1] == "finetuned":
download('FINETURN_MODEL', pwd)
elif sys.argv[1] == "conf":
download('CONF', pwd)
else:
usage()
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This module provides reader for ernie model
"""
import sys
from collections import namedtuple
import numpy as np
sys.path.append("..")
from preprocess.ernie.task_reader import BaseReader, tokenization
def pad_batch_data(insts,
pad_idx=0,
max_len=128,
return_pos=False,
return_input_mask=False,
return_max_len=False,
return_num_token=False,
return_seq_lens=False):
"""
Pad the instances to the max sequence length in batch, and generate the
corresponding position data and input mask.
"""
return_list = []
# Pad every instance to the fixed max_len rather than the batch maximum.
# Any token included in dict can be used to pad, since the paddings' loss
# will be masked out by weights and make no effect on parameter gradients.
inst_data = np.array(
[inst + list([pad_idx] * (max_len - len(inst))) for inst in insts])
return_list += [inst_data.astype("int64").reshape([-1, max_len, 1])]
# position data
if return_pos:
inst_pos = np.array([
list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst))
for inst in insts
])
return_list += [inst_pos.astype("int64").reshape([-1, max_len, 1])]
if return_input_mask:
# This is used to avoid attention on paddings.
input_mask_data = np.array([[1] * len(inst) + [0] *
(max_len - len(inst)) for inst in insts])
input_mask_data = np.expand_dims(input_mask_data, axis=-1)
return_list += [input_mask_data.astype("float32")]
if return_max_len:
return_list += [max_len]
if return_num_token:
num_token = 0
for inst in insts:
num_token += len(inst)
return_list += [num_token]
if return_seq_lens:
seq_lens = np.array([len(inst) for inst in insts])
return_list += [seq_lens.astype("int64").reshape([-1])]
return return_list if len(return_list) > 1 else return_list[0]
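# Illustrative example: for insts=[[5, 6, 7], [8, 9]], max_len=4 and
# return_seq_lens=True, the call returns an int64 array of shape [2, 4, 1]
# holding [[5, 6, 7, 0], [8, 9, 0, 0]] plus seq_lens [3, 2].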
class SequenceLabelReader(BaseReader):
"""SequenceLabelReader"""
def _pad_batch_records(self, batch_records):
batch_token_ids = [record.token_ids for record in batch_records]
batch_text_type_ids = [record.text_type_ids for record in batch_records]
batch_position_ids = [record.position_ids for record in batch_records]
batch_label_ids = [record.label_ids for record in batch_records]
# padding
padded_token_ids, input_mask, batch_seq_lens = pad_batch_data(
batch_token_ids,
max_len=self.max_seq_len,
pad_idx=self.pad_id,
return_input_mask=True,
return_seq_lens=True)
padded_text_type_ids = pad_batch_data(
batch_text_type_ids, max_len=self.max_seq_len, pad_idx=self.pad_id)
padded_position_ids = pad_batch_data(
batch_position_ids, max_len=self.max_seq_len, pad_idx=self.pad_id)
padded_label_ids = pad_batch_data(
batch_label_ids,
max_len=self.max_seq_len,
pad_idx=len(self.label_map) - 1)
return_list = [
padded_token_ids, padded_text_type_ids, padded_position_ids,
input_mask, padded_label_ids, batch_seq_lens
]
return return_list
def _reseg_token_label(self, tokens, labels, tokenizer):
assert len(tokens) == len(labels)
ret_tokens = []
ret_labels = []
for token, label in zip(tokens, labels):
sub_token = tokenizer.tokenize(token)
if len(sub_token) == 0:
continue
ret_tokens.extend(sub_token)
ret_labels.append(label)
if len(sub_token) < 2:
continue
sub_label = label
if label.startswith("B-"):
sub_label = "I-" + label[2:]
ret_labels.extend([sub_label] * (len(sub_token) - 1))
assert len(ret_tokens) == len(ret_labels)
return ret_tokens, ret_labels
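# Illustrative example: a token "playing" labeled "B-V" that the tokenizer
# splits into ["play", "##ing"] yields labels ["B-V", "I-V"], one label per
# sub-token.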
def _convert_example_to_record(self, example, max_seq_length, tokenizer):
# Tokens and labels are separated by the "\2" control character in the
# ERNIE sequence-labeling data format (the character is invisible in
# most renderings).
tokens = tokenization.convert_to_unicode(example.text_a).split(u"\2")
labels = tokenization.convert_to_unicode(example.label).split(u"\2")
tokens, labels = self._reseg_token_label(tokens, labels, tokenizer)
if len(tokens) > max_seq_length - 2:
tokens = tokens[0:(max_seq_length - 2)]
labels = labels[0:(max_seq_length - 2)]
tokens = ["[CLS]"] + tokens + ["[SEP]"]
token_ids = tokenizer.convert_tokens_to_ids(tokens)
position_ids = list(range(len(token_ids)))
text_type_ids = [0] * len(token_ids)
no_entity_id = len(self.label_map) - 1
labels = [
label if label in self.label_map else u"O" for label in labels
]
label_ids = [no_entity_id] + [
self.label_map[label] for label in labels
] + [no_entity_id]
Record = namedtuple(
'Record',
['token_ids', 'text_type_ids', 'position_ids', 'label_ids'])
record = Record(
token_ids=token_ids,
text_type_ids=text_type_ids,
position_ids=position_ids,
label_ids=label_ids)
return record
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import time
import sys
import paddle.fluid as fluid
import paddle
import model_utils
import reader
import creator
sys.path.append('models/')
from model_check import check_cuda
from model_check import check_version
parser = argparse.ArgumentParser(__doc__)
# 1. model parameters
model_g = model_utils.ArgumentGroup(parser, "model", "model configuration")
model_g.add_arg("word_emb_dim", int, 128,
"The dimension in which a word is embedded.")
model_g.add_arg("grnn_hidden_dim", int, 128,
"The number of hidden nodes in the GRNN layer.")
model_g.add_arg("bigru_num", int, 2,
"The number of bi_gru layers in the network.")
model_g.add_arg("use_cuda", bool, False, "If set, use GPU for training.")
# 2. data parameters
data_g = model_utils.ArgumentGroup(parser, "data", "data paths")
data_g.add_arg("word_dict_path", str, "./conf/word.dic",
"The path of the word dictionary.")
data_g.add_arg("label_dict_path", str, "./conf/tag.dic",
"The path of the label dictionary.")
data_g.add_arg("word_rep_dict_path", str, "./conf/q2b.dic",
"The path of the word replacement Dictionary.")
data_g.add_arg("test_data", str, "./data/test.tsv",
"The folder where the training data is located.")
data_g.add_arg("init_checkpoint", str, "./model_baseline", "Path to init model")
data_g.add_arg(
"batch_size", int, 200,
"The number of sequences contained in a mini-batch, "
"or the maximum number of tokens (include paddings) contained in a mini-batch."
)
def do_eval(args):
print('do_eval...........')
dataset = reader.Dataset(args)
test_program = fluid.Program()
with fluid.program_guard(test_program, fluid.default_startup_program()):
with fluid.unique_name.guard():
test_ret = creator.create_model(
args, dataset.vocab_size, dataset.num_labels, mode='test')
test_program = test_program.clone(for_test=True)
# init executor
if args.use_cuda:
place = fluid.CUDAPlace(int(os.getenv('FLAGS_selected_gpus', '0')))
else:
place = fluid.CPUPlace()
pyreader = creator.create_pyreader(
args,
file_name=args.test_data,
feed_list=test_ret['feed_list'],
place=place,
model='lac',
reader=dataset,
mode='test')
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
# load model
model_utils.init_checkpoint(exe, args.init_checkpoint, test_program)
test_process(
exe=exe, program=test_program, reader=pyreader, test_ret=test_ret)
def test_process(exe, program, reader, test_ret):
"""
Run the evaluation process and print precision/recall/F1.
:param exe: the fluid Executor
:param program: the inference program
:param reader: data reader
:param test_ret: dict of metric variables built by creator.create_model
"""
print('test_process...........')
test_ret["chunk_evaluator"].reset()
start_time = time.time()
reader.start()
while True:
try:
nums_infer, nums_label, nums_correct = exe.run(
program,
fetch_list=[
test_ret["num_infer_chunks"],
test_ret["num_label_chunks"],
test_ret["num_correct_chunks"],
])
test_ret["chunk_evaluator"].update(nums_infer, nums_label, nums_correct)
except fluid.core.EOFException:
reader.reset()
break
precision, recall, f1 = test_ret["chunk_evaluator"].eval()
end_time = time.time()
print("[test] P: %.5f, R: %.5f, F1: %.5f, elapsed time: %.3f s" %
(precision, recall, f1, end_time - start_time))
if __name__ == '__main__':
args = parser.parse_args()
check_cuda(args.use_cuda)
check_version()
do_eval(args)
#encoding=utf8
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
import paddle
import paddle.fluid as fluid
def check_cuda(use_cuda, err = \
"\nYou can not set use_cuda = True in the model because you are using paddlepaddle-cpu.\n \
Please: 1. Install paddlepaddle-gpu to run your models on GPU or 2. Set use_cuda = False to run models on CPU.\n"
):
"""
Log error and exit when set use_gpu=true in paddlepaddle
cpu version.
"""
try:
if use_cuda and not fluid.is_compiled_with_cuda():
print(err)
sys.exit(1)
except Exception as e:
pass
def check_version():
"""
Log error and exit when the installed version of paddlepaddle is
not satisfied.
"""
err = "PaddlePaddle version 1.6 or higher is required, " \
"or a suitable develop version is satisfied as well. \n" \
"Please make sure the version is good with your code." \
try:
fluid.require_version('1.6.0')
except Exception as e:
print(err)
sys.exit(1)
if __name__ == "__main__":
check_cuda(True)
check_cuda(False)
check_cuda(True, "This is only for testing.")
#!/bin/bash
export CUDA_VISIBLE_DEVICES=5,6
python -u train_student.py \
--train_data ./data/train.tsv \
--test_data ./data/test.tsv \
--model_save_dir ./teacher_ernie_init_lac_1gru_emb128 \
--validation_steps 1000 \
--save_steps 1000 \
--print_steps 100 \
--batch_size 32 \
--epoch 10 \
--traindata_shuffle_buffer 20000 \
--word_emb_dim 128 \
--grnn_hidden_dim 128 \
--bigru_num 1 \
--base_learning_rate 1e-3 \
--emb_learning_rate 2 \
--crf_learning_rate 0.2 \
--word_dict_path ./conf/word.dic \
--label_dict_path ./conf/tag.dic \
--word_rep_dict_path ./conf/q2b.dic \
--enable_ce false \
--use_cuda true \
--in_address "127.0.0.1:5002"
#!/bin/bash
export FLAGS_sync_nccl_allreduce=0
export FLAGS_eager_delete_tensor_gb=1
export FLAGS_fraction_of_gpu_memory_to_use=0.99
export CUDA_VISIBLE_DEVICES=5,6 # which GPU to use
ERNIE_FINETUNED_MODEL_PATH=./model_finetuned
DATA_PATH=./data/
python -u teacher_ernie.py \
--ernie_config_path "conf/ernie_config.json" \
--init_checkpoint "${ERNIE_FINETUNED_MODEL_PATH}" \
--init_bound 0.1 \
--vocab_path "conf/vocab.txt" \
--batch_size 32 \
--random_seed 0 \
--num_labels 57 \
--max_seq_len 128 \
--test_data "${DATA_PATH}/train.tsv" \
--label_map_config "./conf/label_map.json" \
--do_lower_case true \
--use_cuda true \
--out_port=5002