Unverified · Commit 45c8f7ce · authored by Chang Xu · committed by GitHub

[Cherry-Pick] Analysis Quant (#1599)

Parent 9b5c5202
# Detailed Tutorial for the PTQ (Post Training Quantization) Analysis Tool
## 1. Features of the Quantization Analysis Tool
1. Statistical analysis (statistical_analyse)
   - Visualizes boxplots of activations and weights; boxplots reveal whether outliers are present.
   - Visualizes histograms of weights and activations; histograms show the value distribution in more detail.
   - Reports detailed statistics of weights and activations before and after quantization, including min, max, mean, std, etc.
2. Metric error analysis (metric_error_analyse)
   - Traverses every layer of the model, quantizes it, and computes the post-quantization metric. This locates the specific layer responsible for the quantization loss.
3. Target model generation (get_target_quant_model)
   - Given an expected metric, directly produces a quantized model that meets it.
## 2. Parameters of paddleslim.quant.AnalysisPTQ
| **Parameter** | **Description** |
|-----------------------------|-----------------------------------------|
| model_dir | Required path to the model; may be a directory. For an ONNX model, pass the '.onnx' model file name directly |
| model_filename | Defaults to None. If model_dir is a directory, the model file name ending in '.pdmodel' is required; not needed when model_dir is an '.onnx' file name |
| params_filename | Defaults to None. If model_dir is a directory, the parameter file name ending in '.pdiparams' is required; not needed when model_dir is an '.onnx' file name |
| eval_function | A custom evaluation function; required if you want to verify accuracy |
| data_loader | Data used to calibrate the model; the DataLoader inherits from `paddle.io.DataLoader`. You can reuse a DataLoader from a model suite, or build your own following [paddle.io.DataLoader](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/io/DataLoader_cn.html#dataloader) |
| save_dir | Directory where metrics, PDFs, and other analysis outputs are saved; defaults to `analysis_results` |
| resume | Whether to load intermediate analysis files; defaults to False |
| ptq_config | Parameters forwarded to post-training quantization; see the [PTQ documentation](https://github.com/PaddlePaddle/PaddleSlim/tree/develop/demo/quant/quant_post) for details |
...@@ -39,14 +32,14 @@ ptq_config
## 3. Using the Quantization Analysis Tool
**Create the analysis tool**
```
analyzer = AnalysisPTQ(
    model_dir=config["model_dir"],
    model_filename=config["model_filename"],
    params_filename=config["params_filename"],
    eval_function=eval_function,
    data_loader=data_loader,
    save_dir=config['save_dir'],
    ptq_config=config['PTQ'])
```
**Statistical analysis**
...@@ -64,21 +57,21 @@ analyzer.statistical_analyse()
- `quantized_activation_histplot.pdf`: histogram of the quantized (INT) model's activations
- `quantized_weight_histplot.pdf`: histogram of the quantized (INT) model's weights
- `statistic.csv`: detailed statistics of weights and activations before and after quantization; the table stores:
  - Var Name: name of the Variable
  - Var Type: type of the Variable, Weight or Activation
  - Corresponding Weight Name: for an Activation, the name of its corresponding Weight
  - FP32 Min: minimum of the float data before quantization
  - FP32 Max: maximum of the float data before quantization
  - FP32 Mean: mean of the float data before quantization
  - FP32 Std: standard deviation of the float data before quantization
  - Quantized Min: minimum of the INT data after quantization
  - Quantized Max: maximum of the INT data after quantization
  - Quantized Mean: mean of the INT data after quantization
  - Quantized Std: standard deviation of the INT data after quantization
  - Diff Min: minimum of the difference in this Variable before vs. after quantization
  - Diff Max: maximum of the difference in this Variable before vs. after quantization
  - Diff Mean: mean of the difference in this Variable before vs. after quantization
  - Diff Std: standard deviation of the difference in this Variable before vs. after quantization
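These fields can be mined directly to spot risky layers. As a minimal sketch (assuming the CSV column names match the list above, which is worth verifying against your own `statistic.csv`), ranking variables by the spread of their quantization error:

```python
import csv

def rank_by_diff_std(csv_path, top_k=5):
    """Return (Var Name, Diff Std) pairs with the largest quantization-error
    spread, read from a statistic.csv produced by statistical_analyse()."""
    with open(csv_path, newline='') as f:
        rows = list(csv.DictReader(f))
    # sort by the spread of the before/after difference, largest first
    rows.sort(key=lambda r: float(r['Diff Std']), reverse=True)
    return [(r['Var Name'], float(r['Diff Std'])) for r in rows[:top_k]]
```

Variables at the top of this ranking are good first candidates to inspect in the boxplots and histograms.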
**Metric error analysis**
...@@ -89,10 +82,18 @@ analyzer.metric_error_analyse()
**Directly produce a target quantized model that meets the expected metric**
```
analyzer.get_target_quant_model(target_metric)
```
## 4. Running Post-Training Quantization Based on the Analysis Results
After running the analysis tool, you can use the metric ranking in `analysis.txt` to drop the worst-performing layers from quantization. Concretely, when calling `paddleslim.quant.quant_post_static`, pass the layers to be skipped via the `skip_tensor_list` argument.
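The exact line format of `analysis.txt` can differ between versions; assuming each line looks like `layer name: <name>, eval metric: <value>` and the file is sorted from most to least sensitive, a small hypothetical helper (not part of PaddleSlim) to build `skip_tensor_list` could look like:

```python
def build_skip_tensor_list(analysis_path, top_k=5):
    """Collect the top_k most sensitive layer names from analysis.txt,
    assuming lines like '... name: <layer>, eval metric: <value>'
    sorted from most to least sensitive."""
    names = []
    with open(analysis_path) as f:
        for line in f:
            if 'name:' not in line:
                continue
            # take the text between 'name:' and the following comma
            names.append(line.split('name:', 1)[1].split(',', 1)[0].strip())
    return names[:top_k]
```

The returned list can then be passed as `skip_tensor_list` to `paddleslim.quant.quant_post_static`.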
## FAQ:
- Difference from the QAT (Quantization-Aware Training) analysis tool: unlike the QAT tool, the PTQ analysis tool loads the original, to-be-quantized model and quantizes its layers one at a time, evaluating after each to obtain the metric error. The QAT tool instead loads a model produced by quantization-aware training, traverses all quantized layers, removes the quantization of one layer at a time while restoring the float parameters, and evaluates to obtain the metric error.
- Why the PTQ analysis tool is designed this way: it quantizes one layer at a time, rather than removing quantization layers one at a time, because PTQ itself is fast. Quantizing a single layer and evaluating gives a very direct view of that layer's impact on model accuracy.
- Why the analysis tools distinguish PTQ from QAT: experiments show that the sensitive layers of PTQ- and QAT-quantized models do not fully coincide, so separating the two algorithms makes the sensitivity analysis more accurate.
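The PTQ loop described in the first FAQ item can be sketched in a few lines (here `quantize_one` and `evaluate` are stand-in callables for illustration, not real PaddleSlim APIs):

```python
def ptq_sensitivity_scan(layers, baseline_metric, quantize_one, evaluate):
    """Quantize one layer at a time, evaluate the model, and rank layers
    by the metric drop they cause (most sensitive first)."""
    drop = {}
    for layer in layers:
        model = quantize_one(layer)  # PTQ applied to this single layer only
        drop[layer] = baseline_metric - evaluate(model)
    return sorted(drop, key=drop.get, reverse=True)
```

`metric_error_analyse` follows this shape internally: the most sensitive layers surface at the top of the ranking it writes to `analysis.txt`.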
# Detailed Tutorial for the QAT (Quantization-Aware Training) Analysis Tool
## 1. Features of the Quantization Analysis Tool
Metric error analysis (metric_error_analyse):
- Traverses every layer of the quantization-trained model, removes its quantization nodes, and computes the model metric with that layer left unquantized. This locates the specific layer responsible for the quantization loss.
## 2. Parameters of paddleslim.quant.AnalysisQAT
| **Parameter** | **Description** |
|-----------------------------|-----------------------------------------|
| quant_model_dir | Required path to the quantized model |
| float_model_dir | Required path to the pre-quantization (float) model |
| model_filename | Defaults to None. If the model path is a directory, the model file name ending in '.pdmodel' is required |
| params_filename | Defaults to None. If the model path is a directory, the parameter file name ending in '.pdiparams' is required |
| quantizable_op_type | Quantized op types to analyze; defaults to `conv2d`, `depthwise_conv2d`, `mul` |
| qat_metric | Metric of the quantized model; optional, defaults to None, in which case it is computed automatically |
| eval_function | A custom evaluation function, required |
| data_loader | Data used to calibrate the model; the DataLoader inherits from `paddle.io.DataLoader`. You can reuse a DataLoader from a model suite, or build your own following [paddle.io.DataLoader](https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/io/DataLoader_cn.html#dataloader) |
| save_dir | Directory where metrics, PDFs, and other analysis outputs are saved; defaults to `analysis_results` |
| resume | Whether to load intermediate analysis files; defaults to False |
## 3. Using the Quantization Analysis Tool
**Create the analysis tool**
```
analyzer = AnalysisQAT(
    quant_model_dir=config["quant_model_dir"],
    float_model_dir=config["float_model_dir"],
    model_filename=config["model_filename"],
    params_filename=config["params_filename"],
    quantizable_op_type=config['quantizable_op_type'],
    qat_metric=config['qat_metric'],
    eval_function=eval_function,
    data_loader=eval_loader,
    save_dir=config['save_dir'],
    resume=config['resume'],
)
```
**Metric error analysis**
```
analyzer.metric_error_analyse()
```
Calling this interface traverses every layer of the quantized model, removes its quantization nodes, and computes the model metric with that layer left unquantized. An Eval Function must be supplied. The resulting ranking of metrics, one per model with a single quantization layer removed, is saved to `./analysis_results/analysis.txt` by default. For a concrete example, see the [GPT QAT sensitivity analysis demo](../../../../example/quantization_analysis/GPT/README.md)
## FAQ:
- Difference from the PTQ (Post Training Quantization) analysis tool: unlike the PTQ tool, the QAT analysis tool loads a model produced by quantization-aware training, traverses all quantized layers, removes the quantization of one layer at a time while restoring the float parameters, and evaluates to obtain the metric error. The PTQ tool instead loads the original, to-be-quantized model, quantizes its layers one at a time, and evaluates after each to obtain the metric error.
- Why the QAT analysis tool is designed this way: it removes quantization layers one at a time, rather than quantizing one layer at a time, because QAT requires training. Running quantization training for every layer and then evaluating would be slow; loading the already-trained quantized model and removing one quantization layer at a time is far more efficient.
- Why the analysis tools distinguish PTQ from QAT: experiments show that the sensitive layers of PTQ- and QAT-quantized models do not fully coincide, so separating the two algorithms makes the sensitivity analysis more accurate.
...@@ -130,7 +130,7 @@ python eval.py --config_path=./configs/ppyoloe_s_ptq.yaml
- The path of the model under test can be changed via the `model_dir` field of the config file.
#### 3.6 Improving Post-Training Quantization Accuracy
This section shows how to use the quantization analysis tool to improve post-training quantization accuracy. Post-training quantization needs only a small amount of data, is easy to use, and produces a quantized model quickly, but it often causes a large accuracy drop. PaddleSlim provides an analysis tool, exposed through the ```paddleslim.quant.AnalysisPTQ``` interface, that visualizes the layers unsuitable for quantization; skipping those layers improves the accuracy of the quantized model. For details on ```paddleslim.quant.AnalysisPTQ```, see [AnalysisPTQ.md](../../../docs/zh_cn/tutorials/quant/AnalysisPTQ.md)
After many experiments, including trying several activation calibration algorithms (avg, KL, etc.) and weight quantization schemes (abs_max, channel_wise_abs_max), the post-training-quantized PicoDet-s still scored 0 accuracy. Taking PicoDet-s as the example, the analysis tool is used as follows:
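Those experiments amount to a small grid search over calibration settings. A sketch of enumerating the combinations (the algorithm names below follow common PaddleSlim PTQ options mentioned above, so check them against the version you use):

```python
import itertools

# Candidate calibration settings; names are assumptions based on common
# PaddleSlim PTQ options, verify against your installed version.
activation_algos = ['avg', 'KL', 'hist', 'mse']
weight_quantize_types = ['abs_max', 'channel_wise_abs_max']

# Build one trial configuration per combination.
ptq_grid = [
    {'algo': algo, 'weight_quantize_type': wtype}
    for algo, wtype in itertools.product(activation_algos, weight_quantize_types)
]
```

Each dict could then drive one `quant_post_static` trial; when every combination still fails (as with PicoDet-s here), per-layer analysis is the next step.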
......
...@@ -23,7 +23,7 @@ from ppdet.core.workspace import create
from ppdet.metrics import COCOMetric, VOCMetric, KeyPointTopDownCOCOEval
from keypoint_utils import keypoint_post_process
from post_process import PPYOLOEPostProcess
from paddleslim.quant.analysis_ptq import AnalysisPTQ
def argsparser():
...@@ -161,7 +161,7 @@ def main():
    else:
        raise ValueError("metric currently only supports COCO and VOC.")
    analyzer = AnalysisPTQ(
        model_dir=config["model_dir"],
        model_filename=config["model_filename"],
        params_filename=config["params_filename"],
......
...@@ -116,7 +116,9 @@ python eval.py --config_path=./configs/yolov5s_ptq.yaml
#### 3.6 Improving Post-Training Quantization Accuracy
###### 3.6.1 Quantization Analysis Tool
This section shows how to use the quantization analysis tool to improve post-training quantization accuracy. Post-training quantization needs only a small amount of data, is easy to use, and produces a quantized model quickly, but it often causes a large accuracy drop. PaddleSlim provides an analysis tool, exposed through the ```paddleslim.quant.AnalysisPTQ``` interface, that visualizes the layers unsuitable for quantization; skipping those layers improves the accuracy of the quantized model. For details on ```paddleslim.quant.AnalysisPTQ```, see [AnalysisPTQ.md](../../../docs/zh_cn/tutorials/quant/AnalysisPTQ.md)
Since YOLOv6 quantizes poorly with post-training quantization, YOLOv6 is used as the example; the analysis tool is used as follows:
......
...@@ -21,7 +21,7 @@ from tqdm import tqdm
from post_process import YOLOPostProcess, coco_metric
from dataset import COCOValDataset, COCOTrainDataset
from paddleslim.common import load_config, load_onnx_model
from paddleslim.quant.analysis_ptq import AnalysisPTQ
def argsparser():
...@@ -103,7 +103,7 @@ def main():
    load_onnx_model(config["model_dir"])
    inference_model_path = config["model_dir"].rstrip().rstrip(
        '.onnx') + '_infer'
    analyzer = AnalysisPTQ(
        model_dir=inference_model_path,
        model_filename='model.pdmodel',
        params_filename='model.pdiparams',
......
# GPT QAT Sensitivity Analysis Example
## 1. Introduction
Using the NLP generation model GPT-3 as the example, this demo shows how to use the QAT sensitivity analysis tool to analyze a quantized model and improve quantization-aware training accuracy.
## 2. Benchmark
| Model | Strategy | ACC | Inference Model |
| :-------- |:-------- | :--------: | :--------: |
| GPT-345M | Baseline | 44.17 | [Model](https://bj.bcebos.com/v1/paddle-slim-models/GPT_345M_Baseline.tar) |
| GPT-345M | QAT (before analysis) | 41.58 | [Model](https://bj.bcebos.com/v1/paddle-slim-models/GPT_345_QAT_wo_analysis.tar) |
| GPT-345M | QAT (after analysis) | 44.94 | [Model](https://bj.bcebos.com/v1/paddle-slim-models/GPT_345M_QAT_w_analysis_infer.tar) |
- All ACC numbers are accuracy scores measured on the [LAMBADA](https://raw.githubusercontent.com/cybertronai/bflm/master/lambada_test.jsonl) dataset
## 3. Analysis Workflow
#### 3.1 Environment
- PaddlePaddle >= 2.3 (install from the [Paddle website](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html))
- PaddleSlim develop branch
- PaddleFleetX >= 2.4
#### 3.2 Dataset
Sensitivity analysis derives per-layer sensitivities from a validation set; you can download and use [LAMBADA](https://raw.githubusercontent.com/cybertronai/bflm/master/lambada_test.jsonl) or [WikiText](https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip). This demo uses LAMBADA.
#### 3.3 Inference Models
- [GPT-345M](https://bj.bcebos.com/v1/paddle-slim-models/GPT_345M_Baseline.tar): base model
- [GPT-345M](https://bj.bcebos.com/v1/paddle-slim-models/GPT_345_QAT_wo_analysis.tar): model after quantization-aware training, before analysis
#### 3.4 Sensitivity Analysis
Sensitivity analysis is launched through the analysis.py script, which uses the ```paddleslim.quant.AnalysisQAT``` interface. Set the model paths, data path, and quantization-related parameters in the config file, then run:
```shell
python analysis.py --config_path=./configs/gpt_345M_analysis.yaml
```
When the analysis finishes, it produces the per-layer sensitivities sorted from high to low (the higher the sensitivity, the more that layer hurts model accuracy), saved in ```analysis_results/analysis.txt```.
The ten most sensitive layers are ```linear_31```, ```linear_27```, ```linear_22```, ```linear_43```, ```linear_83```, ```linear_15```, ```linear_87```, ```linear_3```, ```linear_38```, and ```linear_39```. Of these ten, eight belong to the second FFN Linear in ```TransformerDecoder``` and two to the first FFN Linear, while the Linear layers inside ```MultiHeadAttention``` are comparatively insensitive.
For details on ```paddleslim.quant.AnalysisQAT```, see [AnalysisQAT.md](../../../docs/zh_cn/tutorials/quant/AnalysisQAT.md).
#### 3.5 Re-running Quantization-Aware Training
Guided by the analysis, quantization training was re-run with the seven Linear layers ```linear_31```, ```linear_27```, ```linear_22```, ```linear_43```, ```linear_83```, ```linear_15```, and ```linear_87``` excluded from quantization; the quantized model's accuracy reached 44.94.
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import sys
import random
import numpy as np
import argparse
import time
import math
import paddle
from paddleslim.common import load_config as load_slim_config
from paddleslim.quant.analysis_qat import AnalysisQAT
from ppfleetx.data import build_dataloader
from ppfleetx.distributed.apis import env
from utils import parse_config
def argsparser():
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
'--config_path',
type=str,
default=None,
help="path of compression strategy config.",
required=True)
parser.add_argument(
'--save_dir',
type=str,
default='analysis_results',
help="directory to save compressed model.")
parser.add_argument(
'--devices',
type=str,
default='gpu',
help="which device used to compress.")
return parser
def eval_reader_wrapper(reader):
def gen():
for data in reader:
tokens, loss_mask, attention_mask, position_ids, labels, info = data
in_dict = {}
in_dict['tokens'] = tokens
in_dict['ids'] = position_ids
yield in_dict, labels, loss_mask, info
return gen
def eval_function(exe, program, feed_names, fetch_list):
tic_eval = time.time()
score_name = "loss" if not global_config['cloze_eval'] else "number correct"
first_step = True
eval_losses = []
total_score = 0
for eval_step, (data, labels, loss_mask, info) in enumerate(eval_loader()):
preds = exe.run(program=program,
feed=data,
fetch_list=fetch_list,
return_numpy=False)
paddle.disable_static()
labels = paddle.to_tensor(labels)
preds = paddle.to_tensor(preds[0])
loss_mask = paddle.to_tensor(loss_mask)
info = paddle.to_tensor(info)
if not global_config['cloze_eval']:
if first_step:
num_original_tokens = info.numpy()[0][0]
num_tokenized_tokens = info.numpy()[0][1]
first_step = False
masked_lm_loss = paddle.nn.functional.cross_entropy(
preds, labels, reduction="none")
loss = paddle.sum(masked_lm_loss * loss_mask)
eval_losses.append(loss.numpy()[0])
total_score += loss.numpy() / (num_tokenized_tokens - 1)
else:
if first_step:
num_examples = info.numpy()[0][0]
first_step = False
outputs = paddle.argmax(preds, -1)
acc = paddle.cast(outputs == labels, 'float32')
acc = paddle.where(
paddle.cast(loss_mask, 'bool'), acc, paddle.ones_like(acc))
acc = paddle.sum(paddle.prod(acc, -1))
eval_losses.append(acc.numpy()[0])
total_score += acc.numpy()[0]
if eval_step != 0 and (eval_step % 10 == 0):
print("[eval] step: %d, batch: %d, %s: %.9f, speed: %.2f step/s" %
(eval_step, eval_step, score_name, total_score,
1. / (time.time() - tic_eval)))
tic_eval = time.time()
paddle.enable_static()
metric = None
if not global_config['cloze_eval']:
total_loss = float(total_score)
ppl = math.exp(min(20, total_loss))
token_ratio = (num_tokenized_tokens - 1) / (num_original_tokens - 1)
adjusted_ppl = math.exp(min(20, total_loss * token_ratio))
string = ' validation results on {} | '.format(gpt_config['Data'][
'Eval']['dataset']['name'])
string += 'avg loss: {:.4E} | '.format(total_loss)
string += 'ppl: {:.4E} | '.format(ppl)
string += 'adjusted ppl: {:.4E} | '.format(adjusted_ppl)
string += 'token ratio: {} |'.format(token_ratio)
metric = ppl
else:
num_correct = float(total_score)
acc = float(num_correct / num_examples)
string = ' validation results on {} | '.format(gpt_config['Data'][
'Eval']['dataset']['name'])
string += 'number correct: {:.4E} | '.format(num_correct)
string += 'total examples: {:.4E} | '.format(num_examples)
string += 'avg accuracy: {:.4E}'.format(acc)
metric = acc
print(string)
return metric
def main():
global global_config, all_config
all_config = load_slim_config(FLAGS.config_path)
assert "Global" in all_config, "Key 'Global' not found in config file. \n{}".format(
all_config)
global_config = all_config["Global"]
seed = all_config['Global']['seed']
random.seed(seed)
np.random.seed(seed)
paddle.seed(seed)
env.set_seed(seed)
global gpt_config
gpt_config = parse_config(global_config['reader_config'])
if not global_config['cloze_eval']:
gpt_config['Data']['Eval']['dataset']['name'] = "LM_Eval_Dataset"
else:
gpt_config['Data']['Eval']['dataset']['name'] = "Lambada_Eval_Dataset"
valid_data_loader = build_dataloader(gpt_config['Data'], "Eval")
global eval_loader
eval_loader = eval_reader_wrapper(valid_data_loader)
analyzer = AnalysisQAT(
quant_model_dir=global_config["quant_model_dir"],
float_model_dir=global_config["float_model_dir"],
model_filename=global_config["model_filename"],
params_filename=global_config["params_filename"],
quantizable_op_type=global_config['quantizable_op_type'],
qat_metric=global_config['qat_metric']
if 'qat_metric' in global_config else None,
eval_function=eval_function,
data_loader=eval_loader,
save_dir=FLAGS.save_dir,
resume=global_config['resume'], )
analyzer.metric_error_analyse()
if __name__ == '__main__':
paddle.enable_static()
parser = argsparser()
FLAGS = parser.parse_args()
assert FLAGS.devices in ['cpu', 'gpu', 'xpu', 'npu']
paddle.set_device(FLAGS.devices)
main()
Global:
device: gpu
seed: 1024
quant_model_dir: ./GPT_345_QAT_wo_analysis
float_model_dir: ./GPT_345M_Baseline
model_filename: model.pdmodel
params_filename: model.pdiparams
quantizable_op_type: ["mul", "matmul", "matmul_v2"]
resume: False
reader_config: ./configs/gpt_reader.yaml
cloze_eval: True # True for LAMBADA Dataset; False for WikiText
\ No newline at end of file
Data:
Eval:
dataset:
name: GPTDataset
input_dir: ./lambada_test.jsonl
max_seq_len: 1024
overlapping_eval: 32
loader:
num_workers: 1
return_list: True
collate_fn: gpt_collate_fn
batch_size: 1
\ No newline at end of file
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import argparse
import codecs
import yaml
import time
import copy
class AttrDict(dict):
def __getattr__(self, key):
return self[key]
def __setattr__(self, key, value):
if key in self.__dict__:
self.__dict__[key] = value
else:
self[key] = value
def __copy__(self):
cls = self.__class__
result = cls.__new__(cls)
result.__dict__.update(self.__dict__)
return result
def __deepcopy__(self, memo):
cls = self.__class__
result = cls.__new__(cls)
memo[id(self)] = result
for k, v in self.__dict__.items():
setattr(result, k, copy.deepcopy(v, memo))
for k, v in self.items():
setattr(result, k, copy.deepcopy(v, memo))
return result
def setdefault(self, k, default=None):
if k not in self or self[k] is None:
self[k] = default
return default
else:
return self[k]
def create_attr_dict(yaml_config):
from ast import literal_eval
for key, value in yaml_config.items():
if type(value) is dict:
yaml_config[key] = value = AttrDict(value)
if isinstance(value, str):
try:
value = literal_eval(value)
except BaseException:
pass
if isinstance(value, AttrDict):
create_attr_dict(yaml_config[key])
else:
yaml_config[key] = value
def parse_config(cfg_file):
"""Load a config file into AttrDict"""
def _update_dic(dic, base_dic):
'''Update config from dic based base_dic
'''
base_dic = base_dic.copy()
dic = dic.copy()
if dic.get('_inherited_', True) == False:
dic.pop('_inherited_')
return dic
for key, val in dic.items():
if isinstance(val, dict) and key in base_dic:
base_dic[key] = _update_dic(val, base_dic[key])
else:
base_dic[key] = val
dic = base_dic
return dic
def _parse_from_yaml(path):
'''Parse a yaml file and build config'''
with codecs.open(path, 'r', 'utf-8') as file:
dic = yaml.load(file, Loader=yaml.FullLoader)
if '_base_' in dic:
cfg_dir = os.path.dirname(path)
base_path = dic.pop('_base_')
base_path = os.path.join(cfg_dir, base_path)
base_dic = _parse_from_yaml(base_path)
dic = _update_dic(dic, base_dic)
return dic
yaml_dict = _parse_from_yaml(cfg_file)
yaml_config = AttrDict(yaml_dict)
create_attr_dict(yaml_config)
return yaml_config
...@@ -37,10 +37,10 @@ from ..common import get_feed_vars, wrap_dataloader, load_inference_model, get_m
_logger = get_logger(__name__, level=logging.INFO)
__all__ = ["AnalysisPTQ"]
class AnalysisPTQ(object):
    def __init__(self,
                 model_dir,
                 model_filename=None,
...@@ -51,7 +51,7 @@ class AnalysisQuant(object):
                 resume=False,
                 ptq_config=None):
        """
        AnalysisPTQ provides methods to analyze the sensitivity of each op in the model.
        Args:
            model_dir(str): the path of fp32 model that will be quantized, it can also be '.onnx'
...@@ -403,7 +403,8 @@ class AnalysisQuant(object):
        statistic = []
        box_fp_dist, box_q_dist = [], []
        hist_fp_dist, hist_q_dist = {}, {}
        fp_tensor_names = sorted(list(fp_tensors.keys()))
        for var_name in fp_tensor_names:
            fp_tensor = fp_tensors[var_name]
            quant_name = var_name_map[
                var_name] if var_name_map is not None else var_name
...@@ -503,7 +504,9 @@ class AnalysisQuant(object):
        for name in hist_data:
            plt.hist(hist_data[name][0], bins=hist_data[name][1])
            plt.xlabel(name)
            plt.ylabel("Probability")
            locs, _ = plt.yticks()
            plt.yticks(locs, np.round(locs / len(hist_data[name][0]), 3))
            if 'act' in save_name:
                plt.title("Hist of Activation {}".format(name))
            else:
......
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import sys
import pickle
import copy
import logging
import numpy as np
import paddle
from paddle.fluid import core
from paddle.fluid.framework import IrGraph
from ..common import get_logger, load_inference_model
_logger = get_logger(__name__, level=logging.INFO)
__all__ = ["AnalysisQAT"]
class AnalysisQAT(object):
def __init__(self,
quant_model_dir,
float_model_dir,
model_filename=None,
params_filename=None,
quantizable_op_type=["conv2d", "depthwise_conv2d", "mul"],
qat_metric=None,
eval_function=None,
data_loader=None,
save_dir='analysis_results',
resume=False):
        '''
        AnalysisQAT provides methods to analyze the sensitivity of each op in the model.
        Args:
            quant_model_dir(str): the path of the INT8 model quantized through QAT
            float_model_dir(str): the path of the FP32 model that the quant model is based on
            model_filename(str, optional): the model file name of the model
            params_filename(str, optional): the parameter file name of the model
            quantizable_op_type(list of str, optional): the types of op that will be analyzed
            qat_metric(float, optional): the metric of the quantized model; computed automatically when None
            eval_function(function): evaluation function, defined by the user, that returns the metric of the inference program; used to judge the metric of the quantized model
            data_loader(Python Generator, Paddle.io.DataLoader, optional): the
                Generator or DataLoader that provides calibration data; it
                should return a batch each time
            save_dir(str, optional): the output dir that stores the analyzed information
            resume(bool, optional): when analysis is interrupted, resume the run and load the information already analyzed
        '''
if model_filename is None:
model_filename = 'model.pdmodel'
if params_filename is None:
params_filename = 'model.pdiparams'
self.quant_model_dir = quant_model_dir
self.float_model_dir = float_model_dir
self.model_filename = model_filename
self.params_filename = params_filename
self.quantizable_op_type = quantizable_op_type
self.qat_metric = qat_metric
self.eval_function = eval_function
self.save_dir = save_dir
self.checkpoint_name = os.path.join(save_dir, 'analysis_checkpoint.pkl')
self.nonquant_layer_metrics = {}
if not os.path.exists(self.save_dir):
os.mkdir(self.save_dir)
devices = paddle.device.get_device().split(':')[0]
self.places = paddle.device._convert_to_place(devices)
executor = paddle.static.Executor(self.places)
[program, self.feed_list, self.fetch_list] = load_inference_model(
self.quant_model_dir,
executor=executor,
model_filename=self.model_filename,
params_filename=self.params_filename)
_logger.info('Loaded model from: {}'.format(quant_model_dir))
graph = IrGraph(core.Graph(program.desc), for_test=True)
# find all inputs for each quantizable op
self.inputs_of_quantized_op = []
sorted_ops = graph.topology_sort()
for op_node in sorted_ops:
op_name = op_node.name()
if op_name in quantizable_op_type:
input_names = op_node.op().input_arg_names()
for input_name in input_names:
if 'quantized' in input_name:
self.inputs_of_quantized_op.append(input_names)
break
if self.qat_metric is None:
_logger.info('Calculating the metric of QAT model...')
self.qat_metric = self.eval_function(
executor, program, self.feed_list, self.fetch_list) * 100
_logger.info('The metric of QAT model is {}'.format(
round(self.qat_metric, 4)))
executor.close()
def save_checkpoint(self):
if not os.path.exists(self.save_dir):
os.makedirs(self.save_dir)
with open(self.checkpoint_name, 'wb') as f:
pickle.dump(self.nonquant_layer_metrics, f)
_logger.info('Save checkpoint to {}.'.format(self.checkpoint_name))
def load_checkpoint(self):
if not os.path.exists(self.checkpoint_name):
_logger.info('Checkpoint path {} does not exist.'.format(
self.checkpoint_name))
return False
with open(self.checkpoint_name, 'rb') as f:
self.nonquant_layer_metrics = pickle.load(f)
_logger.info('Load checkpoint from {}.'.format(self.checkpoint_name))
return True
def get_weight_name(self, inputs_names):
# TODO(xc)
w_idx = 0 if 'w_0' in inputs_names[0] else 1
weight_name = inputs_names[w_idx].split('.quantized.dequantized')[0]
return weight_name
def get_new_in_out_map(
self,
input_list,
graph,
float_scope,
quant_scope, ):
input_rename_map = {}
output_rename_map = {}
removed_ops = []
for op_node in graph.all_op_nodes():
if op_node.id() in removed_ops:
continue
in_names = op_node.input_arg_names()
out_names = op_node.output_arg_names()
if len(out_names) == 1 and out_names[0] in input_list:
in_var = graph._find_node_by_name(op_node.inputs,
op_node.input('X')[0])
out_var = graph._find_node_by_name(op_node.outputs,
op_node.output('Y')[0])
if 'quantized' in in_var.name():
# act
for op in graph.all_op_nodes():
o_ns = op.output_arg_names()
if len(o_ns) == 1 and o_ns[0] == in_var.name():
in_var_1 = graph._find_node_by_name(
op.inputs, op.input('X')[0])
graph.safe_remove_nodes(op)
removed_ops.append(op.id())
input_rename_map[out_var.node] = in_var_1
else:
# weight
with paddle.static.scope_guard(float_scope):
float_weight = np.array(
float_scope.find_var(in_var.name()).get_tensor())
with paddle.static.scope_guard(quant_scope):
quant_scope.find_var(in_var.name()).get_tensor().set(
float_weight, self.places)
input_rename_map[out_var.node] = in_var
graph.safe_remove_nodes(op_node)
removed_ops.append(op_node.id())
output_rename_map[in_var.node] = out_var
return input_rename_map, output_rename_map, removed_ops
def relink_graph(self, graph, input_rename_map, output_rename_map,
removed_ops):
for op_node in graph.all_op_nodes():
if op_node.id() in removed_ops:
continue
for var in op_node.inputs:
if var.node in input_rename_map:
old_in = var
new_in = input_rename_map[var.node]
graph.update_input_link(old_in, new_in, op_node)
_logger.info(
f'relink {op_node.name()} \'s input node from {old_in.name()} to {new_in.name()}.'
)
for var in op_node.outputs:
if var.node in output_rename_map:
old_out = var
new_out = output_rename_map[var.node]
graph.update_input_link(old_out, new_out, op_node)
_logger.info(
f'relink {op_node.name()} \'s output node from {old_out.name()} to {new_out.name()}.'
)
return graph.to_program()
def metric_error_analyse(self):
executor = paddle.static.Executor(self.places)
float_scope = paddle.static.Scope()
quant_scope = paddle.static.Scope()
for idx, input_list in enumerate(self.inputs_of_quantized_op):
weight_name = self.get_weight_name(input_list)
_logger.info(
'Checking {}/{} quant model: without quant layer {}'.format(
idx + 1, len(self.inputs_of_quantized_op), weight_name))
with paddle.static.scope_guard(float_scope):
load_inference_model(
self.float_model_dir,
executor=executor,
model_filename=self.model_filename,
params_filename=self.params_filename)
with paddle.static.scope_guard(quant_scope):
[program, self.feed_list,
self.fetch_list] = load_inference_model(
self.quant_model_dir,
executor=executor,
model_filename=self.model_filename,
params_filename=self.params_filename)
program_copy = program.clone()
graph = IrGraph(core.Graph(program_copy.desc), for_test=True)
input_rename_map, output_rename_map, removed_ops = self.get_new_in_out_map(
input_list, graph, float_scope, quant_scope)
saved_program = self.relink_graph(graph, input_rename_map,
output_rename_map, removed_ops)
with paddle.static.scope_guard(quant_scope):
_logger.info('Skip quant {}, evaluating....'.format(
weight_name))
metric = self.eval_function(executor, saved_program,
self.feed_list,
self.fetch_list) * 100
self.nonquant_layer_metrics[weight_name] = metric
_logger.info(
'When skip quant {}, the metric is {}, the diff is {}'.
format(weight_name,
round(metric, 4), round(metric - self.qat_metric,
4)))
self.save_checkpoint()
executor.close()
self.sensitivity_ranklist = sorted(
self.nonquant_layer_metrics,
key=self.nonquant_layer_metrics.get,
reverse=True)
_logger.info('Finished computing the sensitivity of the model.')
for name in self.sensitivity_ranklist:
_logger.info("without quant layer name: {}, eval metric: {}".format(
name, self.nonquant_layer_metrics[name]))
analysis_file = os.path.join(self.save_dir, "analysis.txt")
with open(analysis_file, "w") as analysis_ret_f:
for name in self.sensitivity_ranklist:
analysis_ret_f.write(
"without layer name: {}, eval metric: {}\n".format(
name, self.nonquant_layer_metrics[name]))
_logger.info('Analysis file is saved in {}'.format(analysis_file))