Add new mindinsight profiler components

9be463ac · wangyue01 · 35fcfc42
14 changed files
--- a/tutorials/source_en/advanced_use/images/data_op_profile.png
+++ b/tutorials/source_en/advanced_use/images/data_op_profile.png
--- a/tutorials/source_en/advanced_use/images/minddata_profile.png
+++ b/tutorials/source_en/advanced_use/images/minddata_profile.png
--- a/tutorials/source_en/advanced_use/images/performance_overall.png
+++ b/tutorials/source_en/advanced_use/images/performance_overall.png
--- a/tutorials/source_en/advanced_use/images/step_trace.png
+++ b/tutorials/source_en/advanced_use/images/step_trace.png
--- a/tutorials/source_en/advanced_use/images/timeline.png
+++ b/tutorials/source_en/advanced_use/images/timeline.png
--- a/tutorials/source_en/advanced_use/performance_profiling.md
+++ b/tutorials/source_en/advanced_use/performance_profiling.md
+# Performance Profiler
+
+<!-- TOC -->
+
+- [Performance Profiler](#performance-profiler)
+    - [Overview](#overview)
+    - [Operation Process](#operation-process)
+    - [Preparing the Training Script](#preparing-the-training-script)
+    - [Launch MindInsight](#launch-mindinsight)
+    - [Performance Analysis](#performance-analysis)
+        - [Step Trace Analysis](#step-trace-analysis)
+        - [Operator Performance Analysis](#operator-performance-analysis)
+        - [MindData Performance Analysis](#minddata-performance-analysis)
+        - [Timeline Analysis](#timeline-analysis)
+    - [Specifications](#specifications)
+
+<!-- /TOC -->
+
+<a href="https://gitee.com/mindspore/docs/blob/master/tutorials/source_en/advanced_use/performance_profiling.md" target="_blank"><img src="../_static/logo_source.png"></a>
+
+## Overview
+The MindInsight Profiler records performance data, such as operator execution time, in files and displays it on a web page, which helps users optimize the performance of neural networks. The Profiler currently supports only the Ascend chip.
+
+## Operation Process
+
+- Prepare a training script, add the Profiler APIs to it, and run the script.
+- Start MindInsight and specify the profile data directory using startup parameters. After MindInsight is started, access the visualization page based on the IP address and port number. The default access address is `http://127.0.0.1:8080`.
+- Find the training in the list, click the performance profiling link, and view the data on the web page.
+
+## Preparing the Training Script
+
+To enable performance profiling of a neural network, add the MindInsight Profiler APIs to the training script. First, create the MindInsight `Profiler` object
+after setting the context and before initializing the network. Then, at the end of training, call `Profiler.analyse()` to stop profiling and generate the performance
+analysis results.
+
+The sample code is as follows:
+
+```python
+import os
+
+from mindinsight.profiler import Profiler
+from mindspore import Model, nn, context
+
+
+def test_profiler():
+    # Init context env
+    context.set_context(mode=context.GRAPH_MODE, device_target='Ascend', device_id=int(os.environ["DEVICE_ID"]))
+    
+    # Init Profiler
+    profiler = Profiler(output_path='./data', is_detail=True, is_show_op_path=False, subgraph='all')
+    
+    # Init hyperparameter
+    epoch = 2
+    # Init network and Model (Net, CrossEntropyLoss and MyOptimizer are user-defined placeholders)
+    net = Net()
+    loss_fn = CrossEntropyLoss()
+    optim = MyOptimizer(learning_rate=0.01, params=net.trainable_params())
+    model = Model(net, loss_fn=loss_fn, optimizer=optim, metrics=None)  
+    # Prepare mindrecord_dataset for training
+    train_ds = create_mindrecord_dataset_for_training()
+    # Model Train
+    model.train(epoch, train_ds)
+    
+    # Profiler end
+    profiler.analyse()
+``` 
+
+
+## Launch MindInsight
+
+For the MindInsight launch command, see the **MindInsight Commands** section in [Training Process Visualization](https://www.mindspore.cn/tutorial/en/master/advanced_use/visualization_tutorials.html).
+
+
+### Performance Analysis
+
+Users can access the Performance Profiler by selecting a specific training from the training list and clicking the performance profiling link.
+
+![performance_overall.png](./images/performance_overall.png)
+
+Figure 1: Overall Performance
+
+Figure 1 displays the overall performance of the training, including summary data from the Step Trace, Operator Performance, MindData Performance and Timeline components. These components show the following data:
+
+- Step Trace: Divides each training step into several stages and collects the execution time of each stage. The overall performance page shows the step trace graph.
+- Operator Performance: Collects the execution time of operators and operator types. The overall performance page shows a pie chart of the time taken by each operator type.
+- MindData Performance: Analyses the performance of the data input stages. The overall performance page shows the number of steps in which each stage may be a bottleneck.
+- Timeline: Collects the execution time of stream tasks on the devices and shows the tasks on a time axis. The overall performance page shows statistics for streams and tasks.
+
+Users can click the detail link to see the details of each component. In addition, MindInsight Profiler analyses the performance data, and the assistant on the left
+shows performance tuning suggestions for this training.
+
+#### Step Trace Analysis
+
+The Step Trace component shows the general performance of the stages in the training. Step Trace divides each training step into several stages:
+Step Gap (the time between the end of one step and the start of computation in the next step), Forward/Backward Propagation, All Reduce and Parameter Update. It shows the execution time of each stage, which helps to find the bottleneck
+stage quickly.
+
+![step_trace.png](./images/step_trace.png)
+
+Figure 2: Step Trace Analysis
+
+Figure 2 displays the Step Trace page. The Step Trace detail shows the start and finish time of each stage. By default, it shows the average time over all steps. Users
+can also choose a specific step to see its step trace statistics. The graphs at the bottom of the page show how the execution time of Step Gap, Forward/Backward Propagation and
+Step Tail (the time between the end of Backward Propagation and the end of Parameter Update) changes from step to step, which helps to decide whether the performance of a stage can be optimized.
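As a rough illustration of how the three plotted durations relate to each other, the sketch below derives Step Gap, Forward/Backward Propagation time and Step Tail from four timestamps per step and then averages them over all steps. The `StepTimestamps` class, its field names and the sample values are assumptions made for this example; they are not part of the MindInsight API or its file format.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class StepTimestamps:
    """Hypothetical per-step timestamps in milliseconds (not a MindInsight type)."""
    prev_step_end: float  # end of the previous step
    fp_start: float       # start of the forward start operator
    bp_end: float         # end of the backward end operator
    step_end: float       # end of parameter update


def stage_durations(ts):
    """Split one step into the three durations described in the text."""
    return {
        "step_gap": ts.fp_start - ts.prev_step_end,
        "fp_bp": ts.bp_end - ts.fp_start,
        "step_tail": ts.step_end - ts.bp_end,
    }


def average_durations(steps):
    """Average each duration over all steps, similar to the default view."""
    per_step = [stage_durations(ts) for ts in steps]
    return {name: mean(d[name] for d in per_step) for name in per_step[0]}


# Made-up sample data for two consecutive steps.
steps = [
    StepTimestamps(prev_step_end=0.0, fp_start=4.0, bp_end=54.0, step_end=60.0),
    StepTimestamps(prev_step_end=60.0, fp_start=62.0, bp_end=110.0, step_end=120.0),
]
averages = average_durations(steps)
print(averages)
```

A step whose `step_gap` is far above the average points at the data input side, while a large `step_tail` points at All Reduce or Parameter Update.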
+
+To divide the stages, the Step Trace component needs to identify the forward propagation start operator and the backward propagation end operator. MindSpore identifies the two operators automatically to reduce
+the profiler configuration work. The first operator after `get_next` is selected as the forward start operator, and the operator before the last all reduce is selected as the backward end operator.
+**However, the Profiler does not guarantee that the automatically selected operators will meet the user's expectation in all cases.** Users can set the two operators manually as follows:
+
+- Set environment variable ```FP_POINT``` to configure the forward start operator, for example, ```export FP_POINT=fp32_vars/conv2d/BatchNorm```
+- Set environment variable ```BP_POINT``` to configure the backward end operator, for example, ```export BP_POINT=loss_scale/gradients/AddN_70```
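When the training script is started from another Python program rather than from a shell, the same two variables can be passed to the child process. This is a minimal sketch under that assumption: `build_profiler_env` and `run_training` are hypothetical helpers, and `train.py` is a placeholder script name. Only the variable names `FP_POINT` and `BP_POINT` and the operator names come from the text above.

```python
import os
import subprocess
import sys


def build_profiler_env(fp_point, bp_point, base_env=None):
    """Return a copy of the environment with FP_POINT and BP_POINT set."""
    env = dict(os.environ if base_env is None else base_env)
    env["FP_POINT"] = fp_point
    env["BP_POINT"] = bp_point
    return env


def run_training(script_path, env):
    """Run the training script in a child process that receives `env`."""
    return subprocess.run([sys.executable, script_path], env=env, check=True)


# base_env is passed here only to keep the example small;
# omit it in real use so that the current environment is inherited.
env = build_profiler_env(
    fp_point="fp32_vars/conv2d/BatchNorm",
    bp_point="loss_scale/gradients/AddN_70",
    base_env={"DEVICE_ID": "0"},
)
print(env)
# run_training("train.py", env)  # "train.py" is a hypothetical script name
```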
+
+
+#### Operator Performance Analysis
+
+The operator performance analysis component displays the execution time of the operators while MindSpore runs.
+
+![op_type_statistics.png](./images/op_type_statistics.PNG)
+
+Figure 3: Statistics for Operator Types
+
+Figure 3 displays the statistics for the operator types, including:
+
+- Choose a pie or bar graph to show the proportion of time taken by each operator type. The time of one operator type is the sum of the execution time of the operators belonging to this type.
+- Display the top 20 operator types with the longest execution time, and show the proportion and execution time (ms) of each operator type.
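The per-type statistics described above can be reproduced with a few lines of Python. The sketch below sums the execution time per operator type, ranks the types and computes each type's share of the total. The record layout `(operator_name, operator_type, time_ms)` and the sample values are assumptions made for this example, not the Profiler's output format.

```python
from collections import defaultdict


def summarize_by_type(op_records, top_n=20):
    """Sum execution time per operator type and compute each type's share.

    `op_records` is a list of (operator_name, operator_type, time_ms) tuples;
    this layout is an assumption, not the Profiler's file format.
    """
    totals = defaultdict(float)
    for _name, op_type, time_ms in op_records:
        totals[op_type] += time_ms
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [
        (op_type, time_ms, time_ms / grand_total if grand_total else 0.0)
        for op_type, time_ms in ranked[:top_n]
    ]


# Made-up sample records.
records = [
    ("Default/conv1", "Conv2D", 6.0),
    ("Default/conv2", "Conv2D", 2.0),
    ("Default/bn1", "BatchNorm", 1.5),
    ("Default/relu1", "ReLU", 0.5),
]
summary = summarize_by_type(records)
for op_type, time_ms, share in summary:
    print(f"{op_type:10s} {time_ms:6.2f} ms {share:6.1%}")
```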
+
+![op_statistics.png](./images/op_statistics.PNG)
+
+Figure 4: Statistics for Operators
+
+Figure 4 displays the statistics table for the operators, including:
+
+- Choose All: Display statistics for the operators, including operator name, type, execution time, full scope name, and operator information. The table is sorted by execution time by default.
+- Choose Type: Display statistics for the operator types, including operator type name, execution time, execution frequency and proportion of total time. Users can click a row to query all the operators belonging to this type.
+- Search: There is a search box on the right, which can support fuzzy search for operators/operator types.
+
+#### MindData Performance Analysis
+
+The MindData performance analysis component analyses the execution of the data input pipeline for the training. The data input pipeline can be divided into three stages:
+the data process pipeline, data transfer from host to device, and data fetch on device. The component analyses the performance of each stage in detail and displays the results.
+
+![minddata_profile.png](./images/minddata_profile.png)
+
+Figure 5: MindData Performance Analysis
+
+Figure 5 displays the page of the MindData performance analysis component. It consists of two tabs: step gap and data process.
+
+The step gap tab is used to analyse whether there is a performance bottleneck in the three stages. Conclusions can be drawn from the data queue graphs:
+
+- The data queue size is the queue length at the moment the training fetches data from the queue on the device. If the data queue size is 0, the training waits until there is data in
+the queue. If the data queue size is above 0, the training can get data very quickly, which means MindData is not the bottleneck for this training step.
+- The host queue size can be used to infer the speed of data processing and data transfer. If the host queue size is 0, data is processed more slowly than it is transferred, so the data process stage needs to be sped up.
+- If the host queue size stays large and the data queue size stays very small, the data transfer may be the bottleneck.
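The three queue-size rules above can be written down as a small decision function. This is only a sketch: `diagnose_step_gap` is a hypothetical helper, the inputs are assumed to be one queue-size sample per step, and the `low` and `high` thresholds are illustrative values, since the text does not define what counts as "small" or "large".

```python
def diagnose_step_gap(data_queue_sizes, host_queue_sizes, low=1, high=8):
    """Apply the three queue-size rules to per-step samples.

    `low` and `high` are illustrative thresholds chosen for this sketch.
    """
    findings = []
    # Rule 1: an empty data queue means the training had to wait for data.
    empty_steps = sum(1 for size in data_queue_sizes if size == 0)
    if empty_steps:
        findings.append(f"training waited for data in {empty_steps} step(s)")
    # Rule 2: an empty host queue means data processing is the slow stage.
    if any(size == 0 for size in host_queue_sizes):
        findings.append("host queue ran empty: speed up data processing")
    # Rule 3: large host queue and small data queue suggest slow data transfer.
    host_large = bool(host_queue_sizes) and all(s >= high for s in host_queue_sizes)
    data_small = bool(data_queue_sizes) and all(s <= low for s in data_queue_sizes)
    if host_large and data_small:
        findings.append("host queue large, data queue small: check data transfer")
    return findings or ["MindData is not the bottleneck in the sampled steps"]


# Made-up samples: the first run is transfer-bound, the second is healthy.
transfer_bound = diagnose_step_gap([0, 1, 0, 1], [9, 10, 9, 10])
healthy = diagnose_step_gap([4, 5, 6, 5], [3, 4, 2, 3])
print(transfer_bound)
print(healthy)
```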
+
+![data_op_profile.png](./images/data_op_profile.png)
+
+Figure 6: Data Process Pipeline Analysis
+
+Figure 6 displays the page of data process pipeline analysis. The data queues are used to exchange data between the MindData operators. The data size of a queue reflects the
+data consumption speed of the operators and can be used to infer the bottleneck operator. The queue usage percentage is the average data size in the queue divided by the maximum size of the queue. The higher
+the usage percentage, the more data has accumulated in the queue. The graph at the bottom of the page shows the MindData pipeline operators with the data queues. Users can click a queue to see how
+its data size changes over time and which operators are connected to the queue. The data process pipeline can be analysed as follows:
+
+- When the input queue usage percentage of an operator is high and its output queue usage percentage is low, the operator may be the bottleneck;
+- For the leftmost operator, if the usage percentages of the queues on its right are all low, the operator may be the bottleneck;
+- For the rightmost operator, if the usage percentages of the queues on its left are all high, the operator may be the bottleneck.
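To make the usage percentage and the three rules concrete, the sketch below computes the usage of each queue from sampled sizes and then flags suspect operators. It assumes a simplified linear pipeline with exactly one queue between neighbouring operators. The function names, the `high` and `low` thresholds, the operator labels and the sample values are all assumptions made for this example.

```python
def usage_percentage(sizes, max_size):
    """Average sampled queue size divided by the maximum queue size."""
    return sum(sizes) / len(sizes) / max_size


def find_bottlenecks(operators, usage, high=0.8, low=0.2):
    """Flag operators whose input queue is full and output queue is empty.

    `operators` lists operator names from left to right; `usage[i]` is the
    usage of the queue between operators[i] and operators[i + 1].
    The leftmost operator has no input queue and the rightmost has no output
    queue, so only the remaining side is checked for them.
    """
    suspects = []
    last = len(operators) - 1
    for i, name in enumerate(operators):
        inputs = usage[i - 1:i] if i > 0 else []
        outputs = usage[i:i + 1] if i < last else []
        input_high = all(u >= high for u in inputs)
        output_low = all(u <= low for u in outputs)
        if input_high and output_low:
            suspects.append(name)
    return suspects


# Made-up samples for two queues with a maximum size of 16.
operators = ["MindRecordDataset", "MapOp", "BatchOp"]  # illustrative labels
usage = [
    usage_percentage([15, 16, 14, 15], max_size=16),
    usage_percentage([1, 0, 2, 1], max_size=16),
]
suspects = find_bottlenecks(operators, usage)
print(usage, suspects)
```

In this sample the queue in front of `MapOp` is almost full while the queue behind it is almost empty, so `MapOp` is the only operator flagged.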
+
+Some suggestions for optimizing the performance of MindData operators:
+
+- If the `Dataset` operator is the bottleneck, try to increase `num_parallel_workers`;
+- If a `GeneratorOp` type operator is the bottleneck, try to increase `num_parallel_workers` and replace the operator with `MindRecordDataset`;
+- If a `MapOp` type operator is the bottleneck, try to increase `num_parallel_workers`; if it is a Python operator, try to optimize the training script;
+- If a `BatchOp` type operator is the bottleneck, try to adjust `prefetch_size`.
+
+#### Timeline Analysis
+
+The Timeline component can display:
+
+- Which device each operator (AICore/AICPU operator) is executed on;
+- The MindSpore stream split strategy for this neural network;
+- The execution sequence and execution time of tasks on the device.
+
+Users can get the most detailed information from the Timeline:
+
+- At a high level, users can analyse whether the stream split strategy can be optimized and whether the step tail is too long;
+- At a low level, users can analyse the execution time of all the operators, etc.
+
+![timeline.png](./images/timeline.png)
+
+Figure 7: Timeline Analysis
+
+The Timeline consists of the following parts:
+
+- **Device and Stream List**: It will show the stream list on each device. Each stream consists of a series of tasks. One rectangle stands for one task, and the area stands for the execution time of the task;
+- **The Operator Information**: When we click one task, the corresponding operator of this task will be shown at the bottom. 
+
+The W/A/S/D keys can be used to zoom in and out of the Timeline graph.
+
+## Specifications
+
+- To limit the data size generated by the Profiler, MindInsight suggests keeping the number of profiled steps below 10 for large neural networks.
+- Parsing Timeline data is time consuming, and data from a few steps is usually enough for analysis. To speed up data parsing and UI
+display, the Profiler shows at most 20M of data (which contains information for more than 10 steps for large networks).
\ No newline at end of file
--- a/tutorials/source_en/advanced_use/visualization_tutorials.md
+++ b/tutorials/source_en/advanced_use/visualization_tutorials.md
@@ -17,8 +17,6 @@
        - [Model Lineage](#model-lineage)
        - [Dataset Lineage](#dataset-lineage)
        - [Scalars Comparision](#scalars-comparision)
-        - [Performance Profiler](#performance-profiler)
-            - [Operator Performance Analysis](#operator-performance-analysis)
  - [Specifications](#specifications)

 <!-- /TOC -->
@@ -229,42 +227,6 @@ In the saved files, `ms_output_after_hwopt.pb` is the computational graph after
 > - Currently MindSpore supports recording computational graph after operator fusion for Ascend 910 AI processor only.
 > - When using the Summary operator to collect data in training, 'HistogramSummary' operator affects performance, so please use as little as possible.

-### Collect Performance Profile Data
-
-To enable the performance profiling of neural networks, `MindInsight Profiler` APIs should be added into the script. At first, the `MindInsight Profiler` object need
-to be set after set context and before the network initialization. Then, at the end of the training, `Profiler.analyse` should be called to finish profiling and generate the perforamnce 
-analyse results.
-
-The sample code is as follows:
-
-```python
-from mindinsight.profiler import Profiler
-from mindspore import Model, nn, context
-
-
-def test_profiler():
-    # Init context env
-    context.set_context(mode=context.GRAPH_MODE, device_target='Ascend', device_id=int(os.environ["DEVICE_ID"]))
-    
-    # Init Profiler
-    profiler = Profiler(output_path='./data', is_detail=True, is_show_op_path=False, subgraph='all')
-    
-    # Init hyperparameter
-    epoch = 2
-    # Init network and Model
-    net = Net()
-    loss_fn = CrossEntropyLoss()
-    optim = MyOptimizer(learning_rate=0.01, params=network.trainable_params())
-    model = Model(net, loss_fn=loss_fn, optimizer=optim, metrics=None)  
-    # Prepare mindrecord_dataset for training
-    train_ds = create_mindrecord_dataset_for_training()
-    # Model Train
-    model.train(epoch, train_ds)
-    
-    # Profiler end
-    profiler.analyse()
-``` 
-
 ## MindInsight Commands

 ### View the command help information.
@@ -522,33 +484,6 @@ Figure 18 shows the scalars comparision function area, which allows you to view
 - Horizontal Axis: Select any of Step, Relative Time, and Absolute Time as the horizontal axis of the scalar curve.
 - Smoothness: Adjust the smoothness to smooth the scalar curve.

-### Performance Profiler
-
-Access the Performance Profiler by selecting a specific training from the training list.
-
-#### Operator Performance Analysis
-
-The operator performance analysis component is used to display the execution time of the operators during MindSpore run.
-
-![op_type_statistics.png](./images/op_type_statistics.PNG)
-
-Figure 19: Statistics for Operator Types
-
-Figure 19 displays the statistics for the operator types, including:
-
-- Choose pie or bar graph to show the proportion time occupied by each operator type. The time of one operator type is calculated by accumulating the execution time of operators belong to this type.   
-- Display top 20 operator types with longest execution time, show the proportion and execution time (ms) of each operator type.
-
-![op_statistics.png](./images/op_statistics.PNG)
-
-Figure 20: Statistics for Operators
-
-Figure 20 displays the statistics table for the operators, including:
-
-- Choose All: Display statistics for the operators, including operator name、type、excution time、full scope time、information etc. The table will be sorted by execution time by default.
-- Choose Type: Display statistics for the operator types, including operator type name、execution time、execution frequency and proportion of total time. Users can click on each line, querying for all the operators belong to this type.
-- Search: There is a search box on the right, which can support fuzzy search for operators/operator types.
-
 ## Specifications

 To limit time of listing summaries, MindInsight lists at most 999 summary items.
@@ -563,5 +498,3 @@ To ensure performance, MindInsight implements scalars comparision with the cache
 - The scalars comparision supports only for trainings in cache. 
 - The maximum of 15 latest trainings (sorted by modification time) can be retained in the cache.
 - The maximum of 5 trainings can be selected for scalars comparision at the same time.
-
-To limit the data size generated by the Profiler, MindInsight suggests that for large neural network, the profiled steps should better below 10.
--- a/tutorials/source_zh_cn/advanced_use/images/data_op_profile.png
+++ b/tutorials/source_zh_cn/advanced_use/images/data_op_profile.png
--- a/tutorials/source_zh_cn/advanced_use/images/minddata_profile.png
+++ b/tutorials/source_zh_cn/advanced_use/images/minddata_profile.png
--- a/tutorials/source_zh_cn/advanced_use/images/performance_overall.png
+++ b/tutorials/source_zh_cn/advanced_use/images/performance_overall.png
--- a/tutorials/source_zh_cn/advanced_use/images/step_trace.png
+++ b/tutorials/source_zh_cn/advanced_use/images/step_trace.png
--- a/tutorials/source_zh_cn/advanced_use/images/timeline.png
+++ b/tutorials/source_zh_cn/advanced_use/images/timeline.png
--- a/tutorials/source_zh_cn/advanced_use/performance_profiling.md
+++ b/tutorials/source_zh_cn/advanced_use/performance_profiling.md
+# 性能调试
+
+
+<!-- TOC -->
+
+- [性能调试](#性能调试)
+    - [概述](#概述)
+    - [操作流程](#操作流程)
+    - [准备训练脚本](#准备训练脚本)
+    - [启动MindInsight](#启动MindInsight)
+    - [性能分析](#性能分析)
+        - [迭代轨迹分析](#迭代轨迹分析)
+        - [算子性能分析](#算子性能分析)
+        - [MindData性能分析](#MindData性能分析)
+        - [timeline分析](#timeline分析)
+    - [规格](#规格)
+
+<!-- /TOC -->
+
+<a href="https://gitee.com/mindspore/docs/blob/master/tutorials/source_zh_cn/advanced_use/performance_profiling.md" target="_blank"><img src="../_static/logo_source.png"></a>
+
+## 概述
+将训练过程中的算子耗时等信息记录到文件中，通过可视化界面供用户查看分析，帮助用户更高效地调试神经网络性能。目前仅支持在Ascend芯片上的性能调试。
+
+## 操作流程
+
+- 准备训练脚本，并在训练脚本中调用性能调试接口，接着运行训练脚本。
+- 启动MindInsight，并通过启动参数指定profile文件目录，启动成功后，根据IP和端口访问可视化界面，默认访问地址为 `http://127.0.0.1:8080`。
+- 在训练列表找到对应训练，点击性能分析，即可在页面中查看训练性能数据。
+
+## 准备训练脚本
+
+为了收集神经网络的性能数据，需要在训练脚本中添加MindInsight Profiler接口。首先，在set context之后和初始化网络之前，需要初始化MindInsight `Profiler`对象；
+然后在训练结束后，调用`Profiler.analyse()`停止性能数据收集并生成性能分析结果。
+
+样例代码如下：
+
+```python
+import os
+
+from mindinsight.profiler import Profiler
+from mindspore import Model, nn, context
+
+
+def test_profiler():
+    # Init context env
+    context.set_context(mode=context.GRAPH_MODE, device_target='Ascend', device_id=int(os.environ["DEVICE_ID"]))
+    
+    # Init Profiler
+    profiler = Profiler(output_path='./data', is_detail=True, is_show_op_path=False, subgraph='all')
+    
+    # Init hyperparameter
+    epoch = 2
+    # Init network and Model (Net, CrossEntropyLoss and MyOptimizer are user-defined placeholders)
+    net = Net()
+    loss_fn = CrossEntropyLoss()
+    optim = MyOptimizer(learning_rate=0.01, params=net.trainable_params())
+    model = Model(net, loss_fn=loss_fn, optimizer=optim, metrics=None)  
+    # Prepare mindrecord_dataset for training
+    train_ds = create_mindrecord_dataset_for_training()
+    # Model Train
+    model.train(epoch, train_ds)
+    
+    # Profiler end
+    profiler.analyse()
+```
+
+## 启动MindInsight
+
+启动命令请参考[训练过程可视](https://www.mindspore.cn/tutorial/zh-CN/master/advanced_use/visualization_tutorials.html)
+中**MindInsight相关命令**小节。
+
+
+### 性能分析
+
+用户从训练列表中选择指定的训练，点击性能调试，可以查看该次训练的性能数据。
+
+![performance_overall.png](./images/performance_overall.png)
+
+图1： 性能数据总览
+
+图1展示了性能数据总览页面，包含了迭代轨迹（Step Trace）、算子性能、MindData性能和Timeline等组件的数据总体呈现。各组件展示的数据如下：
+
+- 迭代轨迹：将训练Step划分为几个阶段，统计每个阶段的耗时，按时间线进行展示；总览页展示了迭代轨迹图；
+- 算子性能：统计单算子以及各算子类型的执行时间，进行排序展示；总览页中展示了各算子类型时间占比的饼状图；
+- MindData性能：统计训练数据准备各阶段的性能情况；总览页中展示了各阶段性能可能存在瓶颈的step数目；
+- Timeline：按设备统计每个stream中task的耗时情况，在时间轴排列展示；总览页展示了Timeline中stream和task的汇总情况。
+
+用户可以点击查看详情链接，进入某个组件页面进行详细分析。MindInsight也会对性能数据进行分析，在左侧的智能小助手中给出性能调试的建议。
+
+#### 迭代轨迹分析
+
+使用迭代轨迹分析组件可以快速了解训练各阶段在总时长中的占比情况。迭代轨迹将训练的一个step划分为迭代间隙 (两次step执行的间隔时间)、前向与反向执行、all reduce、参数更新等几个阶段，
+并显示出每个阶段的时长，帮助用户定界出性能瓶颈所在的执行阶段。
+
+![step_trace.png](./images/step_trace.png)
+
+图2： 迭代轨迹分析
+
+图2展示了迭代轨迹分析页面。在迭代轨迹详情中，会展示各阶段在训练step中的起止时间，默认显示的是各step的平均值，用户也可以在下拉菜单选择某个step查看该step的迭代轨迹情况。
+在页面下方显示了迭代间隙、前后向计算、迭代拖尾时间（前后向计算结束到参数更新完成的时间）随着step的变化曲线等，用户可以据此判断某个阶段是否存在性能优化空间。
+
+迭代轨迹在做阶段划分时，需要识别前向计算开始的算子和反向计算结束的算子。为了降低用户使用Profiler的门槛，MindSpore会对这两个算子做自动识别，方法为：
+前向计算开始的算子指定为get_next算子之后连接的第一个算子，反向计算结束的算子指定为最后一次all reduce之前连接的算子。**Profiler不保证在所有情况下自动识别的结果和用户的预期一致，
+用户可以根据网络的特点自行调整**，调整方法如下：
+
+- 设置```FP_POINT```环境变量指定前向计算开始的算子，如```export FP_POINT=fp32_vars/conv2d/BatchNorm```
+- 设置```BP_POINT```环境变量指定反向计算结束的算子，如```export BP_POINT=loss_scale/gradients/AddN_70```
+
+#### 算子性能分析
+
+使用算子性能分析组件可以对MindSpore运行过程中的各个算子的执行时间进行统计展示。
+
+![op_type_statistics.png](./images/op_type_statistics.PNG)
+
+图3： 算子类别统计分析
+
+图3展示了按算子类别进行统计分析的结果，包含以下内容：
+
+- 可以选择饼图/柱状图展示各算子类别的时间占比，每个算子类别的执行时间会统计属于该类别的算子执行时间总和；
+- 统计前20个占比时间最长的算子类别，展示其时间所占的百分比以及具体的执行时间（毫秒）。
+
+![op_statistics.png](./images/op_statistics.PNG)
+
+图4： 算子统计分析
+
+图4展示了算子性能统计表，包含以下内容：
+
+- 选择全部：按单个算子的统计结果进行排序展示，展示维度包括算子名称、算子类型、算子执行时间、算子全scope名称、算子信息等；默认按算子执行时间排序；
+- 选择分类：按算子类别的统计结果进行排序展示，展示维度包括算子分类名称、算子类别执行时间、执行频次、占总时间的比例等。点击每个算子类别，可以进一步查看该类别下所有单个算子的统计信息；
+- 搜索：在右侧搜索框中输入字符串，支持对算子名称/类别进行模糊搜索。
+
+#### MindData性能分析
+
+使用MindData性能分析组件可以对训练数据准备过程进行性能分析。数据准备过程可以分为三个阶段：数据处理pipeline、数据发送至device以及device侧读取训练数据，MindData性能分析组件会对每个阶段的处理性能进行详细分析，并将分析结果进行展示。
+
+![minddata_profile.png](./images/minddata_profile.png)
+
+图5： MindData性能分析
+
+图5展示了MindData性能分析页面，包含迭代间隙和数据处理两个TAB页面。
+
+迭代间隙TAB页主要用来分析数据准备三个阶段是否存在性能瓶颈，数据队列图是分析判断的重要依据：
+
+- 数据队列Size代表Device侧从队列取数据时队列的长度，如果数据队列Size为0，则训练会一直等待，直到队列中有数据才会开始某个step的训练；如果数据队列Size大于0，则训练可以快速取到数据，MindData不是该step的瓶颈所在；
+- 主机队列Size可以推断出数据处理和发送速度，如果主机队列Size为0，表示数据处理速度慢而数据发送速度快，需要加快数据处理；
+- 如果主机队列Size一直较大，而数据队列的Size持续很小，则数据发送有可能存在性能瓶颈。
+
+![data_op_profile.png](./images/data_op_profile.png)
+
+图6： 数据处理Pipeline分析
+
+图6展示了数据处理TAB页面，可以对数据处理pipeline做进一步分析。不同的数据算子之间使用队列进行数据交换，队列的长度可以反映出算子处理数据的快慢，进而推断出pipeline中的瓶颈算子所在。
+算子队列的平均使用率代表队列中已有数据Size除以队列最大数据Size的平均值，使用率越高说明队列中数据积累越多。算子队列关系展示了数据处理pipeline中的算子以及它们之间的连接情况，点击某个
+队列可以在下方查看该队列中数据Size随着时间的变化曲线，以及与数据队列连接的算子信息等。对数据处理pipeline的分析有如下建议：
+
+- 当算子左边连接的Queue使用率都比较高，右边连接的Queue使用率比较低，该算子可能是性能瓶颈；
+- 对于最左侧的算子，如果其右边所有Queue的使用率都比较低，该算子可能是性能瓶颈；
+- 对于最右侧的算子，如果其左边所有Queue的使用率都比较高，该算子可能是性能瓶颈。
+
+对于不同的类型的MindData算子，有如下优化建议：
+
+- 如果Dataset算子是性能瓶颈，建议增加num_parallel_workers;
+- 如果GeneratorOp类型的算子是性能瓶颈，建议增加num_parallel_workers，并尝试将其替换为MindRecordDataset;
+- 如果MapOp类型的算子是性能瓶颈，建议增加num_parallel_workers，如果该算子为python算子，可以尝试优化脚本；
+- 如果BatchOp类型的算子是性能瓶颈，建议调整prefetch_size的大小。
+
+
+#### Timeline分析
+
+Timeline组件可以展示：
+
+- 算子分配到哪个设备（AICPU、AICore等）执行;
+- MindSpore对该网络的流切分策略；
+- 算子在Device上的执行序列和执行时长
+
+通过分析Timeline，用户可以对训练过程进行细粒度分析：从High Level层面，可以分析流切分方法是否合理、迭代间隙和拖尾时间是否过长等；从Low Level层面，可以分析
+算子执行时间等。
+
+![timeline.png](./images/timeline.png)
+
+图7： Timeline分析
+
+Timeline主要包含如下几个部分：
+
+- **Device及其stream list**: 包含device上的stream列表，每个stream由task执行序列组成，一个task是其中的一个小方块，大小代表执行时间长短；
+- **算子信息**: 选中某个task后，可以显示该task对应算子的信息，包括名称、type等
+
+可以使用W/A/S/D来放大、缩小地查看Timeline图信息
+
+
+## 规格
+
+- 为了控制性能测试时生成数据的大小，大型网络建议性能调试的step数目限制在10以内。
+- Timeline数据的解析比较耗时，且一般几个step的数据即足够分析出结果。出于数据解析和UI展示性能的考虑，Profiler最多展示20M数据（对大型网络20M可以显示10+ step的信息）。
+
+
+
+
+
--- a/tutorials/source_zh_cn/advanced_use/visualization_tutorials.md
+++ b/tutorials/source_zh_cn/advanced_use/visualization_tutorials.md
@@ -22,8 +22,6 @@
        - [模型溯源](#模型溯源)
        - [数据溯源](#数据溯源)
        - [对比看板](#对比看板)
-        - [性能调试](#性能调试)
-            - [算子性能分析](#算子性能分析)
    - [规格](#规格)

 <!-- /TOC -->
@@ -235,42 +233,6 @@ model.train(cnn_network, callbacks=[confusion_martrix])
 > - 目前MindSpore仅支持在Ascend 910 AI处理器上导出算子融合后的计算图。
 > - 在训练中使用Summary算子收集数据时，`HistogramSummary`算子会影响性能，所以请尽量少地使用。

-
-### 性能数据收集
-
-为了收集神经网络的性能数据，需要在训练脚本中添加`MindInsight Profiler`接口。首先，在set context之后和初始化网络之前，需要初始化`MindInsight Profiler`对象；
-然后在训练结束后，调用`Profiler.analyse`停止性能数据收集并生成性能分析结果。
-
-样例代码如下：
-
-```python
-from mindinsight.profiler import Profiler
-from mindspore import Model, nn, context
-
-
-def test_profiler():
-    # Init context env
-    context.set_context(mode=context.GRAPH_MODE, device_target='Ascend', device_id=int(os.environ["DEVICE_ID"]))
-    
-    # Init Profiler
-    profiler = Profiler(output_path='./data', is_detail=True, is_show_op_path=False, subgraph='all')
-    
-    # Init hyperparameter
-    epoch = 2
-    # Init network and Model
-    net = Net()
-    loss_fn = CrossEntropyLoss()
-    optim = MyOptimizer(learning_rate=0.01, params=network.trainable_params())
-    model = Model(net, loss_fn=loss_fn, optimizer=optim, metrics=None)  
-    # Prepare mindrecord_dataset for training
-    train_ds = create_mindrecord_dataset_for_training()
-    # Model Train
-    model.train(epoch, train_ds)
-    
-    # Profiler end
-    profiler.analyse()
-``` 
-
 ## MindInsight相关命令

 ### 查看命令帮助信息
@@ -528,33 +490,6 @@ gunicorn  <PID>  <USER>  <FD>  <TYPE>  <DEVICE>  <SIZE/OFF>  <NODE>  <WORKSPACE>
 - 水平轴：可以选择“步骤”、“相对时间”、“绝对时间”中的任意一项，来作为标量曲线的水平轴。
 - 平滑度：可以通过调整平滑度，对标量曲线进行平滑处理。

-### 性能调试
-
-用户从训练列表中选择指定的训练，进入性能调试。
-
-#### 算子性能分析
-
-使用算子性能分析组件可以对MindSpore运行过程中的各个算子的执行时间进行统计展示。
-
-![op_type_statistics.png](./images/op_type_statistics.PNG)
-
-图19： 算子类别统计分析
-
-图19展示了按算子类别进行统计分析的结果，包含以下内容：
-
-- 可以选择饼图/柱状图展示各算子类别的时间占比，每个算子类别的执行时间会统计属于该类别的算子执行时间总和；
-- 统计前20个占比时间最长的算子类别，展示其时间所占的百分比以及具体的执行时间（毫秒）。
-
-![op_statistics.png](./images/op_statistics.PNG)
-
-图20： 算子统计分析
-
-图20展示了算子性能统计表，包含以下内容：
-
-- 选择全部：按单个算子的统计结果进行排序展示，展示维度包括算子名称、算子类型、算子执行时间、算子全scope名称、算子信息等；默认按算子执行时间排序；
-- 选择分类：按算子类别的统计结果进行排序展示，展示维度包括算子分类名称、算子类别执行时间、执行频次、占总时间的比例等。点击每个算子类别，可以进一步查看该类别下所有单个算子的统计信息；
-- 搜索：在右侧搜索框中输入字符串，支持对算子名称/类别进行模糊搜索。
-
 ## 规格

 为了控制列出summary列表的用时，MindInsight最多支持发现999个summary列表条目。
@@ -569,5 +504,3 @@ gunicorn  <PID>  <USER>  <FD>  <TYPE>  <DEVICE>  <SIZE/OFF>  <NODE>  <WORKSPACE>
 - 对比看板只支持在缓存中的训练进行比较标量曲线对比。
 - 缓存最多保留最新（按修改时间排列）的15个训练。
 - 用户最多同时对比5个训练的标量曲线。
-
-为了控制性能测试时生成数据的大小，大型网络建议性能调试的step数目限制在10以内。