doc22_067.md

# torch.profiler

> 原文：[`pytorch.org/docs/stable/profiler.html`](https://pytorch.org/docs/stable/profiler.html)> 
>
> 译者：[飞龙](https://github.com/wizardforcel)
>
> 协议：[CC BY-NC-SA 4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/)


## 概述

PyTorch Profiler 是一个工具，允许在训练和推断过程中收集性能指标。Profiler 的上下文管理器 API 可用于更好地了解哪些模型操作符是最昂贵的，检查它们的输入形状和堆栈跟踪，研究设备内核活动并可视化执行跟踪。

注意

`torch.autograd`模块中的早期版本被视为遗留版本，并将被弃用。

## API 参考

```py
class torch.profiler._KinetoProfile(*, activities=None, record_shapes=False, profile_memory=False, with_stack=False, with_flops=False, with_modules=False, experimental_config=None)
```

低级别分析器包装自动梯度分析

参数

+   **activities**（*可迭代对象*）- 要在分析中使用的活动组（CPU、CUDA）列表，支持的值：`torch.profiler.ProfilerActivity.CPU`、`torch.profiler.ProfilerActivity.CUDA`。默认值：ProfilerActivity.CPU 和（如果可用）ProfilerActivity.CUDA。

+   **record_shapes**（[*bool*](https://docs.python.org/3/library/functions.html#bool "(in Python v3.12)")）- 保存有关操作符输入形状的信息。

+   **profile_memory**（[*bool*](https://docs.python.org/3/library/functions.html#bool "(in Python v3.12)")）- 跟踪张量内存分配/释放（有关更多详细信息，请参阅`export_memory_timeline`）。

+   **with_stack**（[*bool*](https://docs.python.org/3/library/functions.html#bool "(in Python v3.12)")）- 记录操作的源信息（文件和行号）。

+   **with_flops**（[*bool*](https://docs.python.org/3/library/functions.html#bool "(in Python v3.12)")）- 使用公式估计特定操作符的 FLOPS（矩阵乘法和 2D 卷积）。

+   **with_modules**（[*bool*](https://docs.python.org/3/library/functions.html#bool "(in Python v3.12)")）- 记录模块层次结构（包括函数名称），对应于操作的调用堆栈。例如，如果模块 A 的前向调用的模块 B 的前向包含一个 aten::add 操作，则 aten::add 的模块层次结构是 A.B 请注意，此支持目前仅适用于 TorchScript 模型，而不适用于急切模式模型。

+   **experimental_config**（*_ExperimentalConfig*）- 由像 Kineto 这样的分析器库使用的一组实验选项。请注意，不保证向后兼容性。

注意

此 API 是实验性的，未来可能会更改。

启用形状和堆栈跟踪会导致额外的开销。当指定 record_shapes=True 时，分析器将暂时保留对张量的引用；这可能进一步阻止依赖引用计数的某些优化，并引入额外的张量副本。

```py
add_metadata(key, value)
```

向跟踪文件中添加具有字符串键和字符串值的用户定义的元数据

```py
add_metadata_json(key, value)
```

向跟踪文件中添加具有字符串键和有效 json 值的用户定义的元数据

```py
events()
```

返回未聚合的分析器事件列表，用于在跟踪回调中使用或在分析完成后使用

```py
export_chrome_trace(path)
```

以 Chrome JSON 格式导出收集的跟踪信息。

```py
export_memory_timeline(path, device=None)
```

从收集的树中导出分析器的内存事件信息，用于给定设备，并导出时间线图。使用`export_memory_timeline`有 3 个可导出的文件，每个文件由`path`的后缀控制。

+   要生成 HTML 兼容的绘图，请使用后缀`.html`，内存时间线图将嵌入到 HTML 文件中作为 PNG 文件。

+   对于由`[times, [sizes by category]]`组成的绘图点，其中`times`是时间戳，`sizes`是每个类别的内存使用量。内存时间线图将保存为 JSON（`.json`）或经过 gzip 压缩的 JSON（`.json.gz`），具体取决于后缀。

+   对于原始内存点，请使用后缀`.raw.json.gz`。每个原始内存事件将包括`(时间戳，操作，字节数，类别)`，其中`操作`是`[PREEXISTING, CREATE, INCREMENT_VERSION, DESTROY]`之一，`类别`是`torch.profiler._memory_profiler.Category`中的枚举之一。

输出：内存时间线以 gzipped JSON、JSON 或 HTML 形式编写。

```py
export_stacks(path, metric='self_cpu_time_total')
```

将堆栈跟踪保存在适合可视化的文件中。

参数

+   **path**（[*str*](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)")）- 将堆栈文件保存到此位置；

+   **metric**（[*str*](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)")）- 要使用的度量标准：“self_cpu_time_total”或“self_cuda_time_total”

注意

使用 FlameGraph 工具的示例：

+   git clone [`github.com/brendangregg/FlameGraph`](https://github.com/brendangregg/FlameGraph)

+   cd FlameGraph

+   ./flamegraph.pl –title “CPU time” –countname “us.” profiler.stacks > perf_viz.svg

```py
key_averages(group_by_input_shape=False, group_by_stack_n=0)
```

通过运算符名称和（可选）输入形状和堆栈对事件进行平均分组。

注意

要使用形状/堆栈功能，请确保在创建分析器上下文管理器时设置 record_shapes/with_stack。

```py
class torch.profiler.profile(*, activities=None, schedule=None, on_trace_ready=None, record_shapes=False, profile_memory=False, with_stack=False, with_flops=False, with_modules=False, experimental_config=None, use_cuda=None)
```

分析器上下文管理器。

参数

+   **activities**（*iterable*）- 用于分析的活动组（CPU，CUDA）列表，支持的值：`torch.profiler.ProfilerActivity.CPU`，`torch.profiler.ProfilerActivity.CUDA`。默认值：ProfilerActivity.CPU 和（如果可用）ProfilerActivity.CUDA。

+   **schedule**（*Callable*）- 接受步骤（int）作为单个参数并返回指定在每个步骤执行的分析器操作的`ProfilerAction`值的可调用对象。

+   **on_trace_ready**（*Callable*）- 在分析期间`schedule`返回`ProfilerAction.RECORD_AND_SAVE`时在每个步骤调用的可调用对象。

+   **record_shapes**（[*bool*](https://docs.python.org/3/library/functions.html#bool "(在 Python v3.12 中)")）- 保存有关运算符输入形状的信息。

+   **profile_memory**（[*bool*](https://docs.python.org/3/library/functions.html#bool "(在 Python v3.12 中)")）- 跟踪张量内存分配/释放。

+   **with_stack**（[*bool*](https://docs.python.org/3/library/functions.html#bool "(在 Python v3.12 中)")）- 记录操作的源信息（文件和行号）。

+   **with_flops**（[*bool*](https://docs.python.org/3/library/functions.html#bool "(在 Python v3.12 中)")）- 使用公式估算特定运算符（矩阵乘法和 2D 卷积）的 FLOPs（浮点运算）。

+   **with_modules**（[*bool*](https://docs.python.org/3/library/functions.html#bool "(在 Python v3.12 中)")）- 记录与操作的调用堆栈对应的模块层次结构（包括函数名称）。例如，如果模块 A 的前向调用模块 B 的前向，其中包含一个 aten::add 操作，则 aten::add 的模块层次结构是 A.B。请注意，此支持目前仅适用于 TorchScript 模型，而不适用于急切模式模型。

+   **experimental_config**（*_ExperimentalConfig*）- 用于 Kineto 库功能的一组实验选项。请注意，不保证向后兼容性。

+   **use_cuda**（[*bool*](https://docs.python.org/3/library/functions.html#bool "(在 Python v3.12 中)")）-

    自 1.8.1 版本起已弃用：请改用`activities`。

注意

使用`schedule()`生成可调度的调度。非默认调度在分析长时间训练作业时很有用，并允许用户在训练过程的不同迭代中获取多个跟踪。默认调度仅在上下文管理器的持续时间内连续记录所有事件。

注意

使用`tensorboard_trace_handler()`生成 TensorBoard 的结果文件：

`on_trace_ready=torch.profiler.tensorboard_trace_handler(dir_name)`

分析后，结果文件可以在指定目录中找到。使用命令：

`tensorboard --logdir dir_name`

在 TensorBoard 中查看结果。有关更多信息，请参阅[PyTorch Profiler TensorBoard 插件](https://github.com/pytorch/kineto/tree/master/tb_plugin)

注意

启用形状和堆栈跟踪会导致额外的开销。当指定 record_shapes=True 时，分析器将暂时保留对张量的引用；这可能进一步阻止依赖引用计数的某些优化，并引入额外的张量副本。

示例：

```py
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as p:
    code_to_profile()
print(p.key_averages().table(
    sort_by="self_cuda_time_total", row_limit=-1)) 
```

使用分析器的`schedule`、`on_trace_ready`和`step`函数：

```py
# Non-default profiler schedule allows user to turn profiler on and off
# on different iterations of the training loop;
# trace_handler is called every time a new trace becomes available
def trace_handler(prof):
    print(prof.key_averages().table(
        sort_by="self_cuda_time_total", row_limit=-1))
    # prof.export_chrome_trace("/tmp/test_trace_" + str(prof.step_num) + ".json")

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],

    # In this example with wait=1, warmup=1, active=2, repeat=1,
    # profiler will skip the first step/iteration,
    # start warming up on the second, record
    # the third and the forth iterations,
    # after which the trace will become available
    # and on_trace_ready (when set) is called;
    # the cycle repeats starting with the next step

    schedule=torch.profiler.schedule(
        wait=1,
        warmup=1,
        active=2,
        repeat=1),
    on_trace_ready=trace_handler
    # on_trace_ready=torch.profiler.tensorboard_trace_handler('./log')
    # used when outputting for tensorboard
    ) as p:
        for iter in range(N):
            code_iteration_to_profile(iter)
            # send a signal to the profiler that the next iteration has started
            p.step() 
```

```py
step()
```

信号分析器下一个分析步骤已经开始。

```py
class torch.profiler.ProfilerAction(value)
```

在指定间隔可以执行的分析器操作

```py
class torch.profiler.ProfilerActivity
```

成员：

CPU

XPU

MTIA

CUDA

```py
property name
```

```py
torch.profiler.schedule(*, wait, warmup, active, repeat=0, skip_first=0)
```

返回一个可用作分析器`schedule`参数的可调用对象。分析器将跳过前`skip_first`步，然后等待`wait`步，然后为接下来的`warmup`步进行预热，然后为接下来的`active`步进行活动记录，然后重复以`wait`步开始的循环。循环的可选次数由`repeat`参数指定，零值表示循环将持续直到分析完成。

返回类型

[*Callable*](https://docs.python.org/3/library/typing.html#typing.Callable "(在 Python v3.12 中)")

```py
torch.profiler.tensorboard_trace_handler(dir_name, worker_name=None, use_gzip=False)
```

将跟踪文件输出到`dir_name`目录，然后该目录可以直接作为 logdir 传递给 tensorboard。在分布式场景中，`worker_name`应该对每个 worker 是唯一的，默认情况下将设置为‘[hostname]_[pid]’。

## 英特尔仪器和跟踪技术 APIs

```py
torch.profiler.itt.is_available()
```

检查 ITT 功能是否可用

```py
torch.profiler.itt.mark(msg)
```

描述在某个时间点发生的瞬时事件。

参数

**msg** ([*str*](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)")) – 与事件关联的 ASCII 消息。

```py
torch.profiler.itt.range_push(msg)
```

将范围推送到嵌套范围堆栈上。返回开始的范围的从零开始的深度。

参数

**msg** ([*str*](https://docs.python.org/3/library/stdtypes.html#str "(在 Python v3.12 中)")) – 与范围关联的 ASCII 消息

```py
torch.profiler.itt.range_pop()
```

从嵌套范围跨度堆栈中弹出一个范围。返回结束的范围的从零开始的深度。