52.md

# 分析您的 PyTorch 模块

> 原文：<https://pytorch.org/tutorials/beginner/profiler.html>

**作者：** [Suraj Subramanian](https://github.com/suraj813)

PyTorch 包含一个探查器 API，可用于识别代码中各种 PyTorch 操作的时间和内存成本。 Profiler 可以轻松集成到您的代码中，结果可以打印为表格或在 JSON 跟踪文件中显示。

注意

Profiler 支持多线程模型。 Profiler 与该操作在同一线程中运行，但它还将对可能在另一个线程中运行的子运算符进行概要分析。 同时运行的探查器的作用域将限制在其自己的线程中，以防止结果混淆。

转到[此秘籍](https://pytorch.org/tutorials/recipes/recipes/profiler.html)，可以更快地了解 Profiler API 的用法。

* * *

```py
import torch
import numpy as np
from torch import nn
import torch.autograd.profiler as profiler

```

## 使用 Profiler 的性能调试

Profiler 有助于识别模型中的性能瓶颈。 在此示例中，我们构建了一个自定义模块，该模块执行两个子任务：

*   输入的线性变换，以及
*   使用转换结果来获取遮罩张量上的索引。

我们使用`profiler.record_function("label")`将每个子任务的代码包装在单独的带标签的上下文管理器中。 在事件探查器输出中，子任务中所有操作的综合性能指标将显示在其相应的标签下。

请注意，使用 Profiler 会产生一些开销，并且最好仅用于调查代码。 如果要对运行时进行基准测试，请记住将其删除。

```py
class MyModule(nn.Module):
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super(MyModule, self).__init__()
        self.linear = nn.Linear(in_features, out_features, bias)

    def forward(self, input, mask):
        with profiler.record_function("LINEAR PASS"):
            out = self.linear(input)

        with profiler.record_function("MASK INDICES"):
            threshold = out.sum(axis=1).mean().item()
            hi_idx = np.argwhere(mask.cpu().numpy() > threshold)
            hi_idx = torch.from_numpy(hi_idx).cuda()

        return out, hi_idx

```

## 分析正向传播

我们初始化随机输入和蒙版张量以及模型。

在运行探查器之前，我们需要对 CUDA 进行预热，以确保进行准确的性能基准测试。 我们将模块的正向传播包装在`profiler.profile`上下文管理器中。 `with_stack=True`参数在跟踪中附加操作的文件和行号。

警告

`with_stack=True`会产生额外的开销，并且更适合于研究代码。 如果要对性能进行基准测试，请记住将其删除。

```py
model = MyModule(500, 10).cuda()
input = torch.rand(128, 500).cuda()
mask = torch.rand((500, 500, 500), dtype=torch.double).cuda()

# warm-up
model(input, mask)

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    out, idx = model(input, mask)

```

## 打印分析器结果

最后，我们打印分析器结果。 `profiler.key_averages`通过运算符名称，以及可选地通过输入形状和/或栈跟踪事件来聚合结果。 按输入形状分组有助于识别模型使用哪些张量形状。

在这里，我们使用`group_by_stack_n=5`通过操作及其回溯（截断为最近的 5 个事件）聚合运行时，并按事件注册的顺序显示事件。 还可以通过传递`sort_by`参数对表进行排序（有关有效的排序键，请参阅[文档](https://pytorch.org/docs/stable/autograd.html#profiler)）。

注意

在笔记本中运行 Profiler 时，您可能会在栈跟踪中看到`<ipython-input-18-193a910735e8>(13): forward`之类的条目，而不是文件名。 这些对应于`<notebook-cell>(line number): calling-function`。

```py
print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))

"""
(Some columns are omitted)

-------------  ------------  ------------  ------------  ---------------------------------
         Name    Self CPU %      Self CPU  Self CPU Mem   Source Location
-------------  ------------  ------------  ------------  ---------------------------------
 MASK INDICES        87.88%        5.212s    -953.67 Mb  /mnt/xarfuse/.../torch/au
                                                         <ipython-input-...>(10): forward
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(9): <module>
                                                         /mnt/xarfuse/.../IPython/

  aten::copy_        12.07%     715.848ms           0 b  <ipython-input-...>(12): forward
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(9): <module>
                                                         /mnt/xarfuse/.../IPython/
                                                         /mnt/xarfuse/.../IPython/

  LINEAR PASS         0.01%     350.151us         -20 b  /mnt/xarfuse/.../torch/au
                                                         <ipython-input-...>(7): forward
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(9): <module>
                                                         /mnt/xarfuse/.../IPython/

  aten::addmm         0.00%     293.342us           0 b  /mnt/xarfuse/.../torch/nn
                                                         /mnt/xarfuse/.../torch/nn
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(8): forward
                                                         /mnt/xarfuse/.../torch/nn

   aten::mean         0.00%     235.095us           0 b  <ipython-input-...>(11): forward
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(9): <module>
                                                         /mnt/xarfuse/.../IPython/
                                                         /mnt/xarfuse/.../IPython/

-----------------------------  ------------  ---------- ----------------------------------
Self CPU time total: 5.931s

"""

```

## 提高内存性能

请注意，就内存和时间而言，最昂贵的操作位于`forward (10)`，代表掩码索引中的操作。 让我们尝试先解决内存消耗问题。 我们可以看到第 12 行的`.to()`操作消耗 953.67 Mb。 该操作将`mask`复制到 CPU。 `mask`使用`torch.double`数据类型初始化。 我们可以通过将其转换为`torch.float`来减少内存占用吗？

```py
model = MyModule(500, 10).cuda()
input = torch.rand(128, 500).cuda()
mask = torch.rand((500, 500, 500), dtype=torch.float).cuda()

# warm-up
model(input, mask)

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    out, idx = model(input, mask)

print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))

"""
(Some columns are omitted)

-----------------  ------------  ------------  ------------  --------------------------------
             Name    Self CPU %      Self CPU  Self CPU Mem   Source Location
-----------------  ------------  ------------  ------------  --------------------------------
     MASK INDICES        93.61%        5.006s    -476.84 Mb  /mnt/xarfuse/.../torch/au
                                                             <ipython-input-...>(10): forward
                                                             /mnt/xarfuse/  /torch/nn
                                                             <ipython-input-...>(9): <module>
                                                             /mnt/xarfuse/.../IPython/

      aten::copy_         6.34%     338.759ms           0 b  <ipython-input-...>(12): forward
                                                             /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(9): <module>
                                                             /mnt/xarfuse/.../IPython/
                                                             /mnt/xarfuse/.../IPython/

 aten::as_strided         0.01%     281.808us           0 b  <ipython-input-...>(11): forward
                                                             /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(9): <module>
                                                             /mnt/xarfuse/.../IPython/
                                                             /mnt/xarfuse/.../IPython/

      aten::addmm         0.01%     275.721us           0 b  /mnt/xarfuse/.../torch/nn
                                                             /mnt/xarfuse/.../torch/nn
                                                             /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(8): forward
                                                             /mnt/xarfuse/.../torch/nn

      aten::_local        0.01%     268.650us           0 b  <ipython-input-...>(11): forward
      _scalar_dense                                          /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(9): <module>
                                                             /mnt/xarfuse/.../IPython/
                                                             /mnt/xarfuse/.../IPython/

-----------------  ------------  ------------  ------------  --------------------------------
Self CPU time total: 5.347s

"""

```

此操作的 CPU 内存占用量减少了一半。

## 提高时间表现

虽然所消耗的时间也有所减少，但仍然太高。 原来，将矩阵从 CUDA 复制到 CPU 非常昂贵！ `forward (12)`中的`aten::copy_`运算符将`mask`复制到 CPU，以便可以使用 NumPy `argwhere`函数。 `forward(13)`处的`aten::copy_`将数组作为张量复制回 CUDA。 如果我们在这里使用`torch`函数`nonzero()`，则可以消除这两个方面。

```py
class MyModule(nn.Module):
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super(MyModule, self).__init__()
        self.linear = nn.Linear(in_features, out_features, bias)

    def forward(self, input, mask):
        with profiler.record_function("LINEAR PASS"):
            out = self.linear(input)

        with profiler.record_function("MASK INDICES"):
            threshold = out.sum(axis=1).mean()
            hi_idx = (mask > threshold).nonzero(as_tuple=True)

        return out, hi_idx

model = MyModule(500, 10).cuda()
input = torch.rand(128, 500).cuda()
mask = torch.rand((500, 500, 500), dtype=torch.float).cuda()

# warm-up
model(input, mask)

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    out, idx = model(input, mask)

print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))

"""
(Some columns are omitted)

--------------  ------------  ------------  ------------  ---------------------------------
          Name    Self CPU %      Self CPU  Self CPU Mem   Source Location
--------------  ------------  ------------  ------------  ---------------------------------
      aten::gt        57.17%     129.089ms           0 b  <ipython-input-...>(12): forward
                                                          /mnt/xarfuse/.../torch/nn
                                                          <ipython-input-...>(25): <module>
                                                          /mnt/xarfuse/.../IPython/
                                                          /mnt/xarfuse/.../IPython/

 aten::nonzero        37.38%      84.402ms           0 b  <ipython-input-...>(12): forward
                                                          /mnt/xarfuse/.../torch/nn
                                                          <ipython-input-...>(25): <module>
                                                          /mnt/xarfuse/.../IPython/
                                                          /mnt/xarfuse/.../IPython/

   INDEX SCORE         3.32%       7.491ms    -119.21 Mb  /mnt/xarfuse/.../torch/au
                                                          <ipython-input-...>(10): forward
                                                          /mnt/xarfuse/.../torch/nn
                                                          <ipython-input-...>(25): <module>
                                                          /mnt/xarfuse/.../IPython/

aten::as_strided         0.20%    441.587us          0 b  <ipython-input-...>(12): forward
                                                          /mnt/xarfuse/.../torch/nn
                                                          <ipython-input-...>(25): <module>
                                                          /mnt/xarfuse/.../IPython/
                                                          /mnt/xarfuse/.../IPython/

 aten::nonzero
     _numpy             0.18%     395.602us           0 b  <ipython-input-...>(12): forward
                                                          /mnt/xarfuse/.../torch/nn
                                                          <ipython-input-...>(25): <module>
                                                          /mnt/xarfuse/.../IPython/
                                                          /mnt/xarfuse/.../IPython/
--------------  ------------  ------------  ------------  ---------------------------------
Self CPU time total: 225.801ms

"""

```

## 进一步阅读

我们已经看到了 Profiler 如何用于调查 PyTorch 模型中的时间和内存瓶颈。 在此处阅读有关 Profiler 的更多信息：

*   [事件探查器使用秘籍](https://pytorch.org/tutorials/recipes/recipes/profiler.html)
*   [分析基于 RPC 的工作负载](https://pytorch.org/tutorials/recipes/distributed_rpc_profiling.html)
*   [Profiler API 文档](https://pytorch.org/docs/stable/autograd.html?highlight=profiler#profiler)

**脚本的总运行时间**：（0 分钟 0.000 秒）

[下载 Python 源码：`profiler.py`](../_downloads/390e82110dc76e71b26225b3f9020e14/profiler.py)

[下载 Jupyter 笔记本：`profiler.ipynb`](../_downloads/28071a0f69f5106129ad8a68a47af061/profiler.ipynb)

[由 Sphinx 画廊](https://sphinx-gallery.readthedocs.io)生成的画廊