Commit 26c2d4ae authored by 绝不原创的飞龙

2024-02-04 13:17:19

Parent 6a912f80
......@@ -212,11 +212,11 @@
prefs: []
type: TYPE_NORMAL
zh: 该函数返回一个数据框元组。第一个数据框包含每个 rank 上每个流的各类别空闲时间。
- en: '[![../_images/idle_time.png](../Images/804d1bbaf4c125dff21648945b3082ff.png)](../_images/idle_time.png)'
- en: '![../_images/idle_time.png](../Images/804d1bbaf4c125dff21648945b3082ff.png)'
id: totrans-34
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/idle_time.png](../Images/804d1bbaf4c125dff21648945b3082ff.png)](../_images/idle_time.png)'
zh: '![../_images/idle_time.png](../Images/804d1bbaf4c125dff21648945b3082ff.png)'
- en: The second dataframe is generated when `show_idle_interval_stats` is set to
`True`. It contains the summary statistics of the idle time for each stream on
each rank.
......@@ -224,11 +224,11 @@
prefs: []
type: TYPE_NORMAL
zh: 第二个数据框是在将`show_idle_interval_stats`设置为`True`时生成的。它包含每个流在每个rank上的空闲时间的摘要统计信息。
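The call that produces the two dataframes described above can be sketched as follows. This is a minimal sketch assuming the HolisticTraceAnalysis (`hta`) package is installed and that `trace_dir` points at a directory of per-rank Kineto trace files; the import is kept lazy so the sketch stands alone:

```python
def idle_time_breakdown(trace_dir):
    """Return (idle_category_df, interval_stats_df) for the traces in trace_dir.

    Sketch assuming the HolisticTraceAnalysis (hta) package is installed and
    trace_dir contains one Kineto trace file per rank.
    """
    from hta.trace_analysis import TraceAnalysis  # lazy: optional dependency

    analyzer = TraceAnalysis(trace_dir=trace_dir)
    # The second dataframe is only produced because
    # show_idle_interval_stats=True, as described above.
    return analyzer.get_idle_time_breakdown(show_idle_interval_stats=True)
```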
- en: '[![../_images/idle_time_summary.png](../Images/0d0f42e11aa0c33b2fe4b1b2dcdc3d20.png)](../_images/idle_time_summary.png)'
- en: '![../_images/idle_time_summary.png](../Images/0d0f42e11aa0c33b2fe4b1b2dcdc3d20.png)'
id: totrans-36
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/idle_time_summary.png](../Images/0d0f42e11aa0c33b2fe4b1b2dcdc3d20.png)](../_images/idle_time_summary.png)'
zh: '![../_images/idle_time_summary.png](../Images/0d0f42e11aa0c33b2fe4b1b2dcdc3d20.png)'
- en: Tip
id: totrans-37
prefs: []
......@@ -412,11 +412,11 @@
prefs: []
type: TYPE_NORMAL
zh: 该函数返回一个包含每个rank的重叠百分比的数据框。
- en: '[![../_images/overlap_df.png](../Images/22a0d906eede5591c1d5935dba1324f4.png)](../_images/overlap_df.png)'
- en: '![../_images/overlap_df.png](../Images/22a0d906eede5591c1d5935dba1324f4.png)'
id: totrans-66
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/overlap_df.png](../Images/22a0d906eede5591c1d5935dba1324f4.png)](../_images/overlap_df.png)'
zh: '![../_images/overlap_df.png](../Images/22a0d906eede5591c1d5935dba1324f4.png)'
- en: When the `visualize` argument is set to True, the [get_comm_comp_overlap](https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_comm_comp_overlap)
function also generates a bar graph representing the overlap by rank.
id: totrans-67
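The overlap dataframe and the bar graph come from a single call; a sketch under the same assumptions as before (the `hta` package is installed and `trace_dir` holds per-rank trace files):

```python
def comm_comp_overlap(trace_dir):
    """Return the per-rank communication/computation overlap dataframe.

    Sketch assuming the HolisticTraceAnalysis (hta) package is installed.
    """
    from hta.trace_analysis import TraceAnalysis  # lazy: optional dependency

    analyzer = TraceAnalysis(trace_dir=trace_dir)
    # visualize=True additionally renders the bar graph of overlap by rank.
    return analyzer.get_comm_comp_overlap(visualize=True)
```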
......@@ -474,11 +474,11 @@
prefs: []
type: TYPE_NORMAL
zh: 生成的带有增强计数器的跟踪文件的屏幕截图。
- en: '[![../_images/mem_bandwidth_queue_length.png](../Images/7b09c2f07fe7daff2c296c3c17fec795.png)](../_images/mem_bandwidth_queue_length.png)'
- en: '![../_images/mem_bandwidth_queue_length.png](../Images/7b09c2f07fe7daff2c296c3c17fec795.png)'
id: totrans-75
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/mem_bandwidth_queue_length.png](../Images/7b09c2f07fe7daff2c296c3c17fec795.png)](../_images/mem_bandwidth_queue_length.png)'
zh: '![../_images/mem_bandwidth_queue_length.png](../Images/7b09c2f07fe7daff2c296c3c17fec795.png)'
- en: 'HTA also provides a summary of the memory copy bandwidth and queue length counters
as well as the time series of the counters for the profiled portion of the code
using the following API:'
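The concrete API call was elided by the diff above; the sketch below uses the summary-method names from the HTA trace-analysis documentation, which should be treated as assumptions if your HTA version differs:

```python
def counter_summaries(trace_dir):
    """Summary statistics of memory-copy bandwidth and queue-length counters.

    Sketch: method names follow the hta trace-analysis API docs and may vary
    across HTA versions.
    """
    from hta.trace_analysis import TraceAnalysis  # lazy: optional dependency

    analyzer = TraceAnalysis(trace_dir=trace_dir)
    return (
        analyzer.get_memory_bw_summary(),
        analyzer.get_queue_length_summary(),
    )
```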
......@@ -526,11 +526,11 @@
prefs: []
type: TYPE_NORMAL
zh: 摘要包含计数、最小值、最大值、平均值、标准差以及第 25、50 和 75 百分位数。
- en: '[![../_images/queue_length_summary.png](../Images/c176e0b671c636afdb57c7dcde4ec7b2.png)](../_images/queue_length_summary.png)'
- en: '![../_images/queue_length_summary.png](../Images/c176e0b671c636afdb57c7dcde4ec7b2.png)'
id: totrans-84
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/queue_length_summary.png](../Images/c176e0b671c636afdb57c7dcde4ec7b2.png)](../_images/queue_length_summary.png)'
zh: '![../_images/queue_length_summary.png](../Images/c176e0b671c636afdb57c7dcde4ec7b2.png)'
- en: The time series only contains the points when a value changes. Once a value
is observed the time series stays constant until the next update. The memory bandwidth
and queue length time series functions return a dictionary whose key is the rank
......@@ -572,11 +572,11 @@
prefs: []
type: TYPE_NORMAL
zh: 下面给出了生成的数据框的屏幕截图。
- en: '[![../_images/cuda_kernel_launch_stats.png](../Images/f08d3cd24db3c350255e51c1217848bf.png)](../_images/cuda_kernel_launch_stats.png)'
- en: '![../_images/cuda_kernel_launch_stats.png](../Images/f08d3cd24db3c350255e51c1217848bf.png)'
id: totrans-91
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/cuda_kernel_launch_stats.png](../Images/f08d3cd24db3c350255e51c1217848bf.png)](../_images/cuda_kernel_launch_stats.png)'
zh: '![../_images/cuda_kernel_launch_stats.png](../Images/f08d3cd24db3c350255e51c1217848bf.png)'
- en: 'The duration of the CPU op, GPU kernel, and the launch delay allow us to find
the following:'
id: totrans-92
......
......@@ -90,11 +90,11 @@
prefs: []
type: TYPE_NORMAL
zh: 我们可以看到 x 的梯度本身是 x 的函数(dout/dx = 2x),并且该函数的计算图已被正确构建。
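A minimal runnable sketch of the situation described here. `Square` is an illustrative custom function whose backward is composed of differentiable ops, so autograd can build the graph for dout/dx = 2x and differentiate it a second time:

```python
import torch


class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x ** 2

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Built from differentiable ops, so double backward works out of the box.
        return grad_out * 2 * x


x = torch.tensor(3.0, requires_grad=True)
y = Square.apply(x)
# create_graph=True records the backward pass so it can itself be differentiated.
(g,) = torch.autograd.grad(y, x, create_graph=True)  # dy/dx = 2x
(gg,) = torch.autograd.grad(g, x)                    # d2y/dx2 = 2
```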
- en: '[![https://user-images.githubusercontent.com/13428986/126559699-e04f3cb1-aaf2-4a9a-a83d-b8767d04fbd9.png](../Images/664c9393ebdb32f044c3ab5f5780b3f7.png)](https://user-images.githubusercontent.com/13428986/126559699-e04f3cb1-aaf2-4a9a-a83d-b8767d04fbd9.png)'
- en: '![https://user-images.githubusercontent.com/13428986/126559699-e04f3cb1-aaf2-4a9a-a83d-b8767d04fbd9.png](../Images/664c9393ebdb32f044c3ab5f5780b3f7.png)'
id: totrans-14
prefs: []
type: TYPE_NORMAL
zh: '[![https://user-images.githubusercontent.com/13428986/126559699-e04f3cb1-aaf2-4a9a-a83d-b8767d04fbd9.png](../Images/664c9393ebdb32f044c3ab5f5780b3f7.png)](https://user-images.githubusercontent.com/13428986/126559699-e04f3cb1-aaf2-4a9a-a83d-b8767d04fbd9.png)'
zh: '![https://user-images.githubusercontent.com/13428986/126559699-e04f3cb1-aaf2-4a9a-a83d-b8767d04fbd9.png](../Images/664c9393ebdb32f044c3ab5f5780b3f7.png)'
- en: Saving the Outputs
id: totrans-15
prefs:
......@@ -122,11 +122,11 @@
prefs: []
type: TYPE_PRE
zh: '[PRE3]'
- en: '[![https://user-images.githubusercontent.com/13428986/126559780-d141f2ba-1ee8-4c33-b4eb-c9877b27a954.png](../Images/7ab379f6d65d456373fdf6a3cdb35b1a.png)](https://user-images.githubusercontent.com/13428986/126559780-d141f2ba-1ee8-4c33-b4eb-c9877b27a954.png)'
- en: '![https://user-images.githubusercontent.com/13428986/126559780-d141f2ba-1ee8-4c33-b4eb-c9877b27a954.png](../Images/7ab379f6d65d456373fdf6a3cdb35b1a.png)'
id: totrans-20
prefs: []
type: TYPE_NORMAL
zh: '[![https://user-images.githubusercontent.com/13428986/126559780-d141f2ba-1ee8-4c33-b4eb-c9877b27a954.png](../Images/7ab379f6d65d456373fdf6a3cdb35b1a.png)](https://user-images.githubusercontent.com/13428986/126559780-d141f2ba-1ee8-4c33-b4eb-c9877b27a954.png)'
zh: '![https://user-images.githubusercontent.com/13428986/126559780-d141f2ba-1ee8-4c33-b4eb-c9877b27a954.png](../Images/7ab379f6d65d456373fdf6a3cdb35b1a.png)'
- en: Saving Intermediate Results
id: totrans-21
prefs:
......@@ -174,11 +174,11 @@
prefs: []
type: TYPE_PRE
zh: '[PRE5]'
- en: '[![https://user-images.githubusercontent.com/13428986/126560494-e48eba62-be84-4b29-8c90-a7f6f40b1438.png](../Images/66f87d1f09778a82307fefa72409569c.png)](https://user-images.githubusercontent.com/13428986/126560494-e48eba62-be84-4b29-8c90-a7f6f40b1438.png)'
- en: '![https://user-images.githubusercontent.com/13428986/126560494-e48eba62-be84-4b29-8c90-a7f6f40b1438.png](../Images/66f87d1f09778a82307fefa72409569c.png)'
id: totrans-29
prefs: []
type: TYPE_NORMAL
zh: '[![https://user-images.githubusercontent.com/13428986/126560494-e48eba62-be84-4b29-8c90-a7f6f40b1438.png](../Images/66f87d1f09778a82307fefa72409569c.png)](https://user-images.githubusercontent.com/13428986/126560494-e48eba62-be84-4b29-8c90-a7f6f40b1438.png)'
zh: '![https://user-images.githubusercontent.com/13428986/126560494-e48eba62-be84-4b29-8c90-a7f6f40b1438.png](../Images/66f87d1f09778a82307fefa72409569c.png)'
- en: 'Saving Intermediate Results: What not to do'
id: totrans-30
prefs:
......@@ -207,11 +207,11 @@
prefs: []
type: TYPE_PRE
zh: '[PRE7]'
- en: '[![https://user-images.githubusercontent.com/13428986/126565889-13992f01-55bc-411a-8aee-05b721fe064a.png](../Images/c57a22a13ed99e177d45732c5bcc36ff.png)](https://user-images.githubusercontent.com/13428986/126565889-13992f01-55bc-411a-8aee-05b721fe064a.png)'
- en: '![https://user-images.githubusercontent.com/13428986/126565889-13992f01-55bc-411a-8aee-05b721fe064a.png](../Images/c57a22a13ed99e177d45732c5bcc36ff.png)'
id: totrans-35
prefs: []
type: TYPE_NORMAL
zh: '[![https://user-images.githubusercontent.com/13428986/126565889-13992f01-55bc-411a-8aee-05b721fe064a.png](../Images/c57a22a13ed99e177d45732c5bcc36ff.png)](https://user-images.githubusercontent.com/13428986/126565889-13992f01-55bc-411a-8aee-05b721fe064a.png)'
zh: '![https://user-images.githubusercontent.com/13428986/126565889-13992f01-55bc-411a-8aee-05b721fe064a.png](../Images/c57a22a13ed99e177d45732c5bcc36ff.png)'
- en: When Backward is not Tracked
id: totrans-36
prefs:
......@@ -242,11 +242,11 @@
prefs: []
type: TYPE_PRE
zh: '[PRE9]'
- en: '[![https://user-images.githubusercontent.com/13428986/126559935-74526b4d-d419-4983-b1f0-a6ee99428531.png](../Images/44368555f30978a287e8a47e0cfff9ee.png)](https://user-images.githubusercontent.com/13428986/126559935-74526b4d-d419-4983-b1f0-a6ee99428531.png)'
- en: '![https://user-images.githubusercontent.com/13428986/126559935-74526b4d-d419-4983-b1f0-a6ee99428531.png](../Images/44368555f30978a287e8a47e0cfff9ee.png)'
id: totrans-41
prefs: []
type: TYPE_NORMAL
zh: '[![https://user-images.githubusercontent.com/13428986/126559935-74526b4d-d419-4983-b1f0-a6ee99428531.png](../Images/44368555f30978a287e8a47e0cfff9ee.png)](https://user-images.githubusercontent.com/13428986/126559935-74526b4d-d419-4983-b1f0-a6ee99428531.png)'
zh: '![https://user-images.githubusercontent.com/13428986/126559935-74526b4d-d419-4983-b1f0-a6ee99428531.png](../Images/44368555f30978a287e8a47e0cfff9ee.png)'
- en: To conclude, whether double backward works for your custom function simply depends
on whether the backward pass can be tracked by autograd. With the first two examples
we show situations where double backward works out of the box. With the third
......
......@@ -317,11 +317,11 @@
- PREF_UL
type: TYPE_NORMAL
zh: 概述
- en: '[![../_static/img/profiler_overview1.png](../Images/7bf5bbd17de6da63afc38b29b8c8f0d2.png)](../_static/img/profiler_overview1.png)'
- en: '![../_static/img/profiler_overview1.png](../Images/7bf5bbd17de6da63afc38b29b8c8f0d2.png)'
id: totrans-53
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_overview1.png](../Images/7bf5bbd17de6da63afc38b29b8c8f0d2.png)](../_static/img/profiler_overview1.png)'
zh: '![../_static/img/profiler_overview1.png](../Images/7bf5bbd17de6da63afc38b29b8c8f0d2.png)'
- en: The overview shows a high-level summary of model performance.
id: totrans-54
prefs: []
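The views in this section are rendered from a trace recorded with the profiler's TensorBoard handler. A minimal sketch with an illustrative model and log directory:

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile, tensorboard_trace_handler

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
inputs = torch.randn(32, 64)

# on_trace_ready writes a .pt.trace.json file that the TensorBoard plugin
# (torch-tb-profiler) renders; "./log/demo" is an illustrative directory.
with profile(
    activities=[ProfilerActivity.CPU],
    on_trace_ready=tensorboard_trace_handler("./log/demo"),
    record_shapes=True,
) as prof:
    model(inputs)

# The same data backs the Operator view: per-op "Self" and "Total" times.
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
```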
......@@ -369,11 +369,11 @@
prefs: []
type: TYPE_NORMAL
zh: 算子视图(Operator view)显示了在主机或设备上执行的每个 PyTorch 算子的性能。
- en: '[![../_static/img/profiler_operator_view.png](../Images/4fae99315367a1998f977b76a2fc6526.png)](../_static/img/profiler_operator_view.png)'
- en: '![../_static/img/profiler_operator_view.png](../Images/4fae99315367a1998f977b76a2fc6526.png)'
id: totrans-62
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_operator_view.png](../Images/4fae99315367a1998f977b76a2fc6526.png)](../_static/img/profiler_operator_view.png)'
zh: '![../_static/img/profiler_operator_view.png](../Images/4fae99315367a1998f977b76a2fc6526.png)'
- en: The “Self” duration does not include its child operators’ time. The “Total”
duration includes its child operators’ time.
id: totrans-63
......@@ -393,22 +393,22 @@
prefs: []
type: TYPE_NORMAL
zh: 单击某个算子的“查看调用堆栈”,将显示具有相同名称但调用堆栈不同的算子。再单击此子表中的“查看调用堆栈”,将显示调用堆栈帧。
- en: '[![../_static/img/profiler_callstack.png](../Images/0d8e7045d34fb23f544d1fdb71ccb79b.png)](../_static/img/profiler_callstack.png)'
- en: '![../_static/img/profiler_callstack.png](../Images/0d8e7045d34fb23f544d1fdb71ccb79b.png)'
id: totrans-66
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_callstack.png](../Images/0d8e7045d34fb23f544d1fdb71ccb79b.png)](../_static/img/profiler_callstack.png)'
zh: '![../_static/img/profiler_callstack.png](../Images/0d8e7045d34fb23f544d1fdb71ccb79b.png)'
- en: If the TensorBoard is launched inside VS Code ([Launch Guide](https://devblogs.microsoft.com/python/python-in-visual-studio-code-february-2021-release/#tensorboard-integration)),
clicking a call stack frame will navigate to the specific code line.
id: totrans-67
prefs: []
type: TYPE_NORMAL
zh: 如果在VS Code中启动了TensorBoard([启动指南](https://devblogs.microsoft.com/python/python-in-visual-studio-code-february-2021-release/#tensorboard-integration)),单击调用堆栈帧将导航到特定的代码行。
- en: '[![../_static/img/profiler_vscode.png](../Images/75f42648d12a47e893905f678287a967.png)](../_static/img/profiler_vscode.png)'
- en: '![../_static/img/profiler_vscode.png](../Images/75f42648d12a47e893905f678287a967.png)'
id: totrans-68
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_vscode.png](../Images/75f42648d12a47e893905f678287a967.png)](../_static/img/profiler_vscode.png)'
zh: '![../_static/img/profiler_vscode.png](../Images/75f42648d12a47e893905f678287a967.png)'
- en: Kernel view
id: totrans-69
prefs:
......@@ -420,11 +420,11 @@
prefs: []
type: TYPE_NORMAL
zh: GPU内核视图显示GPU上花费的所有内核时间。
- en: '[![../_static/img/profiler_kernel_view.png](../Images/5122dd95514210b1325de9e54574173f.png)](../_static/img/profiler_kernel_view.png)'
- en: '![../_static/img/profiler_kernel_view.png](../Images/5122dd95514210b1325de9e54574173f.png)'
id: totrans-71
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_kernel_view.png](../Images/5122dd95514210b1325de9e54574173f.png)](../_static/img/profiler_kernel_view.png)'
zh: '![../_static/img/profiler_kernel_view.png](../Images/5122dd95514210b1325de9e54574173f.png)'
- en: 'Tensor Cores Used: Whether this kernel uses Tensor Cores.'
id: totrans-72
prefs: []
......@@ -458,11 +458,11 @@
prefs: []
type: TYPE_NORMAL
zh: 跟踪视图显示了被分析的算子和 GPU 内核的时间轴。您可以选中某个事件以查看以下详细信息。
- en: '[![../_static/img/profiler_trace_view1.png](../Images/be1bf500afaf7c10bd7f7f8a30fa1ef9.png)](../_static/img/profiler_trace_view1.png)'
- en: '![../_static/img/profiler_trace_view1.png](../Images/be1bf500afaf7c10bd7f7f8a30fa1ef9.png)'
id: totrans-77
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_trace_view1.png](../Images/be1bf500afaf7c10bd7f7f8a30fa1ef9.png)](../_static/img/profiler_trace_view1.png)'
zh: '![../_static/img/profiler_trace_view1.png](../Images/be1bf500afaf7c10bd7f7f8a30fa1ef9.png)'
- en: You can move the graph and zoom in/out with the help of right side toolbar.
And keyboard can also be used to zoom and move around inside the timeline. The
‘w’ and ‘s’ keys zoom in centered around the mouse, and the ‘a’ and ‘d’ keys move
......@@ -478,11 +478,11 @@
prefs: []
type: TYPE_NORMAL
zh: 如果后向算子的“传入流(Incoming Flow)”字段的值为“forward correspond to backward”,则可以单击该文本以跳转到启动它的前向算子。
- en: '[![../_static/img/profiler_trace_view_fwd_bwd.png](../Images/cb82608044c7382139065f9e79f1a99d.png)](../_static/img/profiler_trace_view_fwd_bwd.png)'
- en: '![../_static/img/profiler_trace_view_fwd_bwd.png](../Images/cb82608044c7382139065f9e79f1a99d.png)'
id: totrans-80
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_trace_view_fwd_bwd.png](../Images/cb82608044c7382139065f9e79f1a99d.png)](../_static/img/profiler_trace_view_fwd_bwd.png)'
zh: '![../_static/img/profiler_trace_view_fwd_bwd.png](../Images/cb82608044c7382139065f9e79f1a99d.png)'
- en: In this example, we can see the event prefixed with `enumerate(DataLoader)`
costs a lot of time. And during most of this period, the GPU is idle. Because
this function is loading data and transforming data on host side, during which
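The usual remedy applied in this tutorial is to let worker processes prefetch batches so host-side loading and transformation overlap with compute; a sketch with an illustrative in-memory dataset (`num_workers=2` is just an example value):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(
    torch.randn(256, 3, 8, 8), torch.randint(0, 10, (256,))
)
# Worker processes load and transform upcoming batches while the current one
# is consumed; pin_memory speeds up host-to-GPU copies when CUDA is in use.
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=2,
    pin_memory=torch.cuda.is_available(),
)
for images, labels in loader:
    pass  # the training step would go here
```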
......@@ -523,22 +523,22 @@
prefs: []
type: TYPE_NORMAL
zh: 然后在左侧的“Runs”下拉列表中选择最近分析的运行。
- en: '[![../_static/img/profiler_overview2.png](../Images/837967744e5997b8debc071b27685596.png)](../_static/img/profiler_overview2.png)'
- en: '![../_static/img/profiler_overview2.png](../Images/837967744e5997b8debc071b27685596.png)'
id: totrans-87
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_overview2.png](../Images/837967744e5997b8debc071b27685596.png)](../_static/img/profiler_overview2.png)'
zh: '![../_static/img/profiler_overview2.png](../Images/837967744e5997b8debc071b27685596.png)'
- en: From the above view, we can find the step time is reduced to about 76ms comparing
with previous run’s 132ms, and the time reduction of `DataLoader` mainly contributes.
id: totrans-88
prefs: []
type: TYPE_NORMAL
zh: 从上述视图中可以看到,步骤时间从之前运行的 132ms 减少到约 76ms,其中主要得益于 `DataLoader` 耗时的减少。
- en: '[![../_static/img/profiler_trace_view2.png](../Images/9126a2827ef47b32d4dd38a1e813505e.png)](../_static/img/profiler_trace_view2.png)'
- en: '![../_static/img/profiler_trace_view2.png](../Images/9126a2827ef47b32d4dd38a1e813505e.png)'
id: totrans-89
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_trace_view2.png](../Images/9126a2827ef47b32d4dd38a1e813505e.png)](../_static/img/profiler_trace_view2.png)'
zh: '![../_static/img/profiler_trace_view2.png](../Images/9126a2827ef47b32d4dd38a1e813505e.png)'
- en: From the above view, we can see that the runtime of `enumerate(DataLoader)`
is reduced, and the GPU utilization is increased.
id: totrans-90
......@@ -579,11 +579,11 @@
prefs: []
type: TYPE_NORMAL
zh: 分析器在分析过程中记录所有内存分配/释放事件和分配器的内部状态。内存视图由以下三个组件组成。
- en: '[![../_static/img/profiler_memory_view.png](../Images/c6251499e3b25e142059d0e53c1c3007.png)](../_static/img/profiler_memory_view.png)'
- en: '![../_static/img/profiler_memory_view.png](../Images/c6251499e3b25e142059d0e53c1c3007.png)'
id: totrans-97
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_memory_view.png](../Images/c6251499e3b25e142059d0e53c1c3007.png)](../_static/img/profiler_memory_view.png)'
zh: '![../_static/img/profiler_memory_view.png](../Images/c6251499e3b25e142059d0e53c1c3007.png)'
- en: The components are memory curve graph, memory events table and memory statistics
table, from top to bottom, respectively.
id: totrans-98
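The memory view is only populated when the profiler is asked to record allocator events; a sketch using a matmul as a stand-in workload:

```python
import torch
from torch.profiler import ProfilerActivity, profile

# profile_memory=True records allocation/release events and the allocator's
# internal state, which is what the memory view visualizes.
with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    a = torch.randn(512, 512)
    b = a @ a

events = prof.key_averages()
# Per-op self CPU memory; aten::empty commonly appears as the allocating op.
total_cpu_mem = sum(e.cpu_memory_usage for e in events)
```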
......@@ -606,11 +606,11 @@
prefs: []
type: TYPE_NORMAL
zh: 内存曲线显示内存消耗的趋势。“已分配”曲线显示实际使用的总内存,例如张量。在PyTorch中,CUDA分配器和一些其他分配器采用了缓存机制。“保留”曲线显示分配器保留的总内存。您可以在图表上左键单击并拖动以选择所需范围内的事件:
- en: '[![../_static/img/profiler_memory_curve_selecting.png](../Images/e9ec73bd94cda9e0afe2f7d66988efb3.png)](../_static/img/profiler_memory_curve_selecting.png)'
- en: '![../_static/img/profiler_memory_curve_selecting.png](../Images/e9ec73bd94cda9e0afe2f7d66988efb3.png)'
id: totrans-101
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_memory_curve_selecting.png](../Images/e9ec73bd94cda9e0afe2f7d66988efb3.png)](../_static/img/profiler_memory_curve_selecting.png)'
zh: '![../_static/img/profiler_memory_curve_selecting.png](../Images/e9ec73bd94cda9e0afe2f7d66988efb3.png)'
- en: After selection, the three components will be updated for the restricted time
range, so that you can gain more information about it. By repeating this process,
you can zoom into a very fine-grained detail. Right click on the graph will reset
......@@ -619,11 +619,11 @@
prefs: []
type: TYPE_NORMAL
zh: 选择后,这三个组件将针对受限时间范围进行更新,以便您可以获取更多信息。通过重复这个过程,您可以深入了解非常细微的细节。右键单击图表将重置图表到初始状态。
- en: '[![../_static/img/profiler_memory_curve_single.png](../Images/b34a9076e55573e9c29e772fd4fc8238.png)](../_static/img/profiler_memory_curve_single.png)'
- en: '![../_static/img/profiler_memory_curve_single.png](../Images/b34a9076e55573e9c29e772fd4fc8238.png)'
id: totrans-103
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_memory_curve_single.png](../Images/b34a9076e55573e9c29e772fd4fc8238.png)](../_static/img/profiler_memory_curve_single.png)'
zh: '![../_static/img/profiler_memory_curve_single.png](../Images/b34a9076e55573e9c29e772fd4fc8238.png)'
- en: In the memory events table, the allocation and release events are paired into
one entry. The “operator” column shows the immediate ATen operator that is causing
the allocation. Notice that in PyTorch, ATen operators commonly use `aten::empty`
......@@ -672,11 +672,11 @@
prefs: []
type: TYPE_PRE
zh: '[PRE12]'
- en: '[![../_static/img/profiler_distributed_view.png](../Images/bc5ec09af445c3714c07c9bc3c7fb515.png)](../_static/img/profiler_distributed_view.png)'
- en: '![../_static/img/profiler_distributed_view.png](../Images/bc5ec09af445c3714c07c9bc3c7fb515.png)'
id: totrans-110
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_distributed_view.png](../Images/bc5ec09af445c3714c07c9bc3c7fb515.png)](../_static/img/profiler_distributed_view.png)'
zh: '![../_static/img/profiler_distributed_view.png](../Images/bc5ec09af445c3714c07c9bc3c7fb515.png)'
- en: The “Computation/Communication Overview” shows computation/communication ratio
and their overlapping degree. From this view, User can figure out load balance
issue among workers. For example, if the computation + overlapping time of one
......@@ -810,11 +810,11 @@
prefs: []
type: TYPE_NORMAL
zh: 选择不同的视图,如**步骤4**中所述。例如,下面是**算子**视图:
- en: '[![../_static/img/profiler_rocm_tensorboard_operartor_view.png](../Images/766def45c853a562ade085a166bc7a98.png)](../_static/img/profiler_rocm_tensorboard_operartor_view.png)'
- en: '![../_static/img/profiler_rocm_tensorboard_operartor_view.png](../Images/766def45c853a562ade085a166bc7a98.png)'
id: totrans-133
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_rocm_tensorboard_operartor_view.png](../Images/766def45c853a562ade085a166bc7a98.png)](../_static/img/profiler_rocm_tensorboard_operartor_view.png)'
zh: '![../_static/img/profiler_rocm_tensorboard_operartor_view.png](../Images/766def45c853a562ade085a166bc7a98.png)'
- en: At the time this section is written, **Trace** view does not work and it displays
nothing. You can work around by typing `chrome://tracing` in your Chrome Browser.
id: totrans-134
......@@ -841,11 +841,11 @@
- PREF_UL
type: TYPE_NORMAL
zh: 点击**加载**按钮,从浏览器中的`chrome://tracing`页面加载跟踪JSON文件。
- en: '[![../_static/img/profiler_rocm_chrome_trace_view.png](../Images/576f0fdbe384c09bd227cc973cbf6ecd.png)](../_static/img/profiler_rocm_chrome_trace_view.png)'
- en: '![../_static/img/profiler_rocm_chrome_trace_view.png](../Images/576f0fdbe384c09bd227cc973cbf6ecd.png)'
id: totrans-138
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_rocm_chrome_trace_view.png](../Images/576f0fdbe384c09bd227cc973cbf6ecd.png)](../_static/img/profiler_rocm_chrome_trace_view.png)'
zh: '![../_static/img/profiler_rocm_chrome_trace_view.png](../Images/576f0fdbe384c09bd227cc973cbf6ecd.png)'
- en: As mentioned previously, you can move the graph and zoom in and out. You can
also use keyboard to zoom and move around inside the timeline. The `w` and `s`
keys zoom in centered around the mouse, and the `a` and `d` keys move the timeline
......
......@@ -35,11 +35,11 @@
type: TYPE_NORMAL
zh: 在本教程中,我们将通过[英特尔® PyTorch*扩展启动器](https://github.com/intel/intel-extension-for-pytorch/blob/master/docs/tutorials/performance_tuning/launch_script.md)演示如何通过内存分配器提高性能,并通过[英特尔®
PyTorch*扩展](https://github.com/intel/intel-extension-for-pytorch)在CPU上优化内核,并将它们应用于TorchServe,展示ResNet50的吞吐量提升了7.71倍,BERT的吞吐量提升了2.20倍。
- en: '[![../_images/1.png](../Images/74cc44a62474337c4fc6d0bc99098db9.png)](../_images/1.png)'
- en: '![../_images/1.png](../Images/74cc44a62474337c4fc6d0bc99098db9.png)'
id: totrans-5
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/1.png](../Images/74cc44a62474337c4fc6d0bc99098db9.png)](../_images/1.png)'
zh: '![../_images/1.png](../Images/74cc44a62474337c4fc6d0bc99098db9.png)'
- en: Prerequisites
id: totrans-6
prefs:
......@@ -81,11 +81,11 @@
prefs: []
type: TYPE_NORMAL
zh: 在调整CPU以获得最佳性能时,了解瓶颈所在是很有用的。大多数CPU核心都有芯片上的性能监控单元(PMUs)。PMUs是CPU核心内的专用逻辑单元,用于计算系统上发生的特定硬件事件。这些事件的示例可能是缓存未命中或分支误预测。PMUs用于自顶向下的微体系结构分析(TMA)以识别瓶颈。TMA包括如下层次结构:
- en: '[![../_images/26.png](../Images/cd7487204a4dc972818b86076a766477.png)](../_images/26.png)'
- en: '![../_images/26.png](../Images/cd7487204a4dc972818b86076a766477.png)'
id: totrans-11
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/26.png](../Images/cd7487204a4dc972818b86076a766477.png)](../_images/26.png)'
zh: '![../_images/26.png](../Images/cd7487204a4dc972818b86076a766477.png)'
- en: 'The top level, level-1, metrics collect *Retiring*, *Bad Speculation*, *Front
End Bound*, *Back End Bound*. The pipeline of CPU can conceptually be simplified
and divided into two: the frontend and the backend. The *frontend* is responsible
......@@ -280,11 +280,11 @@
prefs: []
type: TYPE_NORMAL
zh: 让我们收集一级 TMA 指标。
- en: '[![../_images/32.png](../Images/92b08c13c370b320e540a278fa5c05a3.png)](../_images/32.png)'
- en: '![../_images/32.png](../Images/92b08c13c370b320e540a278fa5c05a3.png)'
id: totrans-37
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/32.png](../Images/92b08c13c370b320e540a278fa5c05a3.png)](../_images/32.png)'
zh: '![../_images/32.png](../Images/92b08c13c370b320e540a278fa5c05a3.png)'
- en: Level-1 TMA shows that both PTMalloc and JeMalloc are bounded by the backend.
More than half of the execution time was stalled by the backend. Let’s go one
level deeper.
......@@ -292,22 +292,22 @@
prefs: []
type: TYPE_NORMAL
zh: 一级 TMA 显示 PTMalloc 和 JeMalloc 都受后端限制。超过一半的执行时间被后端阻塞。让我们再深入一层。
- en: '[![../_images/41.png](../Images/4c4e1618b6431d1689b0e0c84efef1d3.png)](../_images/41.png)'
- en: '![../_images/41.png](../Images/4c4e1618b6431d1689b0e0c84efef1d3.png)'
id: totrans-39
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/41.png](../Images/4c4e1618b6431d1689b0e0c84efef1d3.png)](../_images/41.png)'
zh: '![../_images/41.png](../Images/4c4e1618b6431d1689b0e0c84efef1d3.png)'
- en: Level-2 TMA shows that the Back End Bound was caused by Memory Bound. Let’s
go one level deeper.
id: totrans-40
prefs: []
type: TYPE_NORMAL
zh: 二级 TMA 显示后端受限是由内存受限引起的。让我们再深入一层。
- en: '[![../_images/51.png](../Images/20c7abc0c93bd57ad7a3dd3655c2f294.png)](../_images/51.png)'
- en: '![../_images/51.png](../Images/20c7abc0c93bd57ad7a3dd3655c2f294.png)'
id: totrans-41
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/51.png](../Images/20c7abc0c93bd57ad7a3dd3655c2f294.png)](../_images/51.png)'
zh: '![../_images/51.png](../Images/20c7abc0c93bd57ad7a3dd3655c2f294.png)'
- en: Most of the metrics under the Memory Bound identify which level of the memory
hierarchy from the L1 cache to main memory is the bottleneck. A hotspot bounded
at a given level indicates that most of the data was being retrieved from that
......@@ -326,11 +326,11 @@
prefs: []
type: TYPE_NORMAL
zh: 让我们看看 Intel® VTune Profiler ITT 跟踪。在示例脚本中,我们已经注释了推理循环的每个 *step_x*。
- en: '[![../_images/61.png](../Images/ecb098b61b1ff2137e4d3a18ab854273.png)](../_images/61.png)'
- en: '![../_images/61.png](../Images/ecb098b61b1ff2137e4d3a18ab854273.png)'
id: totrans-44
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/61.png](../Images/ecb098b61b1ff2137e4d3a18ab854273.png)](../_images/61.png)'
zh: '![../_images/61.png](../Images/ecb098b61b1ff2137e4d3a18ab854273.png)'
- en: Each step is traced in the timeline graph. The duration of model inference on
the last step (step_99) decreased from 304.308 ms to 261.843 ms.
id: totrans-45
......@@ -390,21 +390,21 @@
prefs: []
type: TYPE_NORMAL
zh: 让我们收集一级 TMA 指标。
- en: '[![../_images/71.png](../Images/5f87dd66115567eb48eaa61c41e17b25.png)](../_images/71.png)'
- en: '![../_images/71.png](../Images/5f87dd66115567eb48eaa61c41e17b25.png)'
id: totrans-55
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/71.png](../Images/5f87dd66115567eb48eaa61c41e17b25.png)](../_images/71.png)'
zh: '![../_images/71.png](../Images/5f87dd66115567eb48eaa61c41e17b25.png)'
- en: Let’s go one level deeper.
id: totrans-56
prefs: []
type: TYPE_NORMAL
zh: 让我们再深入一层。
- en: '[![../_images/81.png](../Images/23960c03e87ed20b8dc0954e8d420e20.png)](../_images/81.png)'
- en: '![../_images/81.png](../Images/23960c03e87ed20b8dc0954e8d420e20.png)'
id: totrans-57
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/81.png](../Images/23960c03e87ed20b8dc0954e8d420e20.png)](../_images/81.png)'
zh: '![../_images/81.png](../Images/23960c03e87ed20b8dc0954e8d420e20.png)'
- en: Let’s use Intel® VTune Profiler ITT to annotate [TorchServe inference scope](https://github.com/pytorch/serve/blob/master/ts/torch_handler/base_handler.py#L188)
to profile at inference-level granularity. As [TorchServe Architecture](https://github.com/pytorch/serve/blob/master/docs/internals.md#torchserve-architecture)
consists of several sub-components, including the Java frontend for handling request/response,
......@@ -417,22 +417,22 @@
进行注释,以便以推理级别的粒度进行分析。由于 [TorchServe 架构](https://github.com/pytorch/serve/blob/master/docs/internals.md#torchserve-architecture)
包括几个子组件,包括用于处理请求/响应的 Java 前端和用于在模型上运行实际推理的 Python 后端,因此使用 Intel® VTune Profiler
ITT 限制在推理级别收集跟踪数据是有帮助的。
- en: '[![../_images/9.png](../Images/503e0ca767cc22bc9c9fcad2ebb78311.png)](../_images/9.png)'
- en: '![../_images/9.png](../Images/503e0ca767cc22bc9c9fcad2ebb78311.png)'
id: totrans-59
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/9.png](../Images/503e0ca767cc22bc9c9fcad2ebb78311.png)](../_images/9.png)'
zh: '![../_images/9.png](../Images/503e0ca767cc22bc9c9fcad2ebb78311.png)'
- en: Each inference call is traced in the timeline graph. The duration of the last
model inference decreased from 561.688 ms to 251.287 ms - 2.2x speedup.
id: totrans-60
prefs: []
type: TYPE_NORMAL
zh: 每个推理调用都在时间线图中被跟踪。最后一次模型推理的持续时间从 561.688 毫秒减少到 251.287 毫秒,加速 2.2 倍。
- en: '[![../_images/101.png](../Images/b028bfe554248a98ae3e2a0d6250a5f4.png)](../_images/101.png)'
- en: '![../_images/101.png](../Images/b028bfe554248a98ae3e2a0d6250a5f4.png)'
id: totrans-61
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/101.png](../Images/b028bfe554248a98ae3e2a0d6250a5f4.png)](../_images/101.png)'
zh: '![../_images/101.png](../Images/b028bfe554248a98ae3e2a0d6250a5f4.png)'
- en: The timeline graph can be expanded to see op-level profiling results. The duration
of *aten::conv2d* decreased from 16.401 ms to 6.392 ms - 2.6x speedup.
id: totrans-62
......@@ -594,21 +594,21 @@
prefs: []
type: TYPE_NORMAL
zh: 该模型由两个操作组成——Conv2d和ReLU。通过打印模型对象,我们得到以下输出。
- en: '[![../_images/11.png](../Images/80104f40ec5b9cc463c39342ea6908a7.png)](../_images/11.png)'
- en: '![../_images/11.png](../Images/80104f40ec5b9cc463c39342ea6908a7.png)'
id: totrans-89
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/11.png](../Images/80104f40ec5b9cc463c39342ea6908a7.png)](../_images/11.png)'
zh: '![../_images/11.png](../Images/80104f40ec5b9cc463c39342ea6908a7.png)'
- en: Let’s collect level-1 TMA metrics.
id: totrans-90
prefs: []
type: TYPE_NORMAL
zh: 让我们收集一级TMA指标。
- en: '[![../_images/121.png](../Images/81c6ba8b688066ce19f4d4a274996485.png)](../_images/121.png)'
- en: '![../_images/121.png](../Images/81c6ba8b688066ce19f4d4a274996485.png)'
id: totrans-91
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/121.png](../Images/81c6ba8b688066ce19f4d4a274996485.png)](../_images/121.png)'
zh: '![../_images/121.png](../Images/81c6ba8b688066ce19f4d4a274996485.png)'
- en: Notice the Back End Bound reduced from 68.9 to 38.5 – 1.8x speedup.
id: totrans-92
prefs: []
......@@ -619,11 +619,11 @@
prefs: []
type: TYPE_NORMAL
zh: 此外,让我们使用PyTorch Profiler进行性能分析。
- en: '[![../_images/131.png](../Images/42c2d372d466f39f3cd12d8c9260c4d1.png)](../_images/131.png)'
- en: '![../_images/131.png](../Images/42c2d372d466f39f3cd12d8c9260c4d1.png)'
id: totrans-94
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/131.png](../Images/42c2d372d466f39f3cd12d8c9260c4d1.png)](../_images/131.png)'
zh: '![../_images/131.png](../Images/42c2d372d466f39f3cd12d8c9260c4d1.png)'
- en: Notice the CPU time reduced from 851 us to 310 us – 2.7X speedup.
id: totrans-95
prefs: []
......@@ -679,11 +679,11 @@
prefs: []
type: TYPE_NORMAL
zh: 让我们收集一级TMA指标。
- en: '[![../_images/141.png](../Images/6d467b7f5180ed2749ec54aa4196fab0.png)](../_images/141.png)'
- en: '![../_images/141.png](../Images/6d467b7f5180ed2749ec54aa4196fab0.png)'
id: totrans-103
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/141.png](../Images/6d467b7f5180ed2749ec54aa4196fab0.png)](../_images/141.png)'
zh: '![../_images/141.png](../Images/6d467b7f5180ed2749ec54aa4196fab0.png)'
- en: Notice the Back End Bound reduced from 67.1 to 37.5 – 1.8x speedup.
id: totrans-104
prefs: []
......@@ -694,11 +694,11 @@
prefs: []
type: TYPE_NORMAL
zh: 此外,让我们使用PyTorch Profiler进行性能分析。
- en: '[![../_images/151.png](../Images/4d256b81c69edc40a7d9551d270c1b48.png)](../_images/151.png)'
- en: '![../_images/151.png](../Images/4d256b81c69edc40a7d9551d270c1b48.png)'
id: totrans-106
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/151.png](../Images/4d256b81c69edc40a7d9551d270c1b48.png)](../_images/151.png)'
zh: '![../_images/151.png](../Images/4d256b81c69edc40a7d9551d270c1b48.png)'
- en: 'Notice that with Intel® Extension for PyTorch* Conv + ReLU operators are fused,
and the CPU time reduced from 803 us to 248 us – 3.2X speedup. The oneDNN eltwise
post-op enables fusing a primitive with an elementwise primitive. This is one
......@@ -765,22 +765,22 @@
prefs: []
type: TYPE_NORMAL
zh: 我们将使用[oneDNN详细模式](https://oneapi-src.github.io/oneDNN/dev_guide_verbose.html),这是一个帮助收集有关oneDNN图级别信息的工具,例如操作融合、执行oneDNN原语所花费的内核执行时间。有关更多信息,请参考[oneDNN文档](https://oneapi-src.github.io/oneDNN/index.html)。
- en: '[![../_images/161.png](../Images/52018fb59ebe653af37a7e977764e699.png)](../_images/161.png)[![../_images/171.png](../Images/466f4685aff041f4d0f735aaae4593d5.png)](../_images/171.png)'
- en: '![../_images/161.png](../Images/52018fb59ebe653af37a7e977764e699.png)![../_images/171.png](../Images/466f4685aff041f4d0f735aaae4593d5.png)'
id: totrans-115
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/161.png](../Images/52018fb59ebe653af37a7e977764e699.png)](../_images/161.png)[![../_images/171.png](../Images/466f4685aff041f4d0f735aaae4593d5.png)](../_images/171.png)'
zh: '![../_images/161.png](../Images/52018fb59ebe653af37a7e977764e699.png)![../_images/171.png](../Images/466f4685aff041f4d0f735aaae4593d5.png)'
- en: Above is the oneDNN verbose output from the channels-first run. We can verify
that the weights and data are reordered, the computation is performed, and the output
is finally reordered back.
id: totrans-116
prefs: []
type: TYPE_NORMAL
zh: 以上是来自通道首的oneDNN详细信息。我们可以验证从权重和数据进行重新排序,然后进行计算,最后将输出重新排序。
- en: '[![../_images/181.png](../Images/37fca356cd849c6f51a0b2d155565535.png)](../_images/181.png)'
- en: '![../_images/181.png](../Images/37fca356cd849c6f51a0b2d155565535.png)'
id: totrans-117
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/181.png](../Images/37fca356cd849c6f51a0b2d155565535.png)](../_images/181.png)'
zh: '![../_images/181.png](../Images/37fca356cd849c6f51a0b2d155565535.png)'
- en: Above is the oneDNN verbose output from the channels-last run. We can verify
that the channels-last memory format avoids unnecessary reorders.
id: totrans-118
......@@ -799,11 +799,11 @@
prefs: []
type: TYPE_NORMAL
zh: 以下总结了TorchServe与Intel® Extension for PyTorch*在ResNet50和BERT-base-uncased上的性能提升。
- en: '[![../_images/191.png](../Images/70692f21eb61fe41d0b7a67d8ae8a54d.png)](../_images/191.png)'
- en: '![../_images/191.png](../Images/70692f21eb61fe41d0b7a67d8ae8a54d.png)'
id: totrans-121
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/191.png](../Images/70692f21eb61fe41d0b7a67d8ae8a54d.png)](../_images/191.png)'
zh: '![../_images/191.png](../Images/70692f21eb61fe41d0b7a67d8ae8a54d.png)'
- en: Exercise with TorchServe
id: totrans-122
prefs:
......@@ -839,11 +839,11 @@
prefs: []
type: TYPE_NORMAL
zh: 让我们收集一级TMA指标。
- en: '[![../_images/20.png](../Images/90ff6d2c5ef9b558de384a1a802a7781.png)](../_images/20.png)'
- en: '![../_images/20.png](../Images/90ff6d2c5ef9b558de384a1a802a7781.png)'
id: totrans-128
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/20.png](../Images/90ff6d2c5ef9b558de384a1a802a7781.png)](../_images/20.png)'
zh: '![../_images/20.png](../Images/90ff6d2c5ef9b558de384a1a802a7781.png)'
- en: Level-1 TMA shows that both are bounded by the backend. As discussed earlier,
the majority of untuned deep learning workloads will be Back End Bound. Notice
the Back End Bound reduced from 70.0 to 54.1. Let’s go one level deeper.
......@@ -851,11 +851,11 @@
prefs: []
type: TYPE_NORMAL
zh: Level-1 TMA 显示两者都受到后端的限制。正如之前讨论的,大多数未调整的深度学习工作负载将受到后端的限制。注意后端限制从70.0降至54.1。让我们再深入一层。
- en: '[![../_images/211.png](../Images/7986245e21288e28a0d187026c929c7d.png)](../_images/211.png)'
- en: '![../_images/211.png](../Images/7986245e21288e28a0d187026c929c7d.png)'
id: totrans-130
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/211.png](../Images/7986245e21288e28a0d187026c929c7d.png)](../_images/211.png)'
zh: '![../_images/211.png](../Images/7986245e21288e28a0d187026c929c7d.png)'
- en: As discussed earlier, Back End Bound has two submetrics – Memory Bound and Core
Bound. Memory Bound indicates the workload is under-optimized or under-utilized,
and ideally memory-bound operations can be improved to core-bound by optimizing
......@@ -866,11 +866,11 @@
type: TYPE_NORMAL
zh: 如前所述,后端绑定有两个子指标 - 内存绑定和核心绑定。内存绑定表示工作负载未经优化或未充分利用,理想情况下,内存绑定操作可以通过优化OPs和改善缓存局部性来改善为核心绑定。Level-2
TMA显示后端绑定从内存绑定改善为核心绑定。让我们深入一层。
- en: '[![../_images/221.png](../Images/694c98433e22e56db7436b3cc54c153a.png)](../_images/221.png)'
- en: '![../_images/221.png](../Images/694c98433e22e56db7436b3cc54c153a.png)'
id: totrans-132
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/221.png](../Images/694c98433e22e56db7436b3cc54c153a.png)](../_images/221.png)'
zh: '![../_images/221.png](../Images/694c98433e22e56db7436b3cc54c153a.png)'
- en: Scaling deep learning models for production on a model serving framework like
TorchServe requires high compute utilization. This requires that data is available
through prefetching and reusing the data in cache when the execution units need
......@@ -888,22 +888,22 @@
prefs: []
type: TYPE_NORMAL
zh: 与TorchServe之前的练习一样,让我们使用Intel® VTune Profiler ITT来注释[TorchServe推断范围](https://github.com/pytorch/serve/blob/master/ts/torch_handler/base_handler.py#L188),以便以推断级别的粒度进行分析。
- en: '[![../_images/231.png](../Images/7bac8eb911646b9101914663163d1958.png)](../_images/231.png)'
- en: '![../_images/231.png](../Images/7bac8eb911646b9101914663163d1958.png)'
id: totrans-135
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/231.png](../Images/7bac8eb911646b9101914663163d1958.png)](../_images/231.png)'
zh: '![../_images/231.png](../Images/7bac8eb911646b9101914663163d1958.png)'
- en: Each inference call is traced in the timeline graph. The duration of the last
inference call decreased from 215.731 ms to 95.634 ms - 2.3x speedup.
id: totrans-136
prefs: []
type: TYPE_NORMAL
zh: 时间轴图中跟踪了每个推断调用。最后一个推断调用的持续时间从215.731毫秒减少到95.634毫秒 - 2.3倍加速。
- en: '[![../_images/241.png](../Images/6e43f3a1c63c0ea0223edb29e584267d.png)](../_images/241.png)'
- en: '![../_images/241.png](../Images/6e43f3a1c63c0ea0223edb29e584267d.png)'
id: totrans-137
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/241.png](../Images/6e43f3a1c63c0ea0223edb29e584267d.png)](../_images/241.png)'
zh: '![../_images/241.png](../Images/6e43f3a1c63c0ea0223edb29e584267d.png)'
- en: The timeline graph can be expanded to see op-level profiling results. Notice
that Conv + ReLU has been fused, and the duration decreased from 6.393 ms + 1.731
ms to 3.408 ms - 2.4x speedup.
......
......@@ -477,14 +477,16 @@
prefs: []
type: TYPE_NORMAL
zh: 随意调整控制softmax函数软度和损失系数的温度参数。在神经网络中,很容易包含额外的损失函数到主要目标中,以实现更好的泛化。让我们尝试为学生包含一个目标,但现在让我们专注于他们的隐藏状态而不是输出层。我们的目标是通过包含一个天真的损失函数,使得随着损失的减少,传递给分类器的后续展平向量变得更加“相似”,从而将信息从教师的表示传达给学生。当然,教师不会更新其权重,因此最小化仅取决于学生的权重。这种方法背后的理念是,我们假设教师模型具有更好的内部表示,学生不太可能在没有外部干预的情况下实现,因此我们人为地推动学生模仿教师的内部表示。这是否最终会帮助学生并不明显,因为推动轻量级网络达到这一点可能是一件好事,假设我们已经找到了导致更好测试准确性的内部表示,但也可能是有害的,因为网络具有不同的架构,学生没有与教师相同的学习能力。换句话说,没有理由要求这两个向量,学生的和教师的,每个分量都匹配。学生可能达到教师的一个排列的内部表示,这样同样有效。尽管如此,我们仍然可以运行一个快速实验来了解这种方法的影响。我们将使用`CosineEmbeddingLoss`,其公式如下:
- en: '[![../_static/img/knowledge_distillation/cosine_embedding_loss.png](../Images/cdd423a58df099c1510863f187b76089.png)](../_static/img/knowledge_distillation/cosine_embedding_loss.png)'
- en: '![../_static/img/knowledge_distillation/cosine_embedding_loss.png](../Images/cdd423a58df099c1510863f187b76089.png)'
id: totrans-71
prefs: []
type: TYPE_NORMAL
zh: '![../_static/img/knowledge_distillation/cosine_embedding_loss.png](../Images/cdd423a58df099c1510863f187b76089.png)'
- en: Formula for CosineEmbeddingLoss
id: totrans-72
prefs: []
type: TYPE_NORMAL
zh: CosineEmbeddingLoss的公式
- en: Obviously, there is one thing that we need to resolve first. When we applied
distillation to the output layer we mentioned that both networks have the same
number of neurons, equal to the number of classes. However, this is not the case
......@@ -497,6 +499,7 @@
id: totrans-73
prefs: []
type: TYPE_NORMAL
zh: 显然,我们首先需要解决一件事情。当我们将蒸馏应用于输出层时,我们提到两个网络具有相同数量的神经元,等于类的数量。然而,在跟随我们的卷积层之后的层中并非如此。在这里,老师在最终卷积层展平后拥有比学生更多的神经元。我们的损失函数接受两个相同维度的向量作为输入,因此我们需要以某种方式将它们匹配。我们将通过在老师的卷积层后包含一个平均池化层来解决这个问题,以减少其维度以匹配学生的维度。
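A minimal sketch of this dimension matching plus the cosine objective (all tensor sizes here are illustrative assumptions, not taken from the tutorial's models):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: the teacher's flattened hidden state (2048) is larger
# than the student's (1024); both values are illustrative.
teacher_feats = torch.randn(8, 2048)
student_feats = torch.randn(8, 1024)

# Average pooling shrinks the teacher's vector to the student's dimension.
pool = nn.AvgPool1d(kernel_size=2, stride=2)
teacher_pooled = pool(teacher_feats.unsqueeze(1)).squeeze(1)  # shape (8, 1024)

# target = 1 asks the loss to drive the cosine similarity of each pair to 1.
cosine_loss = nn.CosineEmbeddingLoss()
target = torch.ones(teacher_feats.size(0))
# detach() ensures the minimization depends only on the student's weights.
loss = cosine_loss(student_feats, teacher_pooled.detach(), target)
```

In the real training loop the two feature tensors would come from the networks' forward passes rather than `randn`.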
- en: To proceed, we will modify our model classes, or create new ones. Now, the forward
function returns not only the logits of the network but also the flattened hidden
representation after the convolutional layer. We include the aforementioned pooling
......
- en: Parallel and Distributed Training
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 并行和分布式训练
- en: Distributed and Parallel Training Tutorials
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 分布式和并行训练教程
- en: 原文:[https://pytorch.org/tutorials/distributed/home.html](https://pytorch.org/tutorials/distributed/home.html)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
zh: 原文:[https://pytorch.org/tutorials/distributed/home.html](https://pytorch.org/tutorials/distributed/home.html)
- en: Distributed training is a model training paradigm that involves spreading training
workload across multiple worker nodes, therefore significantly improving the speed
of training and model accuracy. While distributed training can be used for any
type of ML model training, it is most beneficial to use it for large models and
compute demanding tasks as deep learning.
id: totrans-2
prefs: []
type: TYPE_NORMAL
zh: 分布式训练是一种模型训练范式,涉及将训练工作负载分布到多个工作节点,从而显著提高训练速度和模型准确性。虽然分布式训练可用于任何类型的ML模型训练,但对于大型模型和计算密集型任务(如深度学习)使用它最为有益。
- en: 'There are a few ways you can perform distributed training in PyTorch with each
method having their advantages in certain use cases:'
id: totrans-3
prefs: []
type: TYPE_NORMAL
zh: 在PyTorch中有几种方法可以进行分布式训练,每种方法在特定用例中都有其优势:
- en: '[DistributedDataParallel (DDP)](#learn-ddp)'
id: totrans-4
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[DistributedDataParallel (DDP)](#learn-ddp)'
- en: '[Fully Sharded Data Parallel (FSDP)](#learn-fsdp)'
id: totrans-5
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[完全分片数据并行(FSDP)](#learn-fsdp)'
- en: '[Device Mesh](#device-mesh)'
id: totrans-6
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[设备网格](#device-mesh)'
- en: '[Remote Procedure Call (RPC) distributed training](#learn-rpc)'
id: totrans-7
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[远程过程调用(RPC)分布式训练](#learn-rpc)'
- en: '[Custom Extensions](#custom-extensions)'
id: totrans-8
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[自定义扩展](#custom-extensions)'
- en: Read more about these options in [Distributed Overview](../beginner/dist_overview.html).
id: totrans-9
prefs: []
type: TYPE_NORMAL
zh: 在[分布式概述](../beginner/dist_overview.html)中了解更多关于这些选项的信息。
- en: '## Learn DDP'
id: totrans-10
prefs: []
type: TYPE_NORMAL
zh: '## 学习DDP'
- en: DDP Intro Video Tutorials
id: totrans-11
prefs: []
type: TYPE_NORMAL
zh: DDP简介视频教程
- en: A step-by-step video series on how to get started with DistributedDataParallel
and advance to more complex topics
id: totrans-12
prefs: []
type: TYPE_NORMAL
zh: 一系列逐步视频教程,介绍如何开始使用DistributedDataParallel,并逐步深入更复杂的主题
- en: Code Video
id: totrans-13
prefs: []
type: TYPE_NORMAL
zh: 代码视频
- en: Getting Started with Distributed Data Parallel
id: totrans-14
prefs: []
type: TYPE_NORMAL
zh: 开始使用分布式数据并行处理
- en: This tutorial provides a short and gentle intro to PyTorch DistributedDataParallel.
id: totrans-15
prefs: []
type: TYPE_NORMAL
zh: 本教程为PyTorch DistributedData Parallel提供了简短而温和的介绍。
- en: Code
id: totrans-16
prefs: []
type: TYPE_NORMAL
zh: 代码
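The linked code covers the full tutorial; as a hedged, single-process sketch of the core pattern (world_size=1 on the gloo/CPU backend, with a placeholder address and port; real jobs launch one process per GPU), DDP wraps the module and synchronizes gradients during `backward()`:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process stand-in for a multi-process launch; address/port are placeholders.
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Linear(10, 10)
ddp_model = DDP(model)  # gradients are all-reduced across ranks in backward()

optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
optimizer.zero_grad()
ddp_model(torch.randn(20, 10)).sum().backward()
optimizer.step()

dist.destroy_process_group()
```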
- en: Distributed Training with Uneven Inputs Using the Join Context Manager
id: totrans-17
prefs: []
type: TYPE_NORMAL
zh: 使用Join上下文管理器进行不均匀输入的分布式训练
- en: This tutorial describes the Join context manager and demonstrates its use with
DistributedDataParallel.
id: totrans-18
prefs: []
type: TYPE_NORMAL
zh: 本教程描述了Join上下文管理器,并演示了如何与DistributedData Parallel一起使用。
- en: 'Code ## Learn FSDP'
id: totrans-19
prefs: []
type: TYPE_NORMAL
zh: '代码 ## 学习FSDP'
- en: Getting Started with FSDP
id: totrans-20
prefs: []
type: TYPE_NORMAL
zh: 开始使用FSDP
- en: This tutorial demonstrates how you can perform distributed training with FSDP
on a MNIST dataset.
id: totrans-21
prefs: []
type: TYPE_NORMAL
zh: 本教程演示了如何在MNIST数据集上使用FSDP进行分布式训练。
- en: Code
id: totrans-22
prefs: []
type: TYPE_NORMAL
zh: 代码
- en: FSDP Advanced
id: totrans-23
prefs: []
type: TYPE_NORMAL
zh: FSDP 高级
- en: In this tutorial, you will learn how to fine-tune a HuggingFace (HF) T5 model
with FSDP for text summarization.
id: totrans-24
prefs: []
type: TYPE_NORMAL
zh: 在本教程中,您将学习如何使用FSDP对HuggingFace(HF)T5模型进行微调,用于文本摘要。
- en: 'Code ## Learn DeviceMesh'
id: totrans-25
prefs: []
type: TYPE_NORMAL
zh: '代码 ## 学习DeviceMesh'
- en: Getting Started with DeviceMesh
id: totrans-26
prefs: []
type: TYPE_NORMAL
zh: 开始使用DeviceMesh
- en: In this tutorial you will learn about DeviceMesh and how it can help with distributed
training.
id: totrans-27
prefs: []
type: TYPE_NORMAL
zh: 在本教程中,您将了解DeviceMesh以及它如何帮助进行分布式训练。
- en: 'Code ## Learn RPC'
id: totrans-28
prefs: []
type: TYPE_NORMAL
zh: '代码 ## 学习RPC'
- en: Getting Started with Distributed RPC Framework
id: totrans-29
prefs: []
type: TYPE_NORMAL
zh: 开始使用分布式RPC框架
- en: This tutorial demonstrates how to get started with RPC-based distributed training.
id: totrans-30
prefs: []
type: TYPE_NORMAL
zh: 本教程演示了如何开始使用基于RPC的分布式训练。
- en: Code
id: totrans-31
prefs: []
type: TYPE_NORMAL
zh: 代码
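As a hedged sketch of the RPC basics (a single process acting as its own peer; the worker name and port are illustrative):

```python
import os
import torch
import torch.distributed.rpc as rpc

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29501"  # placeholder port

# world_size=1: the worker sends an RPC to itself. Real setups call
# init_rpc on every process with a distinct rank.
rpc.init_rpc("worker0", rank=0, world_size=1)

# Run torch.add on the (remote) worker and block for the result.
ret = rpc.rpc_sync("worker0", torch.add, args=(torch.ones(2), torch.ones(2)))

rpc.shutdown()
```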
- en: Implementing a Parameter Server Using Distributed RPC Framework
id: totrans-32
prefs: []
type: TYPE_NORMAL
zh: 使用分布式RPC框架实现参数服务器
- en: This tutorial walks you through a simple example of implementing a parameter
server using PyTorch’s Distributed RPC framework.
id: totrans-33
prefs: []
type: TYPE_NORMAL
zh: 本教程将带您完成一个简单的示例,使用PyTorch的分布式RPC框架实现参数服务器。
- en: Code
id: totrans-34
prefs: []
type: TYPE_NORMAL
zh: 代码
- en: Implementing Batch RPC Processing Using Asynchronous Executions
id: totrans-35
prefs: []
type: TYPE_NORMAL
zh: 使用异步执行实现批处理RPC处理
- en: In this tutorial you will build batch-processing RPC applications with the @rpc.functions.async_execution
decorator.
id: totrans-36
prefs: []
type: TYPE_NORMAL
zh: 在本教程中,您将使用@rpc.functions.async_execution装饰器构建批处理RPC应用程序。
- en: Code
id: totrans-37
prefs: []
type: TYPE_NORMAL
zh: 代码
- en: Combining Distributed DataParallel with Distributed RPC Framework
id: totrans-38
prefs: []
type: TYPE_NORMAL
zh: 将分布式DataParallel与分布式RPC框架结合
- en: In this tutorial you will learn how to combine distributed data parallelism
with distributed model parallelism.
id: totrans-39
prefs: []
type: TYPE_NORMAL
zh: 在本教程中,您将学习如何将分布式数据并行性与分布式模型并行性结合起来。
- en: 'Code ## Custom Extensions'
id: totrans-40
prefs: []
type: TYPE_NORMAL
zh: '代码 ## 自定义扩展'
- en: Customize Process Group Backends Using Cpp Extensions
id: totrans-41
prefs: []
type: TYPE_NORMAL
zh: 使用Cpp扩展自定义Process Group后端
- en: In this tutorial you will learn to implement a custom ProcessGroup backend and
plug that into PyTorch distributed package using cpp extensions.
id: totrans-42
prefs: []
type: TYPE_NORMAL
zh: 在本教程中,您将学习如何实现自定义的ProcessGroup后端,并将其插入到PyTorch分布式包中使用cpp扩展。
- en: Code
id: totrans-43
prefs: []
type: TYPE_NORMAL
zh: 代码
- en: PyTorch Distributed Overview
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
- en: 原文:[https://pytorch.org/tutorials/beginner/dist_overview.html](https://pytorch.org/tutorials/beginner/dist_overview.html)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
- en: '**Author**: [Shen Li](https://mrshenli.github.io/)'
id: totrans-2
prefs: []
type: TYPE_NORMAL
- en: Note
id: totrans-3
prefs: []
type: TYPE_NORMAL
- en: '[![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)](../_images/pencil-16.png)
View and edit this tutorial in [github](https://github.com/pytorch/tutorials/blob/main/beginner_source/dist_overview.rst).'
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png) View and edit this
tutorial in [github](https://github.com/pytorch/tutorials/blob/main/beginner_source/dist_overview.rst).'
id: totrans-4
prefs: []
type: TYPE_NORMAL
- en: This is the overview page for the `torch.distributed` package. The goal of this
......@@ -21,14 +26,17 @@
of them. If this is your first time building distributed training applications
using PyTorch, it is recommended to use this document to navigate to the technology
that can best serve your use case.
id: totrans-5
prefs: []
type: TYPE_NORMAL
- en: Introduction
id: totrans-6
prefs:
- PREF_H2
type: TYPE_NORMAL
- en: 'As of PyTorch v1.6.0, features in `torch.distributed` can be categorized into
three main components:'
id: totrans-7
prefs: []
type: TYPE_NORMAL
- en: '[Distributed Data-Parallel Training](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)
......@@ -37,6 +45,7 @@
fed with a different set of input data samples. DDP takes care of gradient communication
to keep model replicas synchronized and overlaps it with the gradient computations
to speed up training.'
id: totrans-8
prefs:
- PREF_UL
type: TYPE_NORMAL
......@@ -46,6 +55,7 @@
and combinations of DDP with other training paradigms. It helps manage remote
object lifetime and extends the [autograd engine](https://pytorch.org/docs/stable/autograd.html)
beyond machine boundaries.'
id: totrans-9
prefs:
- PREF_UL
type: TYPE_NORMAL
......@@ -67,60 +77,78 @@
it also gives up the performance optimizations offered by DDP. [Writing Distributed
Applications with PyTorch](../intermediate/dist_tuto.html) shows examples of using
c10d communication APIs.'
id: totrans-10
prefs:
- PREF_UL
type: TYPE_NORMAL
- en: Data Parallel Training
id: totrans-11
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 数据并行训练
- en: 'PyTorch provides several options for data-parallel training. For applications
that gradually grow from simple to complex and from prototype to production, the
common development trajectory would be:'
id: totrans-12
prefs: []
type: TYPE_NORMAL
zh: PyTorch提供了几种数据并行训练的选项。对于从简单到复杂、从原型到生产逐渐增长的应用程序,常见的开发轨迹是:
- en: Use single-device training if the data and model can fit in one GPU, and training
speed is not a concern.
id: totrans-13
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 如果数据和模型可以适应一个GPU,并且训练速度不是问题,可以使用单设备训练。
- en: Use single-machine multi-GPU [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html)
to make use of multiple GPUs on a single machine to speed up training with minimal
code changes.
id: totrans-14
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 使用单机多GPU [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html)
来利用单台机器上的多个GPU加速训练,只需进行最少的代码更改。
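As a one-line illustration of the minimal code change (layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(10, 5).to(device)

# One line wraps the model; inputs are scattered across visible GPUs and the
# outputs gathered back. With zero or one GPU it simply runs the module.
dp_model = nn.DataParallel(model)

out = dp_model(torch.randn(16, 10).to(device))
```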
- en: Use single-machine multi-GPU [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html),
if you would like to further speed up training and are willing to write a little
more code to set it up.
id: totrans-15
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 如果您希望进一步加快训练速度并愿意写更多代码来设置,可以使用单机多GPU [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)。
- en: Use multi-machine [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)
and the [launching script](https://github.com/pytorch/examples/blob/master/distributed/ddp/README.md),
if the application needs to scale across machine boundaries.
id: totrans-16
prefs:
- PREF_OL
type: TYPE_NORMAL
- en: Use multi-GPU [FullyShardedDataParallel](https://pytorch.org/docs/stable/fsdp.html)
training on a single-machine or multi-machine when the data and model cannot fit
on one GPU.
id: totrans-17
prefs:
- PREF_OL
type: TYPE_NORMAL
- en: Use [torch.distributed.elastic](https://pytorch.org/docs/stable/distributed.elastic.html)
to launch distributed training if errors (e.g., out-of-memory) are expected or
if resources can join and leave dynamically during training.
id: totrans-18
prefs:
- PREF_OL
type: TYPE_NORMAL
- en: Note
id: totrans-19
prefs: []
type: TYPE_NORMAL
- en: Data-parallel training also works with [Automatic Mixed Precision (AMP)](https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-multiple-gpus).
id: totrans-20
prefs: []
type: TYPE_NORMAL
- en: '`torch.nn.DataParallel`'
id: totrans-21
prefs:
- PREF_H3
type: TYPE_NORMAL
......@@ -132,9 +160,11 @@
performance because it replicates the model in every forward pass, and its single-process
multi-thread parallelism naturally suffers from [GIL](https://wiki.python.org/moin/GlobalInterpreterLock)
contention. To get better performance, consider using [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html).'
id: totrans-22
prefs: []
type: TYPE_NORMAL
- en: '`torch.nn.parallel.DistributedDataParallel`'
id: totrans-23
prefs:
- PREF_H3
type: TYPE_NORMAL
......@@ -146,43 +176,58 @@
of in every forward pass, which also helps to speed up training. DDP is shipped
with several performance optimization technologies. For a more in-depth explanation,
refer to this [paper](http://www.vldb.org/pvldb/vol13/p3005-li.pdf) (VLDB’20).
id: totrans-24
prefs: []
type: TYPE_NORMAL
zh: 与[DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html)相比,[DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)需要多一步设置,即调用[init_process_group](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group)。
DDP使用多进程并行,因此模型副本之间没有GIL争用。此外,模型在DDP构建时进行广播,而不是在每次前向传递中进行广播,这也有助于加快训练速度。 DDP配备了几种性能优化技术。有关更深入的解释,请参考这篇[论文](http://www.vldb.org/pvldb/vol13/p3005-li.pdf)(VLDB’20)。
- en: 'DDP materials are listed below:'
id: totrans-25
prefs: []
type: TYPE_NORMAL
zh: DDP材料如下:
- en: '[DDP notes](https://pytorch.org/docs/stable/notes/ddp.html) offer a starter
example and some brief descriptions of its design and implementation. If this
is your first time using DDP, start from this document.'
id: totrans-26
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[DDP笔记](https://pytorch.org/docs/stable/notes/ddp.html) 提供了一个入门示例以及对其设计和实现的简要描述。如果这是您第一次使用DDP,请从这个文档开始。'
- en: '[Getting Started with Distributed Data Parallel](../intermediate/ddp_tutorial.html)
explains some common problems with DDP training, including unbalanced workload,
checkpointing, and multi-device models. Note that, DDP can be easily combined
with single-machine multi-device model parallelism which is described in the [Single-Machine
Model Parallel Best Practices](../intermediate/model_parallel_tutorial.html) tutorial.'
id: totrans-27
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[使用分布式数据并行开始](../intermediate/ddp_tutorial.html) 解释了DDP训练中的一些常见问题,包括负载不平衡、检查点和多设备模型。请注意,DDP可以很容易地与单机多设备模型并行结合,该模型并行在[单机模型并行最佳实践](../intermediate/model_parallel_tutorial.html)教程中有描述。'
- en: The [Launching and configuring distributed data parallel applications](https://github.com/pytorch/examples/blob/main/distributed/ddp/README.md)
document shows how to use the DDP launching script.
id: totrans-28
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[启动和配置分布式数据并行应用程序](https://github.com/pytorch/examples/blob/main/distributed/ddp/README.md)
文档展示了如何使用DDP启动脚本。'
- en: The [Shard Optimizer States With ZeroRedundancyOptimizer](../recipes/zero_redundancy_optimizer.html)
recipe demonstrates how [ZeroRedundancyOptimizer](https://pytorch.org/docs/stable/distributed.optim.html)
helps to reduce optimizer memory footprint.
id: totrans-29
prefs:
- PREF_OL
type: TYPE_NORMAL
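A rough single-process sketch of the idea (gloo backend, arbitrary layer size, placeholder port); each rank keeps only a shard of the optimizer state instead of a full replica:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.optim import ZeroRedundancyOptimizer

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29503"  # placeholder port
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Linear(100, 100)
# Wrap a regular optimizer class; its per-parameter state is sharded across ranks.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.Adam,
    lr=0.01,
)
model(torch.randn(4, 100)).sum().backward()
optimizer.step()

dist.destroy_process_group()
```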
- en: The [Distributed Training with Uneven Inputs Using the Join Context Manager](../advanced/generic_join.html)
tutorial walks through using the generic join context for distributed training
with uneven inputs.
id: totrans-30
prefs:
- PREF_OL
type: TYPE_NORMAL
- en: '`torch.distributed.FullyShardedDataParallel`'
id: totrans-31
prefs:
- PREF_H3
type: TYPE_NORMAL
......@@ -192,9 +237,11 @@
data-parallel workers. The support for FSDP was added starting PyTorch v1.11.
The tutorial [Getting Started with FSDP](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html)
provides in depth explanation and example of how FSDP works.
id: totrans-32
prefs: []
type: TYPE_NORMAL
- en: torch.distributed.elastic
id: totrans-33
prefs:
- PREF_H3
type: TYPE_NORMAL
......@@ -208,9 +255,11 @@
(mismatched `AllReduce` operations) which would then cause a crash or hang. [torch.distributed.elastic](https://pytorch.org/docs/stable/distributed.elastic.html)
adds fault tolerance and the ability to make use of a dynamic pool of machines
(elasticity).
id: totrans-34
prefs: []
type: TYPE_NORMAL
- en: RPC-Based Distributed Training
id: totrans-35
prefs:
- PREF_H2
type: TYPE_NORMAL
......@@ -218,20 +267,24 @@
paradigm, distributed pipeline parallelism, reinforcement learning applications
with multiple observers or agents, etc. [torch.distributed.rpc](https://pytorch.org/docs/stable/rpc.html)
aims at supporting general distributed training scenarios.
id: totrans-36
prefs: []
type: TYPE_NORMAL
- en: '[torch.distributed.rpc](https://pytorch.org/docs/stable/rpc.html) has four
main pillars:'
id: totrans-37
prefs: []
type: TYPE_NORMAL
- en: '[RPC](https://pytorch.org/docs/stable/rpc.html#rpc) supports running a given
function on a remote worker.'
id: totrans-38
prefs:
- PREF_UL
type: TYPE_NORMAL
- en: '[RRef](https://pytorch.org/docs/stable/rpc.html#rref) helps to manage the lifetime
of a remote object. The reference counting protocol is presented in the [RRef
notes](https://pytorch.org/docs/stable/rpc/rref.html#remote-reference-protocol).'
id: totrans-39
prefs:
- PREF_UL
type: TYPE_NORMAL
......@@ -239,28 +292,33 @@
extends the autograd engine beyond machine boundaries. Please refer to [Distributed
Autograd Design](https://pytorch.org/docs/stable/rpc/distributed_autograd.html#distributed-autograd-design)
for more details.'
id: totrans-40
prefs:
- PREF_UL
type: TYPE_NORMAL
- en: '[Distributed Optimizer](https://pytorch.org/docs/stable/rpc.html#module-torch.distributed.optim)
automatically reaches out to all participating workers to update parameters using
gradients computed by the distributed autograd engine.'
id: totrans-41
prefs:
- PREF_UL
type: TYPE_NORMAL
- en: 'RPC Tutorials are listed below:'
id: totrans-42
prefs: []
type: TYPE_NORMAL
- en: The [Getting Started with Distributed RPC Framework](../intermediate/rpc_tutorial.html)
tutorial first uses a simple Reinforcement Learning (RL) example to demonstrate
RPC and RRef. Then, it applies a basic distributed model parallelism to an RNN
example to show how to use distributed autograd and distributed optimizer.
id: totrans-43
prefs:
- PREF_OL
type: TYPE_NORMAL
- en: The [Implementing a Parameter Server Using Distributed RPC Framework](../intermediate/rpc_param_server_tutorial.html)
tutorial borrows the spirit of [HogWild! training](https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf)
and applies it to an asynchronous parameter server (PS) training application.
id: totrans-44
prefs:
- PREF_OL
type: TYPE_NORMAL
......@@ -268,6 +326,7 @@
tutorial extends the single-machine pipeline parallel example (presented in [Single-Machine
Model Parallel Best Practices](../intermediate/model_parallel_tutorial.html))
to a distributed environment and shows how to implement it using RPC.
id: totrans-45
prefs:
- PREF_OL
type: TYPE_NORMAL
......@@ -275,20 +334,24 @@
tutorial demonstrates how to implement RPC batch processing using the [@rpc.functions.async_execution](https://pytorch.org/docs/stable/rpc.html#torch.distributed.rpc.functions.async_execution)
decorator, which can help speed up inference and training. It uses RL and PS examples
similar to those in the above tutorials 1 and 2.
id: totrans-46
prefs:
- PREF_OL
type: TYPE_NORMAL
- en: The [Combining Distributed DataParallel with Distributed RPC Framework](../advanced/rpc_ddp_tutorial.html)
tutorial demonstrates how to combine DDP with RPC to train a model using distributed
data parallelism combined with distributed model parallelism.
id: totrans-47
prefs:
- PREF_OL
type: TYPE_NORMAL
- en: PyTorch Distributed Developers
id: totrans-48
prefs:
- PREF_H2
type: TYPE_NORMAL
- en: If you’d like to contribute to PyTorch Distributed, please refer to our [Developer
Guide](https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md).
id: totrans-49
prefs: []
type: TYPE_NORMAL
......@@ -15,7 +15,7 @@
- en: Note
prefs: []
type: TYPE_NORMAL
- en: '[![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)](../_images/pencil-16.png)
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)
View and edit this tutorial in [github](https://github.com/pytorch/tutorials/blob/main/intermediate_source/ddp_tutorial.rst).'
prefs: []
type: TYPE_NORMAL
......
......@@ -12,7 +12,7 @@
- en: Note
prefs: []
type: TYPE_NORMAL
- en: '[![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)](../_images/pencil-16.png)
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)
View and edit this tutorial in [github](https://github.com/pytorch/tutorials/blob/main/intermediate_source/dist_tuto.rst).'
prefs: []
type: TYPE_NORMAL
......@@ -68,7 +68,7 @@
prefs:
- PREF_H2
type: TYPE_NORMAL
- en: '[![Send and Recv](../Images/f29264b289639882a61fb5c3447b1ecc.png)](../_images/send_recv.png)'
- en: '![Send and Recv](../Images/f29264b289639882a61fb5c3447b1ecc.png)'
prefs: []
type: TYPE_NORMAL
- en: Send and Recv
......@@ -126,13 +126,13 @@
prefs:
- PREF_H2
type: TYPE_NORMAL
- en: '| [![Scatter](../Images/3aa3584628cb0526c8b0e9d02b15d876.png)](../_images/scatter.png)'
- en: '| ![Scatter](../Images/3aa3584628cb0526c8b0e9d02b15d876.png)'
prefs: []
type: TYPE_NORMAL
- en: Scatter
prefs: []
type: TYPE_NORMAL
- en: '| [![Gather](../Images/7e8670a3b7cdc7848394514ef1da090a.png)](../_images/gather.png)'
- en: '| ![Gather](../Images/7e8670a3b7cdc7848394514ef1da090a.png)'
prefs: []
type: TYPE_NORMAL
- en: Gather
......@@ -141,13 +141,13 @@
- en: '|'
prefs: []
type: TYPE_NORMAL
- en: '| [![Reduce](../Images/1c451df4406aea85e640d1ae7df6df31.png)](../_images/reduce.png)'
- en: '| ![Reduce](../Images/1c451df4406aea85e640d1ae7df6df31.png)'
prefs: []
type: TYPE_NORMAL
- en: Reduce
prefs: []
type: TYPE_NORMAL
- en: '| [![All-Reduce](../Images/0ef9693f0008d5a75aa5ac2b542b83ac.png)](../_images/all_reduce.png)'
- en: '| ![All-Reduce](../Images/0ef9693f0008d5a75aa5ac2b542b83ac.png)'
prefs: []
type: TYPE_NORMAL
- en: All-Reduce
......@@ -156,13 +156,13 @@
- en: '|'
prefs: []
type: TYPE_NORMAL
- en: '| [![Broadcast](../Images/525847c9d4b48933cb231204a2d13e0e.png)](../_images/broadcast.png)'
- en: '| ![Broadcast](../Images/525847c9d4b48933cb231204a2d13e0e.png)'
prefs: []
type: TYPE_NORMAL
- en: Broadcast
prefs: []
type: TYPE_NORMAL
- en: '| [![All-Gather](../Images/4a48977cd9545f897942a4a4ef1175ac.png)](../_images/all_gather.png)'
- en: '| ![All-Gather](../Images/4a48977cd9545f897942a4a4ef1175ac.png)'
prefs: []
type: TYPE_NORMAL
- en: All-Gather
......
......@@ -13,7 +13,7 @@
- en: Note
prefs: []
type: TYPE_NORMAL
- en: '[![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)](../_images/pencil-16.png)
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)
View and edit this tutorial in [github](https://github.com/pytorch/tutorials/blob/main/intermediate_source/FSDP_tutorial.rst).'
prefs: []
type: TYPE_NORMAL
......@@ -49,7 +49,7 @@
reduced by internal optimizations like overlapping communication and computation.
prefs: []
type: TYPE_NORMAL
- en: '[![FSDP workflow](../Images/4e33f1b27db65dbfcbcf54cce427e858.png)](../_images/fsdp_workflow.png)'
- en: '![FSDP workflow](../Images/4e33f1b27db65dbfcbcf54cce427e858.png)'
prefs: []
type: TYPE_NORMAL
- en: FSDP Workflow
......@@ -109,7 +109,7 @@
to collect and combine the updated parameter shards.
prefs: []
type: TYPE_NORMAL
- en: '[![FSDP allreduce](../Images/0e1d2209fe5b011d7237cb607289d4f1.png)](../_images/fsdp_sharding.png)'
- en: '![FSDP allreduce](../Images/0e1d2209fe5b011d7237cb607289d4f1.png)'
prefs: []
type: TYPE_NORMAL
- en: FSDP Allreduce
......@@ -210,7 +210,7 @@
AWS EC2 instance with 4 GPUs captured from PyTorch Profiler.
prefs: []
type: TYPE_NORMAL
- en: '[![FSDP peak memory](../Images/c26c3d052bcb9f32ea5c7b3d9500d97a.png)](../_images/FSDP_memory.gif)'
- en: '![FSDP peak memory](../Images/c26c3d052bcb9f32ea5c7b3d9500d97a.png)'
prefs: []
type: TYPE_NORMAL
- en: FSDP Peak Memory Usage
......@@ -265,7 +265,7 @@
compared to FSDP without auto wrap policy applied, from ~75 MB to 66 MB.
prefs: []
type: TYPE_NORMAL
- en: '[![FSDP peak memory](../Images/62842d10a3954d2d247fca536a0d7bfe.png)](../_images/FSDP_autowrap.gif)'
- en: '![FSDP peak memory](../Images/62842d10a3954d2d247fca536a0d7bfe.png)'
prefs: []
type: TYPE_NORMAL
- en: FSDP Peak Memory Usage using Auto_wrap policy
......@@ -309,7 +309,7 @@
AWS EC2 instance with 4 GPUs captured from PyTorch profiler.
prefs: []
type: TYPE_NORMAL
- en: '[![FSDP peak memory](../Images/b7af7a69ededd6326e3de004bb7b1e43.png)](../_images/DDP_memory.gif)'
- en: '![FSDP peak memory](../Images/b7af7a69ededd6326e3de004bb7b1e43.png)'
prefs: []
type: TYPE_NORMAL
- en: DDP Peak Memory Usage using Auto_wrap policy
......
......@@ -13,7 +13,7 @@
- en: Note
prefs: []
type: TYPE_NORMAL
- en: '[![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)](../_images/pencil-16.png)
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)
View and edit this tutorial in [github](https://github.com/pytorch/tutorials/blob/main/intermediate_source/process_group_cpp_extension_tutorial.rst).'
prefs: []
type: TYPE_NORMAL
......
......@@ -12,7 +12,7 @@
- en: Note
prefs: []
type: TYPE_NORMAL
- en: '[![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)](../_images/pencil-16.png)
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)
View and edit this tutorial in [github](https://github.com/pytorch/tutorials/blob/main/intermediate_source/rpc_tutorial.rst).'
prefs: []
type: TYPE_NORMAL
......
......@@ -12,7 +12,7 @@
- en: Note
prefs: []
type: TYPE_NORMAL
- en: '[![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)](../_images/pencil-16.png)
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)
View and edit this tutorial in [github](https://github.com/pytorch/tutorials/blob/main/intermediate_source/rpc_param_server_tutorial.rst).'
prefs: []
type: TYPE_NORMAL
......
......@@ -12,7 +12,7 @@
- en: Note
prefs: []
type: TYPE_NORMAL
- en: '[![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)](../_images/pencil-16.png)
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)
View and edit this tutorial in [github](https://github.com/pytorch/tutorials/blob/main/intermediate_source/dist_pipeline_parallel_tutorial.rst).'
prefs: []
type: TYPE_NORMAL
......
......@@ -12,7 +12,7 @@
- en: Note
prefs: []
type: TYPE_NORMAL
- en: '[![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)](../_images/pencil-16.png)
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)
View and edit this tutorial in [github](https://github.com/pytorch/tutorials/blob/main/intermediate_source/rpc_async_execution.rst).'
prefs: []
type: TYPE_NORMAL
......
......@@ -12,7 +12,7 @@
- en: Note
prefs: []
type: TYPE_NORMAL
- en: '[![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)](../_images/pencil-16.png)
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)
View and edit this tutorial in [github](https://github.com/pytorch/tutorials/blob/main/advanced_source/rpc_ddp_tutorial.rst).'
prefs: []
type: TYPE_NORMAL
......
......@@ -12,7 +12,7 @@
- en: Note
prefs: []
type: TYPE_NORMAL
- en: '[![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)](../_images/pencil-16.png)
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)
View and edit this tutorial in [github](https://github.com/pytorch/tutorials/blob/main/advanced_source/generic_join.rst).'
prefs: []
type: TYPE_NORMAL
......
......@@ -261,8 +261,8 @@
see the following screens:'
prefs: []
type: TYPE_NORMAL
- en: '[![../_images/deeplabv3_ios.png](../Images/9ac919407ef21251c34a31f8fc79bd32.png)](../_images/deeplabv3_ios.png)
[![../_images/deeplabv3_ios2.png](../Images/48e025cda7e2c4c6a8cfe2a933cfd4f0.png)](../_images/deeplabv3_ios2.png)'
- en: '![../_images/deeplabv3_ios.png](../Images/9ac919407ef21251c34a31f8fc79bd32.png)
![../_images/deeplabv3_ios2.png](../Images/48e025cda7e2c4c6a8cfe2a933cfd4f0.png)'
prefs: []
type: TYPE_NORMAL
- en: Recap
......
......@@ -278,8 +278,8 @@
you will see screens like the following:'
prefs: []
type: TYPE_NORMAL
- en: '[![../_images/deeplabv3_android.png](../Images/1b0ecd17a6617abde8eb2e7e3409bbd0.png)](../_images/deeplabv3_android.png)
[![../_images/deeplabv3_android2.png](../Images/01e9b7b7725f15ac40b77b270306d4f8.png)](../_images/deeplabv3_android2.png)'
- en: '![../_images/deeplabv3_android.png](../Images/1b0ecd17a6617abde8eb2e7e3409bbd0.png)
![../_images/deeplabv3_android2.png](../Images/01e9b7b7725f15ac40b77b270306d4f8.png)'
prefs: []
type: TYPE_NORMAL
- en: Recap
......