Commit 26c2d4ae authored by 绝不原创的飞龙

2024-02-04 13:17:19

Parent 6a912f80
......@@ -212,11 +212,11 @@
prefs: []
type: TYPE_NORMAL
zh: 该函数返回一个数据框元组。第一个数据框包含每个 rank 上每个流的各类别空闲时间。
- en: '[![../_images/idle_time.png](../Images/804d1bbaf4c125dff21648945b3082ff.png)](../_images/idle_time.png)'
- en: '![../_images/idle_time.png](../Images/804d1bbaf4c125dff21648945b3082ff.png)'
id: totrans-34
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/idle_time.png](../Images/804d1bbaf4c125dff21648945b3082ff.png)](../_images/idle_time.png)'
zh: '![../_images/idle_time.png](../Images/804d1bbaf4c125dff21648945b3082ff.png)'
- en: The second dataframe is generated when `show_idle_interval_stats` is set to
`True`. It contains the summary statistics of the idle time for each stream on
each rank.
......@@ -224,11 +224,11 @@
prefs: []
type: TYPE_NORMAL
zh: 第二个数据框是在将`show_idle_interval_stats`设置为`True`时生成的。它包含每个流在每个rank上的空闲时间的摘要统计信息。
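The call that produces the two dataframes described above can be sketched as follows. This is a minimal sketch assuming the HolisticTraceAnalysis (`hta`) package is installed and that `trace_dir` points at a directory of per-rank Kineto trace files; the import is kept lazy so the sketch stands alone:

```python
def idle_time_breakdown(trace_dir):
    """Return (idle_category_df, interval_stats_df) for the traces in trace_dir.

    Sketch assuming the HolisticTraceAnalysis (hta) package is installed and
    trace_dir contains one Kineto trace file per rank.
    """
    from hta.trace_analysis import TraceAnalysis  # lazy: optional dependency

    analyzer = TraceAnalysis(trace_dir=trace_dir)
    # The second dataframe is only produced because
    # show_idle_interval_stats=True, as described above.
    return analyzer.get_idle_time_breakdown(show_idle_interval_stats=True)
```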
- en: '[![../_images/idle_time_summary.png](../Images/0d0f42e11aa0c33b2fe4b1b2dcdc3d20.png)](../_images/idle_time_summary.png)'
- en: '![../_images/idle_time_summary.png](../Images/0d0f42e11aa0c33b2fe4b1b2dcdc3d20.png)'
id: totrans-36
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/idle_time_summary.png](../Images/0d0f42e11aa0c33b2fe4b1b2dcdc3d20.png)](../_images/idle_time_summary.png)'
zh: '![../_images/idle_time_summary.png](../Images/0d0f42e11aa0c33b2fe4b1b2dcdc3d20.png)'
- en: Tip
id: totrans-37
prefs: []
......@@ -412,11 +412,11 @@
prefs: []
type: TYPE_NORMAL
zh: 该函数返回一个包含每个rank的重叠百分比的数据框。
- en: '[![../_images/overlap_df.png](../Images/22a0d906eede5591c1d5935dba1324f4.png)](../_images/overlap_df.png)'
- en: '![../_images/overlap_df.png](../Images/22a0d906eede5591c1d5935dba1324f4.png)'
id: totrans-66
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/overlap_df.png](../Images/22a0d906eede5591c1d5935dba1324f4.png)](../_images/overlap_df.png)'
zh: '![../_images/overlap_df.png](../Images/22a0d906eede5591c1d5935dba1324f4.png)'
- en: When the `visualize` argument is set to True, the [get_comm_comp_overlap](https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_comm_comp_overlap)
function also generates a bar graph representing the overlap by rank.
id: totrans-67
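The overlap dataframe and the bar graph come from a single call; a sketch under the same assumptions as before (the `hta` package is installed and `trace_dir` holds per-rank trace files):

```python
def comm_comp_overlap(trace_dir):
    """Return the per-rank communication/computation overlap dataframe.

    Sketch assuming the HolisticTraceAnalysis (hta) package is installed.
    """
    from hta.trace_analysis import TraceAnalysis  # lazy: optional dependency

    analyzer = TraceAnalysis(trace_dir=trace_dir)
    # visualize=True additionally renders the bar graph of overlap by rank.
    return analyzer.get_comm_comp_overlap(visualize=True)
```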
......@@ -474,11 +474,11 @@
prefs: []
type: TYPE_NORMAL
zh: 生成的带有增强计数器的跟踪文件的屏幕截图。
- en: '[![../_images/mem_bandwidth_queue_length.png](../Images/7b09c2f07fe7daff2c296c3c17fec795.png)](../_images/mem_bandwidth_queue_length.png)'
- en: '![../_images/mem_bandwidth_queue_length.png](../Images/7b09c2f07fe7daff2c296c3c17fec795.png)'
id: totrans-75
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/mem_bandwidth_queue_length.png](../Images/7b09c2f07fe7daff2c296c3c17fec795.png)](../_images/mem_bandwidth_queue_length.png)'
zh: '![../_images/mem_bandwidth_queue_length.png](../Images/7b09c2f07fe7daff2c296c3c17fec795.png)'
- en: 'HTA also provides a summary of the memory copy bandwidth and queue length counters
as well as the time series of the counters for the profiled portion of the code
using the following API:'
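The concrete API call was elided by the diff above; the sketch below uses the summary-method names from the HTA trace-analysis documentation, which should be treated as assumptions if your HTA version differs:

```python
def counter_summaries(trace_dir):
    """Summary statistics of memory-copy bandwidth and queue-length counters.

    Sketch: method names follow the hta trace-analysis API docs and may vary
    across HTA versions.
    """
    from hta.trace_analysis import TraceAnalysis  # lazy: optional dependency

    analyzer = TraceAnalysis(trace_dir=trace_dir)
    return (
        analyzer.get_memory_bw_summary(),
        analyzer.get_queue_length_summary(),
    )
```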
......@@ -526,11 +526,11 @@
prefs: []
type: TYPE_NORMAL
zh: 摘要包含计数、最小值、最大值、平均值、标准差以及第 25、50 和 75 百分位数。
- en: '[![../_images/queue_length_summary.png](../Images/c176e0b671c636afdb57c7dcde4ec7b2.png)](../_images/queue_length_summary.png)'
- en: '![../_images/queue_length_summary.png](../Images/c176e0b671c636afdb57c7dcde4ec7b2.png)'
id: totrans-84
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/queue_length_summary.png](../Images/c176e0b671c636afdb57c7dcde4ec7b2.png)](../_images/queue_length_summary.png)'
zh: '![../_images/queue_length_summary.png](../Images/c176e0b671c636afdb57c7dcde4ec7b2.png)'
- en: The time series only contains the points when a value changes. Once a value
is observed the time series stays constant until the next update. The memory bandwidth
and queue length time series functions return a dictionary whose key is the rank
......@@ -572,11 +572,11 @@
prefs: []
type: TYPE_NORMAL
zh: 下面给出了生成的数据框的屏幕截图。
- en: '[![../_images/cuda_kernel_launch_stats.png](../Images/f08d3cd24db3c350255e51c1217848bf.png)](../_images/cuda_kernel_launch_stats.png)'
- en: '![../_images/cuda_kernel_launch_stats.png](../Images/f08d3cd24db3c350255e51c1217848bf.png)'
id: totrans-91
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/cuda_kernel_launch_stats.png](../Images/f08d3cd24db3c350255e51c1217848bf.png)](../_images/cuda_kernel_launch_stats.png)'
zh: '![../_images/cuda_kernel_launch_stats.png](../Images/f08d3cd24db3c350255e51c1217848bf.png)'
- en: 'The duration of the CPU op, GPU kernel, and the launch delay allow us to find
the following:'
id: totrans-92
......
......@@ -90,11 +90,11 @@
prefs: []
type: TYPE_NORMAL
zh: 我们可以看到 x 的梯度本身是 x 的函数(dout/dx = 2x),并且该函数的计算图已被正确构建。
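A minimal runnable sketch of the situation described here. `Square` is an illustrative custom function whose backward is composed of differentiable ops, so autograd can build the graph for dout/dx = 2x and differentiate it a second time:

```python
import torch


class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x ** 2

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Built from differentiable ops, so double backward works out of the box.
        return grad_out * 2 * x


x = torch.tensor(3.0, requires_grad=True)
y = Square.apply(x)
# create_graph=True records the backward pass so it can itself be differentiated.
(g,) = torch.autograd.grad(y, x, create_graph=True)  # dy/dx = 2x
(gg,) = torch.autograd.grad(g, x)                    # d2y/dx2 = 2
```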
- en: '[![https://user-images.githubusercontent.com/13428986/126559699-e04f3cb1-aaf2-4a9a-a83d-b8767d04fbd9.png](../Images/664c9393ebdb32f044c3ab5f5780b3f7.png)](https://user-images.githubusercontent.com/13428986/126559699-e04f3cb1-aaf2-4a9a-a83d-b8767d04fbd9.png)'
- en: '![https://user-images.githubusercontent.com/13428986/126559699-e04f3cb1-aaf2-4a9a-a83d-b8767d04fbd9.png](../Images/664c9393ebdb32f044c3ab5f5780b3f7.png)'
id: totrans-14
prefs: []
type: TYPE_NORMAL
zh: '[![https://user-images.githubusercontent.com/13428986/126559699-e04f3cb1-aaf2-4a9a-a83d-b8767d04fbd9.png](../Images/664c9393ebdb32f044c3ab5f5780b3f7.png)](https://user-images.githubusercontent.com/13428986/126559699-e04f3cb1-aaf2-4a9a-a83d-b8767d04fbd9.png)'
zh: '![https://user-images.githubusercontent.com/13428986/126559699-e04f3cb1-aaf2-4a9a-a83d-b8767d04fbd9.png](../Images/664c9393ebdb32f044c3ab5f5780b3f7.png)'
- en: Saving the Outputs
id: totrans-15
prefs:
......@@ -122,11 +122,11 @@
prefs: []
type: TYPE_PRE
zh: '[PRE3]'
- en: '[![https://user-images.githubusercontent.com/13428986/126559780-d141f2ba-1ee8-4c33-b4eb-c9877b27a954.png](../Images/7ab379f6d65d456373fdf6a3cdb35b1a.png)](https://user-images.githubusercontent.com/13428986/126559780-d141f2ba-1ee8-4c33-b4eb-c9877b27a954.png)'
- en: '![https://user-images.githubusercontent.com/13428986/126559780-d141f2ba-1ee8-4c33-b4eb-c9877b27a954.png](../Images/7ab379f6d65d456373fdf6a3cdb35b1a.png)'
id: totrans-20
prefs: []
type: TYPE_NORMAL
zh: '[![https://user-images.githubusercontent.com/13428986/126559780-d141f2ba-1ee8-4c33-b4eb-c9877b27a954.png](../Images/7ab379f6d65d456373fdf6a3cdb35b1a.png)](https://user-images.githubusercontent.com/13428986/126559780-d141f2ba-1ee8-4c33-b4eb-c9877b27a954.png)'
zh: '![https://user-images.githubusercontent.com/13428986/126559780-d141f2ba-1ee8-4c33-b4eb-c9877b27a954.png](../Images/7ab379f6d65d456373fdf6a3cdb35b1a.png)'
- en: Saving Intermediate Results
id: totrans-21
prefs:
......@@ -174,11 +174,11 @@
prefs: []
type: TYPE_PRE
zh: '[PRE5]'
- en: '[![https://user-images.githubusercontent.com/13428986/126560494-e48eba62-be84-4b29-8c90-a7f6f40b1438.png](../Images/66f87d1f09778a82307fefa72409569c.png)](https://user-images.githubusercontent.com/13428986/126560494-e48eba62-be84-4b29-8c90-a7f6f40b1438.png)'
- en: '![https://user-images.githubusercontent.com/13428986/126560494-e48eba62-be84-4b29-8c90-a7f6f40b1438.png](../Images/66f87d1f09778a82307fefa72409569c.png)'
id: totrans-29
prefs: []
type: TYPE_NORMAL
zh: '[![https://user-images.githubusercontent.com/13428986/126560494-e48eba62-be84-4b29-8c90-a7f6f40b1438.png](../Images/66f87d1f09778a82307fefa72409569c.png)](https://user-images.githubusercontent.com/13428986/126560494-e48eba62-be84-4b29-8c90-a7f6f40b1438.png)'
zh: '![https://user-images.githubusercontent.com/13428986/126560494-e48eba62-be84-4b29-8c90-a7f6f40b1438.png](../Images/66f87d1f09778a82307fefa72409569c.png)'
- en: 'Saving Intermediate Results: What not to do'
id: totrans-30
prefs:
......@@ -207,11 +207,11 @@
prefs: []
type: TYPE_PRE
zh: '[PRE7]'
- en: '[![https://user-images.githubusercontent.com/13428986/126565889-13992f01-55bc-411a-8aee-05b721fe064a.png](../Images/c57a22a13ed99e177d45732c5bcc36ff.png)](https://user-images.githubusercontent.com/13428986/126565889-13992f01-55bc-411a-8aee-05b721fe064a.png)'
- en: '![https://user-images.githubusercontent.com/13428986/126565889-13992f01-55bc-411a-8aee-05b721fe064a.png](../Images/c57a22a13ed99e177d45732c5bcc36ff.png)'
id: totrans-35
prefs: []
type: TYPE_NORMAL
zh: '[![https://user-images.githubusercontent.com/13428986/126565889-13992f01-55bc-411a-8aee-05b721fe064a.png](../Images/c57a22a13ed99e177d45732c5bcc36ff.png)](https://user-images.githubusercontent.com/13428986/126565889-13992f01-55bc-411a-8aee-05b721fe064a.png)'
zh: '![https://user-images.githubusercontent.com/13428986/126565889-13992f01-55bc-411a-8aee-05b721fe064a.png](../Images/c57a22a13ed99e177d45732c5bcc36ff.png)'
- en: When Backward is not Tracked
id: totrans-36
prefs:
......@@ -242,11 +242,11 @@
prefs: []
type: TYPE_PRE
zh: '[PRE9]'
- en: '[![https://user-images.githubusercontent.com/13428986/126559935-74526b4d-d419-4983-b1f0-a6ee99428531.png](../Images/44368555f30978a287e8a47e0cfff9ee.png)](https://user-images.githubusercontent.com/13428986/126559935-74526b4d-d419-4983-b1f0-a6ee99428531.png)'
- en: '![https://user-images.githubusercontent.com/13428986/126559935-74526b4d-d419-4983-b1f0-a6ee99428531.png](../Images/44368555f30978a287e8a47e0cfff9ee.png)'
id: totrans-41
prefs: []
type: TYPE_NORMAL
zh: '[![https://user-images.githubusercontent.com/13428986/126559935-74526b4d-d419-4983-b1f0-a6ee99428531.png](../Images/44368555f30978a287e8a47e0cfff9ee.png)](https://user-images.githubusercontent.com/13428986/126559935-74526b4d-d419-4983-b1f0-a6ee99428531.png)'
zh: '![https://user-images.githubusercontent.com/13428986/126559935-74526b4d-d419-4983-b1f0-a6ee99428531.png](../Images/44368555f30978a287e8a47e0cfff9ee.png)'
- en: To conclude, whether double backward works for your custom function simply depends
on whether the backward pass can be tracked by autograd. With the first two examples
we show situations where double backward works out of the box. With the third
......
......@@ -317,11 +317,11 @@
- PREF_UL
type: TYPE_NORMAL
zh: 概述
- en: '[![../_static/img/profiler_overview1.png](../Images/7bf5bbd17de6da63afc38b29b8c8f0d2.png)](../_static/img/profiler_overview1.png)'
- en: '![../_static/img/profiler_overview1.png](../Images/7bf5bbd17de6da63afc38b29b8c8f0d2.png)'
id: totrans-53
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_overview1.png](../Images/7bf5bbd17de6da63afc38b29b8c8f0d2.png)](../_static/img/profiler_overview1.png)'
zh: '![../_static/img/profiler_overview1.png](../Images/7bf5bbd17de6da63afc38b29b8c8f0d2.png)'
- en: The overview shows a high-level summary of model performance.
id: totrans-54
prefs: []
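The views in this section are rendered from a trace recorded with the profiler's TensorBoard handler. A minimal sketch with an illustrative model and log directory:

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile, tensorboard_trace_handler

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
inputs = torch.randn(32, 64)

# on_trace_ready writes a .pt.trace.json file that the TensorBoard plugin
# (torch-tb-profiler) renders; "./log/demo" is an illustrative directory.
with profile(
    activities=[ProfilerActivity.CPU],
    on_trace_ready=tensorboard_trace_handler("./log/demo"),
    record_shapes=True,
) as prof:
    model(inputs)

# The same data backs the Operator view: per-op "Self" and "Total" times.
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
```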
......@@ -369,11 +369,11 @@
prefs: []
type: TYPE_NORMAL
zh: 算子视图(Operator view)显示了在主机或设备上执行的每个 PyTorch 算子的性能。
- en: '[![../_static/img/profiler_operator_view.png](../Images/4fae99315367a1998f977b76a2fc6526.png)](../_static/img/profiler_operator_view.png)'
- en: '![../_static/img/profiler_operator_view.png](../Images/4fae99315367a1998f977b76a2fc6526.png)'
id: totrans-62
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_operator_view.png](../Images/4fae99315367a1998f977b76a2fc6526.png)](../_static/img/profiler_operator_view.png)'
zh: '![../_static/img/profiler_operator_view.png](../Images/4fae99315367a1998f977b76a2fc6526.png)'
- en: The “Self” duration does not include its child operators’ time. The “Total”
duration includes its child operators’ time.
id: totrans-63
......@@ -393,22 +393,22 @@
prefs: []
type: TYPE_NORMAL
zh: 单击某个算子的“查看调用堆栈”,将显示具有相同名称但调用堆栈不同的算子。再单击此子表中的“查看调用堆栈”,将显示调用堆栈帧。
- en: '[![../_static/img/profiler_callstack.png](../Images/0d8e7045d34fb23f544d1fdb71ccb79b.png)](../_static/img/profiler_callstack.png)'
- en: '![../_static/img/profiler_callstack.png](../Images/0d8e7045d34fb23f544d1fdb71ccb79b.png)'
id: totrans-66
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_callstack.png](../Images/0d8e7045d34fb23f544d1fdb71ccb79b.png)](../_static/img/profiler_callstack.png)'
zh: '![../_static/img/profiler_callstack.png](../Images/0d8e7045d34fb23f544d1fdb71ccb79b.png)'
- en: If the TensorBoard is launched inside VS Code ([Launch Guide](https://devblogs.microsoft.com/python/python-in-visual-studio-code-february-2021-release/#tensorboard-integration)),
clicking a call stack frame will navigate to the specific code line.
id: totrans-67
prefs: []
type: TYPE_NORMAL
zh: 如果在VS Code中启动了TensorBoard([启动指南](https://devblogs.microsoft.com/python/python-in-visual-studio-code-february-2021-release/#tensorboard-integration)),单击调用堆栈帧将导航到特定的代码行。
- en: '[![../_static/img/profiler_vscode.png](../Images/75f42648d12a47e893905f678287a967.png)](../_static/img/profiler_vscode.png)'
- en: '![../_static/img/profiler_vscode.png](../Images/75f42648d12a47e893905f678287a967.png)'
id: totrans-68
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_vscode.png](../Images/75f42648d12a47e893905f678287a967.png)](../_static/img/profiler_vscode.png)'
zh: '![../_static/img/profiler_vscode.png](../Images/75f42648d12a47e893905f678287a967.png)'
- en: Kernel view
id: totrans-69
prefs:
......@@ -420,11 +420,11 @@
prefs: []
type: TYPE_NORMAL
zh: GPU内核视图显示GPU上花费的所有内核时间。
- en: '[![../_static/img/profiler_kernel_view.png](../Images/5122dd95514210b1325de9e54574173f.png)](../_static/img/profiler_kernel_view.png)'
- en: '![../_static/img/profiler_kernel_view.png](../Images/5122dd95514210b1325de9e54574173f.png)'
id: totrans-71
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_kernel_view.png](../Images/5122dd95514210b1325de9e54574173f.png)](../_static/img/profiler_kernel_view.png)'
zh: '![../_static/img/profiler_kernel_view.png](../Images/5122dd95514210b1325de9e54574173f.png)'
- en: 'Tensor Cores Used: Whether this kernel uses Tensor Cores.'
id: totrans-72
prefs: []
......@@ -458,11 +458,11 @@
prefs: []
type: TYPE_NORMAL
zh: 跟踪视图显示了被分析的算子和 GPU 内核的时间轴。您可以选中某个事件以查看以下详细信息。
- en: '[![../_static/img/profiler_trace_view1.png](../Images/be1bf500afaf7c10bd7f7f8a30fa1ef9.png)](../_static/img/profiler_trace_view1.png)'
- en: '![../_static/img/profiler_trace_view1.png](../Images/be1bf500afaf7c10bd7f7f8a30fa1ef9.png)'
id: totrans-77
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_trace_view1.png](../Images/be1bf500afaf7c10bd7f7f8a30fa1ef9.png)](../_static/img/profiler_trace_view1.png)'
zh: '![../_static/img/profiler_trace_view1.png](../Images/be1bf500afaf7c10bd7f7f8a30fa1ef9.png)'
- en: You can move the graph and zoom in/out with the help of right side toolbar.
And keyboard can also be used to zoom and move around inside the timeline. The
‘w’ and ‘s’ keys zoom in centered around the mouse, and the ‘a’ and ‘d’ keys move
......@@ -478,11 +478,11 @@
prefs: []
type: TYPE_NORMAL
zh: 如果后向算子的“传入流(Incoming Flow)”字段的值为“forward correspond to backward”,则可以单击该文本以跳转到启动它的前向算子。
- en: '[![../_static/img/profiler_trace_view_fwd_bwd.png](../Images/cb82608044c7382139065f9e79f1a99d.png)](../_static/img/profiler_trace_view_fwd_bwd.png)'
- en: '![../_static/img/profiler_trace_view_fwd_bwd.png](../Images/cb82608044c7382139065f9e79f1a99d.png)'
id: totrans-80
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_trace_view_fwd_bwd.png](../Images/cb82608044c7382139065f9e79f1a99d.png)](../_static/img/profiler_trace_view_fwd_bwd.png)'
zh: '![../_static/img/profiler_trace_view_fwd_bwd.png](../Images/cb82608044c7382139065f9e79f1a99d.png)'
- en: In this example, we can see the event prefixed with `enumerate(DataLoader)`
costs a lot of time. And during most of this period, the GPU is idle. Because
this function is loading data and transforming data on host side, during which
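The usual remedy applied in this tutorial is to let worker processes prefetch batches so host-side loading and transformation overlap with compute; a sketch with an illustrative in-memory dataset (`num_workers=2` is just an example value):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(
    torch.randn(256, 3, 8, 8), torch.randint(0, 10, (256,))
)
# Worker processes load and transform upcoming batches while the current one
# is consumed; pin_memory speeds up host-to-GPU copies when CUDA is in use.
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=2,
    pin_memory=torch.cuda.is_available(),
)
for images, labels in loader:
    pass  # the training step would go here
```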
......@@ -523,22 +523,22 @@
prefs: []
type: TYPE_NORMAL
zh: 然后在左侧的“Runs”下拉列表中选择最近分析的运行。
- en: '[![../_static/img/profiler_overview2.png](../Images/837967744e5997b8debc071b27685596.png)](../_static/img/profiler_overview2.png)'
- en: '![../_static/img/profiler_overview2.png](../Images/837967744e5997b8debc071b27685596.png)'
id: totrans-87
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_overview2.png](../Images/837967744e5997b8debc071b27685596.png)](../_static/img/profiler_overview2.png)'
zh: '![../_static/img/profiler_overview2.png](../Images/837967744e5997b8debc071b27685596.png)'
- en: From the above view, we can find the step time is reduced to about 76ms comparing
with previous run’s 132ms, and the time reduction of `DataLoader` mainly contributes.
id: totrans-88
prefs: []
type: TYPE_NORMAL
zh: 从上述视图中可以看到,步骤时间从之前运行的 132ms 减少到约 76ms,其中主要得益于 `DataLoader` 耗时的减少。
- en: '[![../_static/img/profiler_trace_view2.png](../Images/9126a2827ef47b32d4dd38a1e813505e.png)](../_static/img/profiler_trace_view2.png)'
- en: '![../_static/img/profiler_trace_view2.png](../Images/9126a2827ef47b32d4dd38a1e813505e.png)'
id: totrans-89
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_trace_view2.png](../Images/9126a2827ef47b32d4dd38a1e813505e.png)](../_static/img/profiler_trace_view2.png)'
zh: '![../_static/img/profiler_trace_view2.png](../Images/9126a2827ef47b32d4dd38a1e813505e.png)'
- en: From the above view, we can see that the runtime of `enumerate(DataLoader)`
is reduced, and the GPU utilization is increased.
id: totrans-90
......@@ -579,11 +579,11 @@
prefs: []
type: TYPE_NORMAL
zh: 分析器在分析过程中记录所有内存分配/释放事件和分配器的内部状态。内存视图由以下三个组件组成。
- en: '[![../_static/img/profiler_memory_view.png](../Images/c6251499e3b25e142059d0e53c1c3007.png)](../_static/img/profiler_memory_view.png)'
- en: '![../_static/img/profiler_memory_view.png](../Images/c6251499e3b25e142059d0e53c1c3007.png)'
id: totrans-97
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_memory_view.png](../Images/c6251499e3b25e142059d0e53c1c3007.png)](../_static/img/profiler_memory_view.png)'
zh: '![../_static/img/profiler_memory_view.png](../Images/c6251499e3b25e142059d0e53c1c3007.png)'
- en: The components are memory curve graph, memory events table and memory statistics
table, from top to bottom, respectively.
id: totrans-98
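The memory view is only populated when the profiler is asked to record allocator events; a sketch using a matmul as a stand-in workload:

```python
import torch
from torch.profiler import ProfilerActivity, profile

# profile_memory=True records allocation/release events and the allocator's
# internal state, which is what the memory view visualizes.
with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    a = torch.randn(512, 512)
    b = a @ a

events = prof.key_averages()
# Per-op self CPU memory; aten::empty commonly appears as the allocating op.
total_cpu_mem = sum(e.cpu_memory_usage for e in events)
```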
......@@ -606,11 +606,11 @@
prefs: []
type: TYPE_NORMAL
zh: 内存曲线显示内存消耗的趋势。“已分配”曲线显示实际使用的总内存,例如张量。在PyTorch中,CUDA分配器和一些其他分配器采用了缓存机制。“保留”曲线显示分配器保留的总内存。您可以在图表上左键单击并拖动以选择所需范围内的事件:
- en: '[![../_static/img/profiler_memory_curve_selecting.png](../Images/e9ec73bd94cda9e0afe2f7d66988efb3.png)](../_static/img/profiler_memory_curve_selecting.png)'
- en: '![../_static/img/profiler_memory_curve_selecting.png](../Images/e9ec73bd94cda9e0afe2f7d66988efb3.png)'
id: totrans-101
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_memory_curve_selecting.png](../Images/e9ec73bd94cda9e0afe2f7d66988efb3.png)](../_static/img/profiler_memory_curve_selecting.png)'
zh: '![../_static/img/profiler_memory_curve_selecting.png](../Images/e9ec73bd94cda9e0afe2f7d66988efb3.png)'
- en: After selection, the three components will be updated for the restricted time
range, so that you can gain more information about it. By repeating this process,
you can zoom into a very fine-grained detail. Right click on the graph will reset
......@@ -619,11 +619,11 @@
prefs: []
type: TYPE_NORMAL
zh: 选择后,这三个组件将针对受限时间范围进行更新,以便您可以获取更多信息。通过重复这个过程,您可以深入了解非常细微的细节。右键单击图表将重置图表到初始状态。
- en: '[![../_static/img/profiler_memory_curve_single.png](../Images/b34a9076e55573e9c29e772fd4fc8238.png)](../_static/img/profiler_memory_curve_single.png)'
- en: '![../_static/img/profiler_memory_curve_single.png](../Images/b34a9076e55573e9c29e772fd4fc8238.png)'
id: totrans-103
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_memory_curve_single.png](../Images/b34a9076e55573e9c29e772fd4fc8238.png)](../_static/img/profiler_memory_curve_single.png)'
zh: '![../_static/img/profiler_memory_curve_single.png](../Images/b34a9076e55573e9c29e772fd4fc8238.png)'
- en: In the memory events table, the allocation and release events are paired into
one entry. The “operator” column shows the immediate ATen operator that is causing
the allocation. Notice that in PyTorch, ATen operators commonly use `aten::empty`
......@@ -672,11 +672,11 @@
prefs: []
type: TYPE_PRE
zh: '[PRE12]'
- en: '[![../_static/img/profiler_distributed_view.png](../Images/bc5ec09af445c3714c07c9bc3c7fb515.png)](../_static/img/profiler_distributed_view.png)'
- en: '![../_static/img/profiler_distributed_view.png](../Images/bc5ec09af445c3714c07c9bc3c7fb515.png)'
id: totrans-110
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_distributed_view.png](../Images/bc5ec09af445c3714c07c9bc3c7fb515.png)](../_static/img/profiler_distributed_view.png)'
zh: '![../_static/img/profiler_distributed_view.png](../Images/bc5ec09af445c3714c07c9bc3c7fb515.png)'
- en: The “Computation/Communication Overview” shows computation/communication ratio
and their overlapping degree. From this view, User can figure out load balance
issue among workers. For example, if the computation + overlapping time of one
......@@ -810,11 +810,11 @@
prefs: []
type: TYPE_NORMAL
zh: 选择不同的视图,如**步骤4**中所述。例如,下面是**算子**视图:
- en: '[![../_static/img/profiler_rocm_tensorboard_operartor_view.png](../Images/766def45c853a562ade085a166bc7a98.png)](../_static/img/profiler_rocm_tensorboard_operartor_view.png)'
- en: '![../_static/img/profiler_rocm_tensorboard_operartor_view.png](../Images/766def45c853a562ade085a166bc7a98.png)'
id: totrans-133
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_rocm_tensorboard_operartor_view.png](../Images/766def45c853a562ade085a166bc7a98.png)](../_static/img/profiler_rocm_tensorboard_operartor_view.png)'
zh: '![../_static/img/profiler_rocm_tensorboard_operartor_view.png](../Images/766def45c853a562ade085a166bc7a98.png)'
- en: At the time this section is written, **Trace** view does not work and it displays
nothing. You can work around by typing `chrome://tracing` in your Chrome Browser.
id: totrans-134
......@@ -841,11 +841,11 @@
- PREF_UL
type: TYPE_NORMAL
zh: 点击**加载**按钮,从浏览器中的`chrome://tracing`页面加载跟踪JSON文件。
- en: '[![../_static/img/profiler_rocm_chrome_trace_view.png](../Images/576f0fdbe384c09bd227cc973cbf6ecd.png)](../_static/img/profiler_rocm_chrome_trace_view.png)'
- en: '![../_static/img/profiler_rocm_chrome_trace_view.png](../Images/576f0fdbe384c09bd227cc973cbf6ecd.png)'
id: totrans-138
prefs: []
type: TYPE_NORMAL
zh: '[![../_static/img/profiler_rocm_chrome_trace_view.png](../Images/576f0fdbe384c09bd227cc973cbf6ecd.png)](../_static/img/profiler_rocm_chrome_trace_view.png)'
zh: '![../_static/img/profiler_rocm_chrome_trace_view.png](../Images/576f0fdbe384c09bd227cc973cbf6ecd.png)'
- en: As mentioned previously, you can move the graph and zoom in and out. You can
also use keyboard to zoom and move around inside the timeline. The `w` and `s`
keys zoom in centered around the mouse, and the `a` and `d` keys move the timeline
......
......@@ -35,11 +35,11 @@
type: TYPE_NORMAL
zh: 在本教程中,我们将通过[英特尔® PyTorch*扩展启动器](https://github.com/intel/intel-extension-for-pytorch/blob/master/docs/tutorials/performance_tuning/launch_script.md)演示如何通过内存分配器提高性能,并通过[英特尔®
PyTorch*扩展](https://github.com/intel/intel-extension-for-pytorch)在CPU上优化内核,并将它们应用于TorchServe,展示ResNet50的吞吐量提升了7.71倍,BERT的吞吐量提升了2.20倍。
- en: '[![../_images/1.png](../Images/74cc44a62474337c4fc6d0bc99098db9.png)](../_images/1.png)'
- en: '![../_images/1.png](../Images/74cc44a62474337c4fc6d0bc99098db9.png)'
id: totrans-5
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/1.png](../Images/74cc44a62474337c4fc6d0bc99098db9.png)](../_images/1.png)'
zh: '![../_images/1.png](../Images/74cc44a62474337c4fc6d0bc99098db9.png)'
- en: Prerequisites
id: totrans-6
prefs:
......@@ -81,11 +81,11 @@
prefs: []
type: TYPE_NORMAL
zh: 在调整CPU以获得最佳性能时,了解瓶颈所在是很有用的。大多数CPU核心都有芯片上的性能监控单元(PMUs)。PMUs是CPU核心内的专用逻辑单元,用于计算系统上发生的特定硬件事件。这些事件的示例可能是缓存未命中或分支误预测。PMUs用于自顶向下的微体系结构分析(TMA)以识别瓶颈。TMA包括如下层次结构:
- en: '[![../_images/26.png](../Images/cd7487204a4dc972818b86076a766477.png)](../_images/26.png)'
- en: '![../_images/26.png](../Images/cd7487204a4dc972818b86076a766477.png)'
id: totrans-11
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/26.png](../Images/cd7487204a4dc972818b86076a766477.png)](../_images/26.png)'
zh: '![../_images/26.png](../Images/cd7487204a4dc972818b86076a766477.png)'
- en: 'The top level, level-1, metrics collect *Retiring*, *Bad Speculation*, *Front
End Bound*, *Back End Bound*. The pipeline of CPU can conceptually be simplified
and divided into two: the frontend and the backend. The *frontend* is responsible
......@@ -280,11 +280,11 @@
prefs: []
type: TYPE_NORMAL
zh: 让我们收集一级 TMA 指标。
- en: '[![../_images/32.png](../Images/92b08c13c370b320e540a278fa5c05a3.png)](../_images/32.png)'
- en: '![../_images/32.png](../Images/92b08c13c370b320e540a278fa5c05a3.png)'
id: totrans-37
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/32.png](../Images/92b08c13c370b320e540a278fa5c05a3.png)](../_images/32.png)'
zh: '![../_images/32.png](../Images/92b08c13c370b320e540a278fa5c05a3.png)'
- en: Level-1 TMA shows that both PTMalloc and JeMalloc are bounded by the backend.
More than half of the execution time was stalled by the backend. Let’s go one
level deeper.
......@@ -292,22 +292,22 @@
prefs: []
type: TYPE_NORMAL
zh: 一级 TMA 显示 PTMalloc 和 JeMalloc 都受后端限制。超过一半的执行时间被后端阻塞。让我们再深入一层。
- en: '[![../_images/41.png](../Images/4c4e1618b6431d1689b0e0c84efef1d3.png)](../_images/41.png)'
- en: '![../_images/41.png](../Images/4c4e1618b6431d1689b0e0c84efef1d3.png)'
id: totrans-39
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/41.png](../Images/4c4e1618b6431d1689b0e0c84efef1d3.png)](../_images/41.png)'
zh: '![../_images/41.png](../Images/4c4e1618b6431d1689b0e0c84efef1d3.png)'
- en: Level-2 TMA shows that the Back End Bound was caused by Memory Bound. Let’s
go one level deeper.
id: totrans-40
prefs: []
type: TYPE_NORMAL
zh: 二级 TMA 显示后端受限是由内存受限引起的。让我们再深入一层。
- en: '[![../_images/51.png](../Images/20c7abc0c93bd57ad7a3dd3655c2f294.png)](../_images/51.png)'
- en: '![../_images/51.png](../Images/20c7abc0c93bd57ad7a3dd3655c2f294.png)'
id: totrans-41
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/51.png](../Images/20c7abc0c93bd57ad7a3dd3655c2f294.png)](../_images/51.png)'
zh: '![../_images/51.png](../Images/20c7abc0c93bd57ad7a3dd3655c2f294.png)'
- en: Most of the metrics under the Memory Bound identify which level of the memory
hierarchy from the L1 cache to main memory is the bottleneck. A hotspot bounded
at a given level indicates that most of the data was being retrieved from that
......@@ -326,11 +326,11 @@
prefs: []
type: TYPE_NORMAL
zh: 让我们看看 Intel® VTune Profiler ITT 跟踪。在示例脚本中,我们已经注释了推理循环的每个 *step_x*。
- en: '[![../_images/61.png](../Images/ecb098b61b1ff2137e4d3a18ab854273.png)](../_images/61.png)'
- en: '![../_images/61.png](../Images/ecb098b61b1ff2137e4d3a18ab854273.png)'
id: totrans-44
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/61.png](../Images/ecb098b61b1ff2137e4d3a18ab854273.png)](../_images/61.png)'
zh: '![../_images/61.png](../Images/ecb098b61b1ff2137e4d3a18ab854273.png)'
- en: Each step is traced in the timeline graph. The duration of model inference on
the last step (step_99) decreased from 304.308 ms to 261.843 ms.
id: totrans-45
......@@ -390,21 +390,21 @@
prefs: []
type: TYPE_NORMAL
zh: 让我们收集一级 TMA 指标。
- en: '[![../_images/71.png](../Images/5f87dd66115567eb48eaa61c41e17b25.png)](../_images/71.png)'
- en: '![../_images/71.png](../Images/5f87dd66115567eb48eaa61c41e17b25.png)'
id: totrans-55
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/71.png](../Images/5f87dd66115567eb48eaa61c41e17b25.png)](../_images/71.png)'
zh: '![../_images/71.png](../Images/5f87dd66115567eb48eaa61c41e17b25.png)'
- en: Let’s go one level deeper.
id: totrans-56
prefs: []
type: TYPE_NORMAL
zh: 让我们再深入一层。
- en: '[![../_images/81.png](../Images/23960c03e87ed20b8dc0954e8d420e20.png)](../_images/81.png)'
- en: '![../_images/81.png](../Images/23960c03e87ed20b8dc0954e8d420e20.png)'
id: totrans-57
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/81.png](../Images/23960c03e87ed20b8dc0954e8d420e20.png)](../_images/81.png)'
zh: '![../_images/81.png](../Images/23960c03e87ed20b8dc0954e8d420e20.png)'
- en: Let’s use Intel® VTune Profiler ITT to annotate [TorchServe inference scope](https://github.com/pytorch/serve/blob/master/ts/torch_handler/base_handler.py#L188)
to profile at inference-level granularity. As [TorchServe Architecture](https://github.com/pytorch/serve/blob/master/docs/internals.md#torchserve-architecture)
consists of several sub-components, including the Java frontend for handling request/response,
......@@ -417,22 +417,22 @@
进行注释,以便以推理级别的粒度进行分析。由于 [TorchServe 架构](https://github.com/pytorch/serve/blob/master/docs/internals.md#torchserve-architecture)
包括几个子组件,包括用于处理请求/响应的 Java 前端和用于在模型上运行实际推理的 Python 后端,因此使用 Intel® VTune Profiler
ITT 限制在推理级别收集跟踪数据是有帮助的。
- en: '[![../_images/9.png](../Images/503e0ca767cc22bc9c9fcad2ebb78311.png)](../_images/9.png)'
- en: '![../_images/9.png](../Images/503e0ca767cc22bc9c9fcad2ebb78311.png)'
id: totrans-59
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/9.png](../Images/503e0ca767cc22bc9c9fcad2ebb78311.png)](../_images/9.png)'
zh: '![../_images/9.png](../Images/503e0ca767cc22bc9c9fcad2ebb78311.png)'
- en: Each inference call is traced in the timeline graph. The duration of the last
model inference decreased from 561.688 ms to 251.287 ms - 2.2x speedup.
id: totrans-60
prefs: []
type: TYPE_NORMAL
zh: 每个推理调用都在时间线图中被跟踪。最后一次模型推理的持续时间从 561.688 毫秒减少到 251.287 毫秒,加速 2.2 倍。
- en: '[![../_images/101.png](../Images/b028bfe554248a98ae3e2a0d6250a5f4.png)](../_images/101.png)'
- en: '![../_images/101.png](../Images/b028bfe554248a98ae3e2a0d6250a5f4.png)'
id: totrans-61
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/101.png](../Images/b028bfe554248a98ae3e2a0d6250a5f4.png)](../_images/101.png)'
zh: '![../_images/101.png](../Images/b028bfe554248a98ae3e2a0d6250a5f4.png)'
- en: The timeline graph can be expanded to see op-level profiling results. The duration
of *aten::conv2d* decreased from 16.401 ms to 6.392 ms - 2.6x speedup.
id: totrans-62
......@@ -594,21 +594,21 @@
prefs: []
type: TYPE_NORMAL
zh: 该模型由两个操作组成——Conv2d和ReLU。通过打印模型对象,我们得到以下输出。
- en: '[![../_images/11.png](../Images/80104f40ec5b9cc463c39342ea6908a7.png)](../_images/11.png)'
- en: '![../_images/11.png](../Images/80104f40ec5b9cc463c39342ea6908a7.png)'
id: totrans-89
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/11.png](../Images/80104f40ec5b9cc463c39342ea6908a7.png)](../_images/11.png)'
zh: '![../_images/11.png](../Images/80104f40ec5b9cc463c39342ea6908a7.png)'
- en: Let’s collect level-1 TMA metrics.
id: totrans-90
prefs: []
type: TYPE_NORMAL
zh: 让我们收集一级TMA指标。
- en: '[![../_images/121.png](../Images/81c6ba8b688066ce19f4d4a274996485.png)](../_images/121.png)'
- en: '![../_images/121.png](../Images/81c6ba8b688066ce19f4d4a274996485.png)'
id: totrans-91
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/121.png](../Images/81c6ba8b688066ce19f4d4a274996485.png)](../_images/121.png)'
zh: '![../_images/121.png](../Images/81c6ba8b688066ce19f4d4a274996485.png)'
- en: Notice the Back End Bound reduced from 68.9 to 38.5 – 1.8x speedup.
id: totrans-92
prefs: []
......@@ -619,11 +619,11 @@
prefs: []
type: TYPE_NORMAL
zh: 此外,让我们使用PyTorch Profiler进行性能分析。
- en: '[![../_images/131.png](../Images/42c2d372d466f39f3cd12d8c9260c4d1.png)](../_images/131.png)'
- en: '![../_images/131.png](../Images/42c2d372d466f39f3cd12d8c9260c4d1.png)'
id: totrans-94
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/131.png](../Images/42c2d372d466f39f3cd12d8c9260c4d1.png)](../_images/131.png)'
zh: '![../_images/131.png](../Images/42c2d372d466f39f3cd12d8c9260c4d1.png)'
- en: Notice the CPU time reduced from 851 us to 310 us – 2.7X speedup.
id: totrans-95
prefs: []
......@@ -679,11 +679,11 @@
prefs: []
type: TYPE_NORMAL
zh: 让我们收集一级TMA指标。
- en: '[![../_images/141.png](../Images/6d467b7f5180ed2749ec54aa4196fab0.png)](../_images/141.png)'
- en: '![../_images/141.png](../Images/6d467b7f5180ed2749ec54aa4196fab0.png)'
id: totrans-103
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/141.png](../Images/6d467b7f5180ed2749ec54aa4196fab0.png)](../_images/141.png)'
zh: '![../_images/141.png](../Images/6d467b7f5180ed2749ec54aa4196fab0.png)'
- en: Notice the Back End Bound reduced from 67.1 to 37.5 – 1.8x speedup.
id: totrans-104
prefs: []
......@@ -694,11 +694,11 @@
prefs: []
type: TYPE_NORMAL
zh: 此外,让我们使用PyTorch Profiler进行性能分析。
- en: '[![../_images/151.png](../Images/4d256b81c69edc40a7d9551d270c1b48.png)](../_images/151.png)'
- en: '![../_images/151.png](../Images/4d256b81c69edc40a7d9551d270c1b48.png)'
id: totrans-106
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/151.png](../Images/4d256b81c69edc40a7d9551d270c1b48.png)](../_images/151.png)'
zh: '![../_images/151.png](../Images/4d256b81c69edc40a7d9551d270c1b48.png)'
- en: 'Notice that with Intel® Extension for PyTorch* Conv + ReLU operators are fused,
and the CPU time reduced from 803 us to 248 us – 3.2X speedup. The oneDNN eltwise
post-op enables fusing a primitive with an elementwise primitive. This is one
......@@ -765,22 +765,22 @@
prefs: []
type: TYPE_NORMAL
zh: 我们将使用[oneDNN详细模式](https://oneapi-src.github.io/oneDNN/dev_guide_verbose.html),这是一个帮助收集有关oneDNN图级别信息的工具,例如操作融合、执行oneDNN原语所花费的内核执行时间。有关更多信息,请参考[oneDNN文档](https://oneapi-src.github.io/oneDNN/index.html)。
- en: '[![../_images/161.png](../Images/52018fb59ebe653af37a7e977764e699.png)](../_images/161.png)[![../_images/171.png](../Images/466f4685aff041f4d0f735aaae4593d5.png)](../_images/171.png)'
- en: '![../_images/161.png](../Images/52018fb59ebe653af37a7e977764e699.png)![../_images/171.png](../Images/466f4685aff041f4d0f735aaae4593d5.png)'
id: totrans-115
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/161.png](../Images/52018fb59ebe653af37a7e977764e699.png)](../_images/161.png)[![../_images/171.png](../Images/466f4685aff041f4d0f735aaae4593d5.png)](../_images/171.png)'
zh: '![../_images/161.png](../Images/52018fb59ebe653af37a7e977764e699.png)![../_images/171.png](../Images/466f4685aff041f4d0f735aaae4593d5.png)'
- en: Above is the oneDNN verbose output from the channels-first run. We can verify
that the weights and data are reordered, the computation is performed, and the output
is finally reordered back.
id: totrans-116
prefs: []
type: TYPE_NORMAL
zh: 以上是来自通道首的oneDNN详细信息。我们可以验证从权重和数据进行重新排序,然后进行计算,最后将输出重新排序。
- en: '[![../_images/181.png](../Images/37fca356cd849c6f51a0b2d155565535.png)](../_images/181.png)'
- en: '![../_images/181.png](../Images/37fca356cd849c6f51a0b2d155565535.png)'
id: totrans-117
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/181.png](../Images/37fca356cd849c6f51a0b2d155565535.png)](../_images/181.png)'
zh: '![../_images/181.png](../Images/37fca356cd849c6f51a0b2d155565535.png)'
- en: Above is the oneDNN verbose output from the channels-last run. We can verify
that the channels-last memory format avoids unnecessary reorders.
id: totrans-118
......@@ -799,11 +799,11 @@
prefs: []
type: TYPE_NORMAL
zh: 以下总结了TorchServe与Intel® Extension for PyTorch*在ResNet50和BERT-base-uncased上的性能提升。
- en: '[![../_images/191.png](../Images/70692f21eb61fe41d0b7a67d8ae8a54d.png)](../_images/191.png)'
- en: '![../_images/191.png](../Images/70692f21eb61fe41d0b7a67d8ae8a54d.png)'
id: totrans-121
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/191.png](../Images/70692f21eb61fe41d0b7a67d8ae8a54d.png)](../_images/191.png)'
zh: '![../_images/191.png](../Images/70692f21eb61fe41d0b7a67d8ae8a54d.png)'
- en: Exercise with TorchServe
id: totrans-122
prefs:
......@@ -839,11 +839,11 @@
prefs: []
type: TYPE_NORMAL
zh: 让我们收集一级TMA指标。
- en: '[![../_images/20.png](../Images/90ff6d2c5ef9b558de384a1a802a7781.png)](../_images/20.png)'
- en: '![../_images/20.png](../Images/90ff6d2c5ef9b558de384a1a802a7781.png)'
id: totrans-128
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/20.png](../Images/90ff6d2c5ef9b558de384a1a802a7781.png)](../_images/20.png)'
zh: '![../_images/20.png](../Images/90ff6d2c5ef9b558de384a1a802a7781.png)'
- en: Level-1 TMA shows that both are bounded by the backend. As discussed earlier,
the majority of untuned deep learning workloads will be Back End Bound. Notice
the Back End Bound reduced from 70.0 to 54.1. Let’s go one level deeper.
......@@ -851,11 +851,11 @@
prefs: []
type: TYPE_NORMAL
zh: Level-1 TMA 显示两者都受到后端的限制。正如之前讨论的,大多数未调整的深度学习工作负载将受到后端的限制。注意后端限制从70.0降至54.1。让我们再深入一层。
- en: '[![../_images/211.png](../Images/7986245e21288e28a0d187026c929c7d.png)](../_images/211.png)'
- en: '![../_images/211.png](../Images/7986245e21288e28a0d187026c929c7d.png)'
id: totrans-130
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/211.png](../Images/7986245e21288e28a0d187026c929c7d.png)](../_images/211.png)'
zh: '![../_images/211.png](../Images/7986245e21288e28a0d187026c929c7d.png)'
- en: As discussed earlier, Back End Bound has two submetrics – Memory Bound and Core
Bound. Memory Bound indicates the workload is under-optimized or under-utilized,
and ideally memory-bound operations can be improved to core-bound by optimizing
......@@ -866,11 +866,11 @@
type: TYPE_NORMAL
zh: 如前所述,后端绑定有两个子指标 - 内存绑定和核心绑定。内存绑定表示工作负载未经优化或未充分利用,理想情况下,内存绑定操作可以通过优化OPs和改善缓存局部性来改善为核心绑定。Level-2
TMA显示后端绑定从内存绑定改善为核心绑定。让我们深入一层。
- en: '[![../_images/221.png](../Images/694c98433e22e56db7436b3cc54c153a.png)](../_images/221.png)'
- en: '![../_images/221.png](../Images/694c98433e22e56db7436b3cc54c153a.png)'
id: totrans-132
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/221.png](../Images/694c98433e22e56db7436b3cc54c153a.png)](../_images/221.png)'
zh: '![../_images/221.png](../Images/694c98433e22e56db7436b3cc54c153a.png)'
- en: Scaling deep learning models for production on a model serving framework like
TorchServe requires high compute utilization. This requires that data is available
through prefetching and reusing the data in cache when the execution units need
......@@ -888,22 +888,22 @@
prefs: []
type: TYPE_NORMAL
zh: 与TorchServe之前的练习一样,让我们使用Intel® VTune Profiler ITT来注释[TorchServe推断范围](https://github.com/pytorch/serve/blob/master/ts/torch_handler/base_handler.py#L188),以便以推断级别的粒度进行分析。
- en: '[![../_images/231.png](../Images/7bac8eb911646b9101914663163d1958.png)](../_images/231.png)'
- en: '![../_images/231.png](../Images/7bac8eb911646b9101914663163d1958.png)'
id: totrans-135
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/231.png](../Images/7bac8eb911646b9101914663163d1958.png)](../_images/231.png)'
zh: '![../_images/231.png](../Images/7bac8eb911646b9101914663163d1958.png)'
- en: Each inference call is traced in the timeline graph. The duration of the last
inference call decreased from 215.731 ms to 95.634 ms - 2.3x speedup.
id: totrans-136
prefs: []
type: TYPE_NORMAL
zh: 时间轴图中跟踪了每个推断调用。最后一个推断调用的持续时间从215.731毫秒减少到95.634毫秒 - 2.3倍加速。
- en: '[![../_images/241.png](../Images/6e43f3a1c63c0ea0223edb29e584267d.png)](../_images/241.png)'
- en: '![../_images/241.png](../Images/6e43f3a1c63c0ea0223edb29e584267d.png)'
id: totrans-137
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/241.png](../Images/6e43f3a1c63c0ea0223edb29e584267d.png)](../_images/241.png)'
zh: '![../_images/241.png](../Images/6e43f3a1c63c0ea0223edb29e584267d.png)'
- en: The timeline graph can be expanded to see op-level profiling results. Notice
that Conv + ReLU has been fused, and the duration decreased from 6.393 ms + 1.731
ms to 3.408 ms - 2.4x speedup.
......
......@@ -477,14 +477,16 @@
prefs: []
type: TYPE_NORMAL
zh: 随意调整控制softmax函数软度和损失系数的温度参数。在神经网络中,很容易包含额外的损失函数到主要目标中,以实现更好的泛化。让我们尝试为学生包含一个目标,但现在让我们专注于他们的隐藏状态而不是输出层。我们的目标是通过包含一个天真的损失函数,使得随着损失的减少,传递给分类器的后续展平向量变得更加“相似”,从而将信息从教师的表示传达给学生。当然,教师不会更新其权重,因此最小化仅取决于学生的权重。这种方法背后的理念是,我们假设教师模型具有更好的内部表示,学生不太可能在没有外部干预的情况下实现,因此我们人为地推动学生模仿教师的内部表示。这是否最终会帮助学生并不明显,因为推动轻量级网络达到这一点可能是一件好事,假设我们已经找到了导致更好测试准确性的内部表示,但也可能是有害的,因为网络具有不同的架构,学生没有与教师相同的学习能力。换句话说,没有理由要求这两个向量,学生的和教师的,每个分量都匹配。学生可能达到教师的一个排列的内部表示,这样同样有效。尽管如此,我们仍然可以运行一个快速实验来了解这种方法的影响。我们将使用`CosineEmbeddingLoss`,其公式如下:
- en: '[![../_static/img/knowledge_distillation/cosine_embedding_loss.png](../Images/cdd423a58df099c1510863f187b76089.png)](../_static/img/knowledge_distillation/cosine_embedding_loss.png)'
- en: '![../_static/img/knowledge_distillation/cosine_embedding_loss.png](../Images/cdd423a58df099c1510863f187b76089.png)'
id: totrans-71
prefs: []
type: TYPE_NORMAL
zh: '![../_static/img/knowledge_distillation/cosine_embedding_loss.png](../Images/cdd423a58df099c1510863f187b76089.png)'
- en: Formula for CosineEmbeddingLoss
id: totrans-72
prefs: []
type: TYPE_NORMAL
zh: CosineEmbeddingLoss的公式
- en: Obviously, there is one thing that we need to resolve first. When we applied
distillation to the output layer we mentioned that both networks have the same
number of neurons, equal to the number of classes. However, this is not the case
......@@ -497,6 +499,7 @@
id: totrans-73
prefs: []
type: TYPE_NORMAL
zh: 显然,我们首先需要解决一件事情。当我们将蒸馏应用于输出层时,我们提到两个网络具有相同数量的神经元,等于类的数量。然而,在跟随我们的卷积层之后的层中并非如此。在这里,老师在最终卷积层展平后拥有比学生更多的神经元。我们的损失函数接受两个相同维度的向量作为输入,因此我们需要以某种方式将它们匹配。我们将通过在老师的卷积层后包含一个平均池化层来解决这个问题,以减少其维度以匹配学生的维度。
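A minimal sketch of this dimension matching plus the cosine objective (all tensor sizes here are illustrative assumptions, not taken from the tutorial's models):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: the teacher's flattened hidden state (2048) is larger
# than the student's (1024); both values are illustrative.
teacher_feats = torch.randn(8, 2048)
student_feats = torch.randn(8, 1024)

# Average pooling shrinks the teacher's vector to the student's dimension.
pool = nn.AvgPool1d(kernel_size=2, stride=2)
teacher_pooled = pool(teacher_feats.unsqueeze(1)).squeeze(1)  # shape (8, 1024)

# target = 1 asks the loss to drive the cosine similarity of each pair to 1.
cosine_loss = nn.CosineEmbeddingLoss()
target = torch.ones(teacher_feats.size(0))
# detach() ensures the minimization depends only on the student's weights.
loss = cosine_loss(student_feats, teacher_pooled.detach(), target)
```

In the real training loop the two feature tensors would come from the networks' forward passes rather than `randn`.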
- en: To proceed, we will modify our model classes, or create new ones. Now, the forward
function returns not only the logits of the network but also the flattened hidden
representation after the convolutional layer. We include the aforementioned pooling
......
- en: Parallel and Distributed Training
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 并行和分布式训练
- en: Distributed and Parallel Training Tutorials
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 分布式和并行训练教程
- en: 原文:[https://pytorch.org/tutorials/distributed/home.html](https://pytorch.org/tutorials/distributed/home.html)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
zh: 原文:[https://pytorch.org/tutorials/distributed/home.html](https://pytorch.org/tutorials/distributed/home.html)
- en: Distributed training is a model training paradigm that involves spreading training
workload across multiple worker nodes, therefore significantly improving the speed
of training and model accuracy. While distributed training can be used for any
type of ML model training, it is most beneficial to use it for large models and
compute demanding tasks as deep learning.
id: totrans-2
prefs: []
type: TYPE_NORMAL
zh: 分布式训练是一种模型训练范式,涉及将训练工作负载分布到多个工作节点,从而显著提高训练速度和模型准确性。虽然分布式训练可用于任何类型的ML模型训练,但对于大型模型和计算密集型任务(如深度学习)使用它最为有益。
- en: 'There are a few ways you can perform distributed training in PyTorch with each
method having their advantages in certain use cases:'
id: totrans-3
prefs: []
type: TYPE_NORMAL
zh: 在PyTorch中有几种方法可以进行分布式训练,每种方法在特定用例中都有其优势:
- en: '[DistributedDataParallel (DDP)](#learn-ddp)'
id: totrans-4
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[DistributedDataParallel (DDP)](#learn-ddp)'
- en: '[Fully Sharded Data Parallel (FSDP)](#learn-fsdp)'
id: totrans-5
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[完全分片数据并行(FSDP)](#learn-fsdp)'
- en: '[Device Mesh](#device-mesh)'
id: totrans-6
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[设备网格](#device-mesh)'
- en: '[Remote Procedure Call (RPC) distributed training](#learn-rpc)'
id: totrans-7
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[远程过程调用(RPC)分布式训练](#learn-rpc)'
- en: '[Custom Extensions](#custom-extensions)'
id: totrans-8
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[自定义扩展](#custom-extensions)'
- en: Read more about these options in [Distributed Overview](../beginner/dist_overview.html).
id: totrans-9
prefs: []
type: TYPE_NORMAL
zh: 在[分布式概述](../beginner/dist_overview.html)中了解更多关于这些选项的信息。
- en: '## Learn DDP'
id: totrans-10
prefs: []
type: TYPE_NORMAL
zh: '## 学习DDP'
- en: DDP Intro Video Tutorials
id: totrans-11
prefs: []
type: TYPE_NORMAL
zh: DDP简介视频教程
- en: A step-by-step video series on how to get started with DistributedDataParallel
and advance to more complex topics
id: totrans-12
prefs: []
type: TYPE_NORMAL
zh: 一系列逐步视频教程,介绍如何开始使用DistributedDataParallel,并逐步深入更复杂的主题
- en: Code Video
id: totrans-13
prefs: []
type: TYPE_NORMAL
zh: 代码视频
- en: Getting Started with Distributed Data Parallel
id: totrans-14
prefs: []
type: TYPE_NORMAL
zh: 开始使用分布式数据并行处理
- en: This tutorial provides a short and gentle intro to PyTorch DistributedDataParallel.
id: totrans-15
prefs: []
type: TYPE_NORMAL
zh: 本教程为PyTorch DistributedData Parallel提供了简短而温和的介绍。
- en: Code
id: totrans-16
prefs: []
type: TYPE_NORMAL
zh: 代码
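The linked code covers the full tutorial; as a hedged, single-process sketch of the core pattern (world_size=1 on the gloo/CPU backend, with a placeholder address and port; real jobs launch one process per GPU), DDP wraps the module and synchronizes gradients during `backward()`:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process stand-in for a multi-process launch; address/port are placeholders.
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Linear(10, 10)
ddp_model = DDP(model)  # gradients are all-reduced across ranks in backward()

optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
optimizer.zero_grad()
ddp_model(torch.randn(20, 10)).sum().backward()
optimizer.step()

dist.destroy_process_group()
```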
- en: Distributed Training with Uneven Inputs Using the Join Context Manager
id: totrans-17
prefs: []
type: TYPE_NORMAL
zh: 使用Join上下文管理器进行不均匀输入的分布式训练
- en: This tutorial describes the Join context manager and demonstrates its use with
DistributedDataParallel.
id: totrans-18
prefs: []
type: TYPE_NORMAL
zh: 本教程描述了Join上下文管理器,并演示了如何与DistributedData Parallel一起使用。
- en: 'Code ## Learn FSDP'
id: totrans-19
prefs: []
type: TYPE_NORMAL
zh: '代码 ## 学习FSDP'
- en: Getting Started with FSDP
id: totrans-20
prefs: []
type: TYPE_NORMAL
zh: 开始使用FSDP
- en: This tutorial demonstrates how you can perform distributed training with FSDP
on a MNIST dataset.
id: totrans-21
prefs: []
type: TYPE_NORMAL
zh: 本教程演示了如何在MNIST数据集上使用FSDP进行分布式训练。
- en: Code
id: totrans-22
prefs: []
type: TYPE_NORMAL
zh: 代码
- en: FSDP Advanced
id: totrans-23
prefs: []
type: TYPE_NORMAL
zh: FSDP 高级
- en: In this tutorial, you will learn how to fine-tune a HuggingFace (HF) T5 model
with FSDP for text summarization.
id: totrans-24
prefs: []
type: TYPE_NORMAL
zh: 在本教程中,您将学习如何使用FSDP对HuggingFace(HF)T5模型进行微调,用于文本摘要。
- en: 'Code ## Learn DeviceMesh'
id: totrans-25
prefs: []
type: TYPE_NORMAL
zh: '代码 ## 学习DeviceMesh'
- en: Getting Started with DeviceMesh
id: totrans-26
prefs: []
type: TYPE_NORMAL
zh: 开始使用DeviceMesh
- en: In this tutorial you will learn about DeviceMesh and how it can help with distributed
training.
id: totrans-27
prefs: []
type: TYPE_NORMAL
zh: 在本教程中,您将了解DeviceMesh以及它如何帮助进行分布式训练。
- en: 'Code ## Learn RPC'
id: totrans-28
prefs: []
type: TYPE_NORMAL
zh: '代码 ## 学习RPC'
- en: Getting Started with Distributed RPC Framework
id: totrans-29
prefs: []
type: TYPE_NORMAL
zh: 开始使用分布式RPC框架
- en: This tutorial demonstrates how to get started with RPC-based distributed training.
id: totrans-30
prefs: []
type: TYPE_NORMAL
zh: 本教程演示了如何开始使用基于RPC的分布式训练。
- en: Code
id: totrans-31
prefs: []
type: TYPE_NORMAL
zh: 代码
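As a hedged sketch of the RPC basics (a single process acting as its own peer; the worker name and port are illustrative):

```python
import os
import torch
import torch.distributed.rpc as rpc

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29501"  # placeholder port

# world_size=1: the worker sends an RPC to itself. Real setups call
# init_rpc on every process with a distinct rank.
rpc.init_rpc("worker0", rank=0, world_size=1)

# Run torch.add on the (remote) worker and block for the result.
ret = rpc.rpc_sync("worker0", torch.add, args=(torch.ones(2), torch.ones(2)))

rpc.shutdown()
```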
- en: Implementing a Parameter Server Using Distributed RPC Framework
id: totrans-32
prefs: []
type: TYPE_NORMAL
zh: 使用分布式RPC框架实现参数服务器
- en: This tutorial walks you through a simple example of implementing a parameter
server using PyTorch’s Distributed RPC framework.
id: totrans-33
prefs: []
type: TYPE_NORMAL
zh: 本教程将带您完成一个简单的示例,使用PyTorch的分布式RPC框架实现参数服务器。
- en: Code
id: totrans-34
prefs: []
type: TYPE_NORMAL
zh: 代码
- en: Implementing Batch RPC Processing Using Asynchronous Executions
id: totrans-35
prefs: []
type: TYPE_NORMAL
zh: 使用异步执行实现批处理RPC处理
- en: In this tutorial you will build batch-processing RPC applications with the @rpc.functions.async_execution
decorator.
id: totrans-36
prefs: []
type: TYPE_NORMAL
zh: 在本教程中,您将使用@rpc.functions.async_execution装饰器构建批处理RPC应用程序。
- en: Code
id: totrans-37
prefs: []
type: TYPE_NORMAL
zh: 代码
- en: Combining Distributed DataParallel with Distributed RPC Framework
id: totrans-38
prefs: []
type: TYPE_NORMAL
zh: 将分布式DataParallel与分布式RPC框架结合
- en: In this tutorial you will learn how to combine distributed data parallelism
with distributed model parallelism.
id: totrans-39
prefs: []
type: TYPE_NORMAL
zh: 在本教程中,您将学习如何将分布式数据并行性与分布式模型并行性结合起来。
- en: 'Code ## Custom Extensions'
id: totrans-40
prefs: []
type: TYPE_NORMAL
zh: '代码 ## 自定义扩展'
- en: Customize Process Group Backends Using Cpp Extensions
id: totrans-41
prefs: []
type: TYPE_NORMAL
zh: 使用Cpp扩展自定义Process Group后端
- en: In this tutorial you will learn to implement a custom ProcessGroup backend and
plug that into PyTorch distributed package using cpp extensions.
id: totrans-42
prefs: []
type: TYPE_NORMAL
zh: 在本教程中,您将学习如何实现自定义的ProcessGroup后端,并将其插入到PyTorch分布式包中使用cpp扩展。
- en: Code
id: totrans-43
prefs: []
type: TYPE_NORMAL
zh: 代码
- en: PyTorch Distributed Overview
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
- en: 原文:[https://pytorch.org/tutorials/beginner/dist_overview.html](https://pytorch.org/tutorials/beginner/dist_overview.html)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
- en: '**Author**: [Shen Li](https://mrshenli.github.io/)'
id: totrans-2
prefs: []
type: TYPE_NORMAL
- en: Note
id: totrans-3
prefs: []
type: TYPE_NORMAL
- en: '[![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)](../_images/pencil-16.png)
View and edit this tutorial in [github](https://github.com/pytorch/tutorials/blob/main/beginner_source/dist_overview.rst).'
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png) View and edit this
tutorial in [github](https://github.com/pytorch/tutorials/blob/main/beginner_source/dist_overview.rst).'
id: totrans-4
prefs: []
type: TYPE_NORMAL
- en: This is the overview page for the `torch.distributed` package. The goal of this
......@@ -21,14 +26,17 @@
of them. If this is your first time building distributed training applications
using PyTorch, it is recommended to use this document to navigate to the technology
that can best serve your use case.
id: totrans-5
prefs: []
type: TYPE_NORMAL
- en: Introduction
id: totrans-6
prefs:
- PREF_H2
type: TYPE_NORMAL
- en: 'As of PyTorch v1.6.0, features in `torch.distributed` can be categorized into
three main components:'
id: totrans-7
prefs: []
type: TYPE_NORMAL
- en: '[Distributed Data-Parallel Training](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)
......@@ -37,6 +45,7 @@
fed with a different set of input data samples. DDP takes care of gradient communication
to keep model replicas synchronized and overlaps it with the gradient computations
to speed up training.'
id: totrans-8
prefs:
- PREF_UL
type: TYPE_NORMAL
......@@ -46,6 +55,7 @@
and combinations of DDP with other training paradigms. It helps manage remote
object lifetime and extends the [autograd engine](https://pytorch.org/docs/stable/autograd.html)
beyond machine boundaries.'
id: totrans-9
prefs:
- PREF_UL
type: TYPE_NORMAL
......@@ -67,60 +77,78 @@
it also gives up the performance optimizations offered by DDP. [Writing Distributed
Applications with PyTorch](../intermediate/dist_tuto.html) shows examples of using
c10d communication APIs.'
id: totrans-10
prefs:
- PREF_UL
type: TYPE_NORMAL
- en: Data Parallel Training
id: totrans-11
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 数据并行训练
- en: 'PyTorch provides several options for data-parallel training. For applications
that gradually grow from simple to complex and from prototype to production, the
common development trajectory would be:'
id: totrans-12
prefs: []
type: TYPE_NORMAL
zh: PyTorch提供了几种数据并行训练的选项。对于从简单到复杂、从原型到生产逐渐增长的应用程序,常见的开发轨迹是:
- en: Use single-device training if the data and model can fit in one GPU, and training
speed is not a concern.
id: totrans-13
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 如果数据和模型可以适应一个GPU,并且训练速度不是问题,可以使用单设备训练。
- en: Use single-machine multi-GPU [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html)
to make use of multiple GPUs on a single machine to speed up training with minimal
code changes.
id: totrans-14
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 使用单机多GPU [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html)
来利用单台机器上的多个GPU加速训练,只需进行最少的代码更改。
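As a one-line illustration of the minimal code change (layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(10, 5).to(device)

# One line wraps the model; inputs are scattered across visible GPUs and the
# outputs gathered back. With zero or one GPU it simply runs the module.
dp_model = nn.DataParallel(model)

out = dp_model(torch.randn(16, 10).to(device))
```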
- en: Use single-machine multi-GPU [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html),
if you would like to further speed up training and are willing to write a little
more code to set it up.
id: totrans-15
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 如果您希望进一步加快训练速度并愿意写更多代码来设置,可以使用单机多GPU [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)。
- en: Use multi-machine [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)
and the [launching script](https://github.com/pytorch/examples/blob/master/distributed/ddp/README.md),
if the application needs to scale across machine boundaries.
id: totrans-16
prefs:
- PREF_OL
type: TYPE_NORMAL
- en: Use multi-GPU [FullyShardedDataParallel](https://pytorch.org/docs/stable/fsdp.html)
training on a single-machine or multi-machine when the data and model cannot fit
on one GPU.
id: totrans-17
prefs:
- PREF_OL
type: TYPE_NORMAL
- en: Use [torch.distributed.elastic](https://pytorch.org/docs/stable/distributed.elastic.html)
to launch distributed training if errors (e.g., out-of-memory) are expected or
if resources can join and leave dynamically during training.
id: totrans-18
prefs:
- PREF_OL
type: TYPE_NORMAL
- en: Note
id: totrans-19
prefs: []
type: TYPE_NORMAL
- en: Data-parallel training also works with [Automatic Mixed Precision (AMP)](https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-multiple-gpus).
id: totrans-20
prefs: []
type: TYPE_NORMAL
- en: '`torch.nn.DataParallel`'
id: totrans-21
prefs:
- PREF_H3
type: TYPE_NORMAL
......@@ -132,9 +160,11 @@
performance because it replicates the model in every forward pass, and its single-process
multi-thread parallelism naturally suffers from [GIL](https://wiki.python.org/moin/GlobalInterpreterLock)
contention. To get better performance, consider using [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html).'
id: totrans-22
prefs: []
type: TYPE_NORMAL
- en: '`torch.nn.parallel.DistributedDataParallel`'
id: totrans-23
prefs:
- PREF_H3
type: TYPE_NORMAL
......@@ -146,43 +176,58 @@
of in every forward pass, which also helps to speed up training. DDP is shipped
with several performance optimization technologies. For a more in-depth explanation,
refer to this [paper](http://www.vldb.org/pvldb/vol13/p3005-li.pdf) (VLDB’20).
id: totrans-24
prefs: []
type: TYPE_NORMAL
zh: 与[DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html)相比,[DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)需要多一步设置,即调用[init_process_group](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group)。
DDP使用多进程并行,因此模型副本之间没有GIL争用。此外,模型在DDP构建时进行广播,而不是在每次前向传递中进行广播,这也有助于加快训练速度。 DDP配备了几种性能优化技术。有关更深入的解释,请参考这篇[论文](http://www.vldb.org/pvldb/vol13/p3005-li.pdf)(VLDB’20)。
- en: 'DDP materials are listed below:'
id: totrans-25
prefs: []
type: TYPE_NORMAL
zh: DDP材料如下:
- en: '[DDP notes](https://pytorch.org/docs/stable/notes/ddp.html) offer a starter
example and some brief descriptions of its design and implementation. If this
is your first time using DDP, start from this document.'
id: totrans-26
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[DDP笔记](https://pytorch.org/docs/stable/notes/ddp.html) 提供了一个入门示例以及对其设计和实现的简要描述。如果这是您第一次使用DDP,请从这个文档开始。'
- en: '[Getting Started with Distributed Data Parallel](../intermediate/ddp_tutorial.html)
explains some common problems with DDP training, including unbalanced workload,
checkpointing, and multi-device models. Note that, DDP can be easily combined
with single-machine multi-device model parallelism which is described in the [Single-Machine
Model Parallel Best Practices](../intermediate/model_parallel_tutorial.html) tutorial.'
id: totrans-27
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[使用分布式数据并行开始](../intermediate/ddp_tutorial.html) 解释了DDP训练中的一些常见问题,包括负载不平衡、检查点和多设备模型。请注意,DDP可以很容易地与单机多设备模型并行结合,该模型并行在[单机模型并行最佳实践](../intermediate/model_parallel_tutorial.html)教程中有描述。'
- en: The [Launching and configuring distributed data parallel applications](https://github.com/pytorch/examples/blob/main/distributed/ddp/README.md)
document shows how to use the DDP launching script.
id: totrans-28
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[启动和配置分布式数据并行应用程序](https://github.com/pytorch/examples/blob/main/distributed/ddp/README.md)
文档展示了如何使用DDP启动脚本。'
- en: The [Shard Optimizer States With ZeroRedundancyOptimizer](../recipes/zero_redundancy_optimizer.html)
recipe demonstrates how [ZeroRedundancyOptimizer](https://pytorch.org/docs/stable/distributed.optim.html)
helps to reduce optimizer memory footprint.
id: totrans-29
prefs:
- PREF_OL
type: TYPE_NORMAL
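A rough single-process sketch of the idea (gloo backend, arbitrary layer size, placeholder port); each rank keeps only a shard of the optimizer state instead of a full replica:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.optim import ZeroRedundancyOptimizer

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29503"  # placeholder port
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Linear(100, 100)
# Wrap a regular optimizer class; its per-parameter state is sharded across ranks.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.Adam,
    lr=0.01,
)
model(torch.randn(4, 100)).sum().backward()
optimizer.step()

dist.destroy_process_group()
```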
- en: The [Distributed Training with Uneven Inputs Using the Join Context Manager](../advanced/generic_join.html)
tutorial walks through using the generic join context for distributed training
with uneven inputs.
id: totrans-30
prefs:
- PREF_OL
type: TYPE_NORMAL
- en: '`torch.distributed.FullyShardedDataParallel`'
id: totrans-31
prefs:
- PREF_H3
type: TYPE_NORMAL
......@@ -192,9 +237,11 @@
data-parallel workers. The support for FSDP was added starting PyTorch v1.11.
The tutorial [Getting Started with FSDP](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html)
provides in depth explanation and example of how FSDP works.
id: totrans-32
prefs: []
type: TYPE_NORMAL
- en: torch.distributed.elastic
id: totrans-33
prefs:
- PREF_H3
type: TYPE_NORMAL
......@@ -208,9 +255,11 @@
(mismatched `AllReduce` operations) which would then cause a crash or hang. [torch.distributed.elastic](https://pytorch.org/docs/stable/distributed.elastic.html)
adds fault tolerance and the ability to make use of a dynamic pool of machines
(elasticity).
id: totrans-34
prefs: []
type: TYPE_NORMAL
- en: RPC-Based Distributed Training
id: totrans-35
prefs:
- PREF_H2
type: TYPE_NORMAL
......@@ -218,20 +267,24 @@
paradigm, distributed pipeline parallelism, reinforcement learning applications
with multiple observers or agents, etc. [torch.distributed.rpc](https://pytorch.org/docs/stable/rpc.html)
aims at supporting general distributed training scenarios.
id: totrans-36
prefs: []
type: TYPE_NORMAL
- en: '[torch.distributed.rpc](https://pytorch.org/docs/stable/rpc.html) has four
main pillars:'
id: totrans-37
prefs: []
type: TYPE_NORMAL
- en: '[RPC](https://pytorch.org/docs/stable/rpc.html#rpc) supports running a given
function on a remote worker.'
id: totrans-38
prefs:
- PREF_UL
type: TYPE_NORMAL
- en: '[RRef](https://pytorch.org/docs/stable/rpc.html#rref) helps to manage the lifetime
of a remote object. The reference counting protocol is presented in the [RRef
notes](https://pytorch.org/docs/stable/rpc/rref.html#remote-reference-protocol).'
id: totrans-39
prefs:
- PREF_UL
type: TYPE_NORMAL
......@@ -239,28 +292,33 @@
extends the autograd engine beyond machine boundaries. Please refer to [Distributed
Autograd Design](https://pytorch.org/docs/stable/rpc/distributed_autograd.html#distributed-autograd-design)
for more details.'
id: totrans-40
prefs:
- PREF_UL
type: TYPE_NORMAL
- en: '[Distributed Optimizer](https://pytorch.org/docs/stable/rpc.html#module-torch.distributed.optim)
automatically reaches out to all participating workers to update parameters using
gradients computed by the distributed autograd engine.'
id: totrans-41
prefs:
- PREF_UL
type: TYPE_NORMAL
- en: 'RPC Tutorials are listed below:'
id: totrans-42
prefs: []
type: TYPE_NORMAL
- en: The [Getting Started with Distributed RPC Framework](../intermediate/rpc_tutorial.html)
tutorial first uses a simple Reinforcement Learning (RL) example to demonstrate
RPC and RRef. Then, it applies a basic distributed model parallelism to an RNN
example to show how to use distributed autograd and distributed optimizer.
id: totrans-43
prefs:
- PREF_OL
type: TYPE_NORMAL
- en: The [Implementing a Parameter Server Using Distributed RPC Framework](../intermediate/rpc_param_server_tutorial.html)
tutorial borrows the spirit of [HogWild! training](https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf)
and applies it to an asynchronous parameter server (PS) training application.
id: totrans-44
prefs:
- PREF_OL
type: TYPE_NORMAL
......@@ -268,6 +326,7 @@
tutorial extends the single-machine pipeline parallel example (presented in [Single-Machine
Model Parallel Best Practices](../intermediate/model_parallel_tutorial.html))
to a distributed environment and shows how to implement it using RPC.
id: totrans-45
prefs:
- PREF_OL
type: TYPE_NORMAL
......@@ -275,20 +334,24 @@
tutorial demonstrates how to implement RPC batch processing using the [@rpc.functions.async_execution](https://pytorch.org/docs/stable/rpc.html#torch.distributed.rpc.functions.async_execution)
decorator, which can help speed up inference and training. It uses RL and PS examples
similar to those in the above tutorials 1 and 2.
id: totrans-46
prefs:
- PREF_OL
type: TYPE_NORMAL
- en: The [Combining Distributed DataParallel with Distributed RPC Framework](../advanced/rpc_ddp_tutorial.html)
tutorial demonstrates how to combine DDP with RPC to train a model using distributed
data parallelism combined with distributed model parallelism.
id: totrans-47
prefs:
- PREF_OL
type: TYPE_NORMAL
- en: PyTorch Distributed Developers
id: totrans-48
prefs:
- PREF_H2
type: TYPE_NORMAL
- en: If you’d like to contribute to PyTorch Distributed, please refer to our [Developer
Guide](https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md).
id: totrans-49
prefs: []
type: TYPE_NORMAL
......@@ -15,7 +15,7 @@
- en: Note
prefs: []
type: TYPE_NORMAL
- en: '[![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)](../_images/pencil-16.png)
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)
View and edit this tutorial in [github](https://github.com/pytorch/tutorials/blob/main/intermediate_source/ddp_tutorial.rst).'
prefs: []
type: TYPE_NORMAL
......
......@@ -12,7 +12,7 @@
- en: Note
prefs: []
type: TYPE_NORMAL
- en: '[![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)](../_images/pencil-16.png)
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)
View and edit this tutorial in [github](https://github.com/pytorch/tutorials/blob/main/intermediate_source/dist_tuto.rst).'
prefs: []
type: TYPE_NORMAL
......@@ -68,7 +68,7 @@
prefs:
- PREF_H2
type: TYPE_NORMAL
- en: '[![Send and Recv](../Images/f29264b289639882a61fb5c3447b1ecc.png)](../_images/send_recv.png)'
- en: '![Send and Recv](../Images/f29264b289639882a61fb5c3447b1ecc.png)'
prefs: []
type: TYPE_NORMAL
- en: Send and Recv
......@@ -126,13 +126,13 @@
prefs:
- PREF_H2
type: TYPE_NORMAL
- en: '| [![Scatter](../Images/3aa3584628cb0526c8b0e9d02b15d876.png)](../_images/scatter.png)'
- en: '| ![Scatter](../Images/3aa3584628cb0526c8b0e9d02b15d876.png)'
prefs: []
type: TYPE_NORMAL
- en: Scatter
prefs: []
type: TYPE_NORMAL
- en: '| [![Gather](../Images/7e8670a3b7cdc7848394514ef1da090a.png)](../_images/gather.png)'
- en: '| ![Gather](../Images/7e8670a3b7cdc7848394514ef1da090a.png)'
prefs: []
type: TYPE_NORMAL
- en: Gather
......@@ -141,13 +141,13 @@
- en: '|'
prefs: []
type: TYPE_NORMAL
- en: '| [![Reduce](../Images/1c451df4406aea85e640d1ae7df6df31.png)](../_images/reduce.png)'
- en: '| ![Reduce](../Images/1c451df4406aea85e640d1ae7df6df31.png)'
prefs: []
type: TYPE_NORMAL
- en: Reduce
prefs: []
type: TYPE_NORMAL
- en: '| [![All-Reduce](../Images/0ef9693f0008d5a75aa5ac2b542b83ac.png)](../_images/all_reduce.png)'
- en: '| ![All-Reduce](../Images/0ef9693f0008d5a75aa5ac2b542b83ac.png)'
prefs: []
type: TYPE_NORMAL
- en: All-Reduce
......@@ -156,13 +156,13 @@
- en: '|'
prefs: []
type: TYPE_NORMAL
- en: '| [![Broadcast](../Images/525847c9d4b48933cb231204a2d13e0e.png)](../_images/broadcast.png)'
- en: '| ![Broadcast](../Images/525847c9d4b48933cb231204a2d13e0e.png)'
prefs: []
type: TYPE_NORMAL
- en: Broadcast
prefs: []
type: TYPE_NORMAL
- en: '| [![All-Gather](../Images/4a48977cd9545f897942a4a4ef1175ac.png)](../_images/all_gather.png)'
- en: '| ![All-Gather](../Images/4a48977cd9545f897942a4a4ef1175ac.png)'
prefs: []
type: TYPE_NORMAL
- en: All-Gather
......
......@@ -13,7 +13,7 @@
- en: Note
prefs: []
type: TYPE_NORMAL
- en: '[![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)](../_images/pencil-16.png)
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)
View and edit this tutorial in [github](https://github.com/pytorch/tutorials/blob/main/intermediate_source/FSDP_tutorial.rst).'
prefs: []
type: TYPE_NORMAL
......@@ -49,7 +49,7 @@
reduced by internal optimizations like overlapping communication and computation.
prefs: []
type: TYPE_NORMAL
- en: '[![FSDP workflow](../Images/4e33f1b27db65dbfcbcf54cce427e858.png)](../_images/fsdp_workflow.png)'
- en: '![FSDP workflow](../Images/4e33f1b27db65dbfcbcf54cce427e858.png)'
prefs: []
type: TYPE_NORMAL
- en: FSDP Workflow
......@@ -109,7 +109,7 @@
to collect and combine the updated parameter shards.
prefs: []
type: TYPE_NORMAL
- en: '[![FSDP allreduce](../Images/0e1d2209fe5b011d7237cb607289d4f1.png)](../_images/fsdp_sharding.png)'
- en: '![FSDP allreduce](../Images/0e1d2209fe5b011d7237cb607289d4f1.png)'
prefs: []
type: TYPE_NORMAL
- en: FSDP Allreduce
......@@ -210,7 +210,7 @@
AWS EC2 instance with 4 GPUs captured from PyTorch Profiler.
prefs: []
type: TYPE_NORMAL
- en: '[![FSDP peak memory](../Images/c26c3d052bcb9f32ea5c7b3d9500d97a.png)](../_images/FSDP_memory.gif)'
- en: '![FSDP peak memory](../Images/c26c3d052bcb9f32ea5c7b3d9500d97a.png)'
prefs: []
type: TYPE_NORMAL
- en: FSDP Peak Memory Usage
......@@ -265,7 +265,7 @@
compared to FSDP without auto wrap policy applied, from ~75 MB to 66 MB.
prefs: []
type: TYPE_NORMAL
- en: '[![FSDP peak memory](../Images/62842d10a3954d2d247fca536a0d7bfe.png)](../_images/FSDP_autowrap.gif)'
- en: '![FSDP peak memory](../Images/62842d10a3954d2d247fca536a0d7bfe.png)'
prefs: []
type: TYPE_NORMAL
- en: FSDP Peak Memory Usage using Auto_wrap policy
......@@ -309,7 +309,7 @@
AWS EC2 instance with 4 GPUs captured from PyTorch profiler.
prefs: []
type: TYPE_NORMAL
- en: '[![FSDP peak memory](../Images/b7af7a69ededd6326e3de004bb7b1e43.png)](../_images/DDP_memory.gif)'
- en: '![FSDP peak memory](../Images/b7af7a69ededd6326e3de004bb7b1e43.png)'
prefs: []
type: TYPE_NORMAL
- en: DDP Peak Memory Usage using Auto_wrap policy
......
......@@ -13,7 +13,7 @@
- en: Note
prefs: []
type: TYPE_NORMAL
- en: '[![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)](../_images/pencil-16.png)
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)
View and edit this tutorial in [github](https://github.com/pytorch/tutorials/blob/main/intermediate_source/process_group_cpp_extension_tutorial.rst).'
prefs: []
type: TYPE_NORMAL
......
......@@ -12,7 +12,7 @@
- en: Note
prefs: []
type: TYPE_NORMAL
- en: '[![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)](../_images/pencil-16.png)
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)
View and edit this tutorial in [github](https://github.com/pytorch/tutorials/blob/main/intermediate_source/rpc_tutorial.rst).'
prefs: []
type: TYPE_NORMAL
......
......@@ -12,7 +12,7 @@
- en: Note
prefs: []
type: TYPE_NORMAL
- en: '[![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)](../_images/pencil-16.png)
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)
View and edit this tutorial in [github](https://github.com/pytorch/tutorials/blob/main/intermediate_source/rpc_param_server_tutorial.rst).'
prefs: []
type: TYPE_NORMAL
......
......@@ -12,7 +12,7 @@
- en: Note
prefs: []
type: TYPE_NORMAL
- en: '[![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)](../_images/pencil-16.png)
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)
View and edit this tutorial in [github](https://github.com/pytorch/tutorials/blob/main/intermediate_source/dist_pipeline_parallel_tutorial.rst).'
prefs: []
type: TYPE_NORMAL
......
......@@ -12,7 +12,7 @@
- en: Note
prefs: []
type: TYPE_NORMAL
- en: '[![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)](../_images/pencil-16.png)
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)
View and edit this tutorial in [github](https://github.com/pytorch/tutorials/blob/main/intermediate_source/rpc_async_execution.rst).'
prefs: []
type: TYPE_NORMAL
......
......@@ -12,7 +12,7 @@
- en: Note
prefs: []
type: TYPE_NORMAL
- en: '[![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)](../_images/pencil-16.png)
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)
View and edit this tutorial in [github](https://github.com/pytorch/tutorials/blob/main/advanced_source/rpc_ddp_tutorial.rst).'
prefs: []
type: TYPE_NORMAL
......
......@@ -12,7 +12,7 @@
- en: Note
prefs: []
type: TYPE_NORMAL
- en: '[![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)](../_images/pencil-16.png)
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png)
View and edit this tutorial in [github](https://github.com/pytorch/tutorials/blob/main/advanced_source/generic_join.rst).'
prefs: []
type: TYPE_NORMAL
......
......@@ -261,8 +261,8 @@
see the following screens:'
prefs: []
type: TYPE_NORMAL
- en: '[![../_images/deeplabv3_ios.png](../Images/9ac919407ef21251c34a31f8fc79bd32.png)](../_images/deeplabv3_ios.png)
[![../_images/deeplabv3_ios2.png](../Images/48e025cda7e2c4c6a8cfe2a933cfd4f0.png)](../_images/deeplabv3_ios2.png)'
- en: '![../_images/deeplabv3_ios.png](../Images/9ac919407ef21251c34a31f8fc79bd32.png)
![../_images/deeplabv3_ios2.png](../Images/48e025cda7e2c4c6a8cfe2a933cfd4f0.png)'
prefs: []
type: TYPE_NORMAL
- en: Recap
......
......@@ -278,8 +278,8 @@
you will see screens like the following:'
prefs: []
type: TYPE_NORMAL
- en: '[![../_images/deeplabv3_android.png](../Images/1b0ecd17a6617abde8eb2e7e3409bbd0.png)](../_images/deeplabv3_android.png)
[![../_images/deeplabv3_android2.png](../Images/01e9b7b7725f15ac40b77b270306d4f8.png)](../_images/deeplabv3_android2.png)'
- en: '![../_images/deeplabv3_android.png](../Images/1b0ecd17a6617abde8eb2e7e3409bbd0.png)
![../_images/deeplabv3_android2.png](../Images/01e9b7b7725f15ac40b77b270306d4f8.png)'
prefs: []
type: TYPE_NORMAL
- en: Recap
......