提交 0592328f 编写于 作者: 绝不原创的飞龙

2024-02-04 13:00:18

上级 b179fcf7
......@@ -301,14 +301,17 @@
id: totrans-48
prefs: []
type: TYPE_NORMAL
zh: 通常我们建议将一个进程绑定到一个插槽,使该进程不会在不同插槽之间迁移。这样做通常是为了利用本地内存上的高速缓存,并避免远程内存访问,后者可能慢大约2倍。
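A minimal sketch of what such process-level pinning looks like on Linux, assuming socket 0 owns cores 0-27 (verify the actual layout with `lscpu`); intersecting with the currently allowed set keeps the sketch safe on machines with fewer cores:

```python
import os

# Assumption: socket 0 corresponds to cores 0-27 on this machine.
socket0_cores = set(range(28))
target = socket0_cores & os.sched_getaffinity(0)
os.sched_setaffinity(0, target)        # bind this process to socket-0 cores
print(sorted(os.sched_getaffinity(0)))
```

Binding memory as well as CPU is what `numactl --cpunodebind=0 --membind=0 <cmd>` does in one step; the snippet above only constrains the CPU side.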
- en: '[![../_images/6.png](../Images/4883ee88dea607f56c62f6bd09501713.png)](../_images/6.png)'
id: totrans-49
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/6.png](../Images/4883ee88dea607f56c62f6bd09501713.png)](../_images/6.png)'
- en: Figure 1\. Two-socket configuration
id: totrans-50
prefs: []
type: TYPE_NORMAL
zh: 图1. 双插槽配置
- en: Figure 1\. shows a typical two-socket configuration. Notice that each socket
has its own local memory. Sockets are connected to each other via Intel Ultra
Path Interconnect (UPI) which allows each socket to access the local memory of
......@@ -317,14 +320,17 @@
id: totrans-51
prefs: []
type: TYPE_NORMAL
zh: 图1. 显示了一个典型的双插槽配置。请注意,每个插槽都有自己的本地内存。插槽通过Intel Ultra Path Interconnect (UPI)连接到彼此,这允许每个插槽访问另一个插槽的本地内存,称为远程内存。本地内存访问始终比远程内存访问快。
- en: '[![../_images/7.png](../Images/9f8fe3672209c49d3625946847233f4b.png)](../_images/7.png)'
id: totrans-52
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/7.png](../Images/9f8fe3672209c49d3625946847233f4b.png)](../_images/7.png)'
- en: Figure 2.1\. CPU information
id: totrans-53
prefs: []
type: TYPE_NORMAL
zh: 图2.1. CPU信息
- en: Users can get their CPU information by running `lscpu` command on their Linux
machine. Figure 2.1\. shows an example of `lscpu` execution on a machine with
two Intel(R) Xeon(R) Platinum 8180M CPUs. Notice that there are 28 cores per socket,
......@@ -335,14 +341,19 @@
id: totrans-54
prefs: []
type: TYPE_NORMAL
zh: 用户可以通过在他们的Linux机器上运行`lscpu`命令来获取他们的CPU信息。图2.1. 显示了在一台装有两个Intel(R) Xeon(R) Platinum
8180M CPU的机器上执行`lscpu`的示例。请注意,每个插槽有28个核心,每个核心有2个线程(即启用了超线程)。换句话说,除了28个物理核心外,还有28个逻辑核心,每个插槽总共有56个核心。而且有2个插槽,总共有112个核心(`每个核心的线程数`
x `每个插槽的核心数` x `插槽数`)。
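The total of 112 logical CPUs follows directly from the `lscpu` fields; a quick arithmetic cross-check, with the values taken from this example's machine:

```python
# lscpu fields from the example machine (Figure 2.1)
threads_per_core = 2    # hyper-threading enabled
cores_per_socket = 28
sockets = 2

physical_cores = cores_per_socket * sockets                 # 56
total_cpus = threads_per_core * cores_per_socket * sockets  # 112
print(physical_cores, total_cpus)  # → 56 112
```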
- en: '[![../_images/8.png](../Images/401010f7117d9febf2003c41ba5c4559.png)](../_images/8.png)'
id: totrans-55
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/8.png](../Images/401010f7117d9febf2003c41ba5c4559.png)](../_images/8.png)'
- en: Figure 2.2\. CPU information
id: totrans-56
prefs: []
type: TYPE_NORMAL
zh: 图2.2. CPU信息
- en: The 2 sockets are mapped to 2 NUMA nodes (NUMA node 0, NUMA node 1) respectively.
Physical cores are indexed prior to logical cores. As shown in Figure 2.2., the
first 28 physical cores (0-27) and the first 28 logical cores (56-83) on the first
......@@ -353,35 +364,43 @@
id: totrans-57
prefs: []
type: TYPE_NORMAL
zh: 这2个插槽分别映射到2个NUMA节点(NUMA节点0,NUMA节点1)。物理核心先于逻辑核心编号。如图2.2.所示,第一个插槽上的前28个物理核心(0-27)和前28个逻辑核心(56-83)位于NUMA节点0。第二个插槽上的28个物理核心(28-55)和28个逻辑核心(84-111)位于NUMA节点1。同一插槽上的核心共享本地内存和最后一级缓存(LLC),这比通过Intel
  UPI进行的跨插槽通信快得多。
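The core-to-node mapping described above can be sketched as a small lookup, assuming this machine's layout (physical cores indexed before logical ones, 28 cores per socket, 2 sockets):

```python
def numa_node(cpu: int, cores_per_socket: int = 28, sockets: int = 2) -> int:
    # Logical core k+56 shares a physical core (hence a socket) with core k.
    physical = cpu % (cores_per_socket * sockets)
    return physical // cores_per_socket

# cores 0-27 and 56-83 -> node 0; cores 28-55 and 84-111 -> node 1
print([numa_node(c) for c in (0, 27, 56, 83, 28, 55, 84, 111)])
# → [0, 0, 0, 0, 1, 1, 1, 1]
```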
- en: Now that we understand NUMA, cross-socket (UPI) traffic, local vs. remote memory
access in multi-processor systems, let’s profile and verify our understanding.
id: totrans-58
prefs: []
type: TYPE_NORMAL
zh: 现在我们了解了NUMA、跨插槽(UPI)流量以及多处理器系统中的本地与远程内存访问,让我们进行性能分析并验证我们的理解。
- en: '**Exercise**'
id: totrans-59
prefs: []
type: TYPE_NORMAL
zh: '**练习**'
- en: We’ll reuse the ResNet50 example above.
id: totrans-60
prefs: []
type: TYPE_NORMAL
zh: 我们将重用上面的ResNet50示例。
- en: As we did not pin threads to processor cores of a specific socket, the operating
system periodically schedules threads on processor cores located in different
sockets.
id: totrans-61
prefs: []
type: TYPE_NORMAL
zh: 由于我们没有将线程固定到特定插槽的处理器核心上,操作系统会定期将线程调度到位于不同插槽中的处理器核心上。
- en: '[![../_images/9.gif](../Images/34e582c1c262c693f8472ce4254570ba.png)](../_images/9.gif)'
id: totrans-62
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/9.gif](../Images/34e582c1c262c693f8472ce4254570ba.png)](../_images/9.gif)'
- en: Figure 3\. CPU usage of non NUMA-aware application. 1 main worker thread was
launched, then it launched a physical core number (56) of threads on all cores,
including logical cores.
id: totrans-63
prefs: []
type: TYPE_NORMAL
zh: 图3. 非NUMA感知应用程序的CPU使用情况。启动了1个主工作线程,随后它在包括逻辑核心在内的所有核心上启动了数量等于物理核心数(56)的线程。
- en: '(Aside: If the number of threads is not set by [torch.set_num_threads](https://pytorch.org/docs/stable/generated/torch.set_num_threads.html),
the default number of threads is the number of physical cores in a hyperthreading
enabled system. This can be verified by [torch.get_num_threads](https://pytorch.org/docs/stable/generated/torch.get_num_threads.html).
......@@ -389,24 +408,29 @@
id: totrans-64
prefs: []
type: TYPE_NORMAL
zh: (附注:如果线程数未通过[torch.set_num_threads](https://pytorch.org/docs/stable/generated/torch.set_num_threads.html)设置,那么默认线程数是启用超线程系统中的物理核心数。这可以通过[torch.get_num_threads](https://pytorch.org/docs/stable/generated/torch.get_num_threads.html)来验证。因此,我们看到大约一半的核心忙于运行示例脚本。)
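The default described in the aside (thread count == physical core count) can be cross-checked on Linux without PyTorch by counting unique (package, core) pairs in sysfs; when `torch.set_num_threads` was never called, `torch.get_num_threads()` should match this number. A sketch, assuming a Linux sysfs layout:

```python
import glob

def count_physical_cores() -> int:
    # Each logical CPU exposes its package id and core id; hyper-threaded
    # siblings share both, so the set size is the physical core count.
    cores = set()
    for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/topology"):
        with open(path + "/physical_package_id") as f:
            pkg = f.read().strip()
        with open(path + "/core_id") as f:
            core = f.read().strip()
        cores.add((pkg, core))
    return len(cores)

print(count_physical_cores())
```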
- en: '[![../_images/10.png](../Images/19d05986d8f0236e3463065f779359e1.png)](../_images/10.png)'
id: totrans-65
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/10.png](../Images/19d05986d8f0236e3463065f779359e1.png)](../_images/10.png)'
- en: Figure 4\. Non-Uniform Memory Access Analysis graph
id: totrans-66
prefs: []
type: TYPE_NORMAL
zh: 图4. 非均匀内存访问分析图
- en: Figure 4\. compares local vs. remote memory access over time. We verify usage
of remote memory which could result in sub-optimal performance.
id: totrans-67
prefs: []
type: TYPE_NORMAL
zh: 图4. 比较了随时间变化的本地与远程内存访问。我们验证了远程内存的使用,这可能导致性能不佳。
- en: '**Set thread affinity to reduce remote memory access and cross-socket (UPI)
traffic**'
id: totrans-68
prefs: []
type: TYPE_NORMAL
zh: '**设置线程亲和性以减少远程内存访问和跨插槽(UPI)流量**'
- en: Pinning threads to cores on the same socket helps maintain locality of memory
access. In this example, we’ll pin to the physical cores on the first NUMA node
(0-27). With the launch script, users can easily experiment with NUMA nodes configuration
......@@ -414,18 +438,22 @@
id: totrans-69
prefs: []
type: TYPE_NORMAL
zh: 将线程固定到同一插槽上的核心有助于保持内存访问的局部性。在这个例子中,我们将线程固定到第一个NUMA节点上的物理核心(0-27)。借助启动脚本,用户只需切换启动脚本的`--node_id`参数即可轻松尝试不同的NUMA节点配置。
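A sketch of building such a launcher invocation; the module path (`intel_extension_for_pytorch.cpu.launch`) is the launch script this kind of tutorial typically uses, and `rn50_inference.py` is a placeholder script name, both assumptions to adapt to your setup:

```python
import shlex

node_id = 0  # pin workers to NUMA node 0 (physical cores 0-27 on this machine)
cmd = (f"python -m intel_extension_for_pytorch.cpu.launch "
       f"--node_id {node_id} rn50_inference.py")
print(shlex.split(cmd))
```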
- en: Let’s visualize the CPU usage now.
id: totrans-70
prefs: []
type: TYPE_NORMAL
zh: 让我们现在来可视化CPU使用情况。
- en: '[![../_images/11.gif](../Images/7dad920dd5537d51cf51649c1e4da9d5.png)](../_images/11.gif)'
id: totrans-71
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/11.gif](../Images/7dad920dd5537d51cf51649c1e4da9d5.png)](../_images/11.gif)'
- en: Figure 5\. CPU usage of NUMA-aware application
id: totrans-72
prefs: []
type: TYPE_NORMAL
zh: 图5. NUMA感知应用程序的CPU使用情况
- en: 1 main worker thread was launched, then it launched threads on all physical
cores on the first numa node.
id: totrans-73
......@@ -582,11 +610,13 @@
id: totrans-96
prefs: []
type: TYPE_NORMAL
zh: 未被微操作(uOps)填充的空流水线槽位被归因于停顿。例如,在没有核心固定的情况下,CPU使用率可能并未有效用于计算,而是用于Linux内核的其他操作,如线程调度。我们可以看到`__sched_yield`贡献了大部分自旋时间。
- en: Thread Migration
id: totrans-97
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 线程迁移
- en: Without core pinning, scheduler may migrate thread executing on a core to a
different core. Thread migration can disassociate the thread from data that has
already been fetched into the caches resulting in longer data access latencies.
......@@ -596,10 +626,12 @@
id: totrans-98
prefs: []
type: TYPE_NORMAL
zh: 在没有核心固定的情况下,调度程序可能会将正在某个核心上执行的线程迁移到另一个核心。线程迁移会使线程与已被取入缓存的数据脱离,导致更长的数据访问延迟。在NUMA系统中,当线程跨插槽迁移时,这个问题会加剧:已被取入本地内存高速缓存的数据此时变成了远程内存,访问速度要慢得多。
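Per-thread pinning (as opposed to the process-level binding shown earlier) is what prevents this migration; a minimal sketch using Linux thread affinity, with the core ids as assumptions and a graceful fallback when a core is not available:

```python
import os
import threading

results = []

def pinned_worker(core: int):
    # Pin this OS thread (by native id, not the whole process) to one core
    # so the scheduler cannot migrate it away from its cached data.
    if core in os.sched_getaffinity(0):
        os.sched_setaffinity(threading.get_native_id(), {core})
    results.append(sorted(os.sched_getaffinity(threading.get_native_id())))

threads = [threading.Thread(target=pinned_worker, args=(c,)) for c in (0, 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```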
- en: '[![../_images/17.png](../Images/2a12b3c78a1e37b95144e879414c7bee.png)](../_images/17.png)'
id: totrans-99
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/17.png](../Images/2a12b3c78a1e37b95144e879414c7bee.png)](../_images/17.png)'
- en: Generally the total number of threads should be less than or equal to the total
number of threads supported by the core. In the above example, we notice a large
number of threads executing on core_51 instead of the expected 2 threads (since
......@@ -608,10 +640,13 @@
id: totrans-100
prefs: []
type: TYPE_NORMAL
zh: 通常,总线程数应小于或等于核心所支持的总线程数。在上面的示例中,我们注意到大量线程在core_51上执行,而不是预期的2个线程(因为Intel(R) Xeon(R)
  Platinum 8180 CPU启用了超线程)。这表明发生了线程迁移。
- en: '[![../_images/18.png](../Images/3b6d2c15f792114de5f771ac3eabe97e.png)](../_images/18.png)'
id: totrans-101
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/18.png](../Images/3b6d2c15f792114de5f771ac3eabe97e.png)](../_images/18.png)'
- en: Additionally, notice that thread (TID:97097) was executing on a large number
of CPU cores, indicating CPU migration. For example, this thread was executing
on cpu_81, then migrated to cpu_14, then migrated to cpu_5, and so on. Furthermore,
......@@ -622,33 +657,39 @@
id: totrans-102
prefs: []
type: TYPE_NORMAL
zh: 此外,请注意线程(TID:97097)正在大量CPU核心上执行,表明CPU迁移。例如,此线程在cpu_81上执行,然后迁移到cpu_14,然后迁移到cpu_5,依此类推。此外,请注意此线程多次在不同插槽之间迁移,导致内存访问非常低效。例如,此线程在cpu_70(NUMA节点0)上执行,然后迁移到cpu_100(NUMA节点1),然后迁移到cpu_24(NUMA节点0)。
- en: Non Uniform Memory Access Analysis
id: totrans-103
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 非均匀内存访问分析
- en: '[![../_images/19.png](../Images/eb535b6f0a49e85a2f6c69b3307eb58d.png)](../_images/19.png)'
id: totrans-104
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/19.png](../Images/eb535b6f0a49e85a2f6c69b3307eb58d.png)](../_images/19.png)'
- en: Compare local vs. remote memory access over time. We observe that about half,
51.09%, of the memory accesses were remote accesses, indicating sub-optimal NUMA
configuration.
id: totrans-105
prefs: []
type: TYPE_NORMAL
zh: 比较随时间变化的本地与远程内存访问。我们观察到大约一半(51.09%)的内存访问是远程访问,表明NUMA配置不佳。
- en: 2\. torch.set_num_threads = `number of physical cores / number of workers` (no
core pinning)
id: totrans-106
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 2\. torch.set_num_threads = `物理核心数/工作线程数`(无核心固定)
- en: 'For an apple-to-apple comparison with launcher’s core pinning, we’ll set the
number of threads to the number of cores divided by the number of workers (launcher
does this internally). Add the following code snippet in the [base_handler](https://github.com/pytorch/serve/blob/master/ts/torch_handler/base_handler.py):'
id: totrans-107
prefs: []
type: TYPE_NORMAL
zh: 为了与启动器的核心固定进行同等条件的比较,我们将线程数设置为核心数除以工作线程数(启动器在内部执行此操作)。在[base_handler](https://github.com/pytorch/serve/blob/master/ts/torch_handler/base_handler.py)中添加以下代码片段:
- en: '[PRE1]'
id: totrans-108
prefs: []
......@@ -660,47 +701,57 @@
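A sketch of what that handler snippet amounts to: give each of the 4 TorchServe workers an equal share of the 56 physical cores. The worker count and core count are values from this example's machine; derive yours from `lscpu` and your TorchServe configuration.

```python
import torch

num_workers = 4
num_physical_cores = 56  # 28 cores/socket x 2 sockets, from lscpu

# Each worker gets an equal slice of the physical cores for intra-op parallelism.
torch.set_num_threads(num_physical_cores // num_workers)  # 14 threads per worker
print(torch.get_num_threads())
```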
id: totrans-109
prefs: []
type: TYPE_NORMAL
zh: 与之前一样,在没有核心固定的情况下,这些线程未与特定CPU核心关联,操作系统会周期性地将线程调度到位于不同插槽的核心上。
- en: CPU usage
id: totrans-110
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: CPU使用率
- en: '[![../_images/20.gif](../Images/324a4abab610044d4447be1f203420e5.png)](../_images/20.gif)'
id: totrans-111
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/20.gif](../Images/324a4abab610044d4447be1f203420e5.png)](../_images/20.gif)'
- en: 4 main worker threads were launched, then each launched a `num_physical_cores/num_workers`
number (14) of threads on all cores, including logical cores.
id: totrans-112
prefs: []
type: TYPE_NORMAL
zh: 启动了4个主工作线程,随后每个线程在包括逻辑核心在内的所有核心上启动了`num_physical_cores/num_workers`(14)个线程。
- en: Core Bound stalls
id: totrans-113
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 核心绑定停顿
- en: '[![../_images/21.png](../Images/c8d610422cd75e7798fb7a63febc2cf1.png)](../_images/21.png)'
id: totrans-114
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/21.png](../Images/c8d610422cd75e7798fb7a63febc2cf1.png)](../_images/21.png)'
- en: Although the percentage of Core Bound stalls has decreased from 88.4% to 73.5%,
the Core Bound is still very high.
id: totrans-115
prefs: []
type: TYPE_NORMAL
zh: 尽管核心绑定停顿的百分比从88.4%降至73.5%,但核心绑定仍然非常高。
- en: '[![../_images/22.png](../Images/38f5ccd2ad70dfa738cd25d4795a2103.png)](../_images/22.png)[![../_images/23.png](../Images/27c04f2268b595a2bd5299b190141526.png)](../_images/23.png)'
id: totrans-116
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/22.png](../Images/38f5ccd2ad70dfa738cd25d4795a2103.png)](../_images/22.png)[![../_images/23.png](../Images/27c04f2268b595a2bd5299b190141526.png)](../_images/23.png)'
- en: Thread Migration
id: totrans-117
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 线程迁移
- en: '[![../_images/24.png](../Images/f5b0a04ed3cae12a3b77548c56420027.png)](../_images/24.png)'
id: totrans-118
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/24.png](../Images/f5b0a04ed3cae12a3b77548c56420027.png)](../_images/24.png)'
- en: Similar as before, without core pinning thread (TID:94290) was executing on
a large number of CPU cores, indicating CPU migration. We notice again cross-socket
thread migration, resulting in very inefficient memory access. For example, this
......@@ -708,15 +759,18 @@
id: totrans-119
prefs: []
type: TYPE_NORMAL
zh: 与之前类似,没有核心固定时,线程(TID:94290)在大量CPU核心上执行,表明CPU迁移。我们再次注意到跨插槽的线程迁移,导致内存访问非常低效。例如,此线程在cpu_78(NUMA节点0)上执行,然后迁移到cpu_108(NUMA节点1)。
- en: Non Uniform Memory Access Analysis
id: totrans-120
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 非均匀内存访问分析
- en: '[![../_images/25.png](../Images/6d7d4b4277221a27dd70ad2125f77d31.png)](../_images/25.png)'
id: totrans-121
prefs: []
type: TYPE_NORMAL
zh: '[![../_images/25.png](../Images/6d7d4b4277221a27dd70ad2125f77d31.png)](../_images/25.png)'
- en: Although an improvement from the original 51.09%, still 40.45% of memory access
is remote, indicating sub-optimal NUMA configuration.
id: totrans-122
......