提交 a60b5fd8 编写于 2024-02-04 14:21:34 作者:绝不原创的飞龙

上级 26c2d4ae
......@@ -3,24 +3,30 @@
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: PyTorch 分布式概述
- en: 原文:[https://pytorch.org/tutorials/beginner/dist_overview.html](https://pytorch.org/tutorials/beginner/dist_overview.html)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
zh: 原文:[https://pytorch.org/tutorials/beginner/dist_overview.html](https://pytorch.org/tutorials/beginner/dist_overview.html)
- en: '**Author**: [Shen Li](https://mrshenli.github.io/)'
id: totrans-2
prefs: []
type: TYPE_NORMAL
zh: '**作者**:[Shen Li](https://mrshenli.github.io/)'
- en: Note
id: totrans-3
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png) View and edit this
tutorial in [github](https://github.com/pytorch/tutorials/blob/main/beginner_source/dist_overview.rst).'
id: totrans-4
prefs: []
type: TYPE_NORMAL
zh: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png) 在 [github](https://github.com/pytorch/tutorials/blob/main/beginner_source/dist_overview.rst) 中查看并编辑本教程。'
- en: This is the overview page for the `torch.distributed` package. The goal of this
page is to categorize documents into different topics and briefly describe each
of them. If this is your first time building distributed training applications
......@@ -29,16 +35,19 @@
id: totrans-5
prefs: []
type: TYPE_NORMAL
zh: 这是 `torch.distributed` 包的概述页面。本页面的目标是将文档分类为不同主题,并简要描述每个主题。如果这是您第一次使用PyTorch构建分布式训练应用程序,建议使用本文档导航到最适合您用例的技术。
- en: Introduction
id: totrans-6
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 介绍
- en: 'As of PyTorch v1.6.0, features in `torch.distributed` can be categorized into
three main components:'
id: totrans-7
prefs: []
type: TYPE_NORMAL
zh: 截至PyTorch v1.6.0,`torch.distributed` 中的功能可以分为三个主要组件:
- en: '[Distributed Data-Parallel Training](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)
(DDP) is a widely adopted single-program multiple-data training paradigm. With
DDP, the model is replicated on every process, and every model replica will be
......@@ -49,6 +58,7 @@
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[分布式数据并行训练](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)(DDP)是一种广泛采用的单程序多数据训练范式。使用DDP,模型在每个进程上被复制,并且每个模型副本将被提供不同的输入数据样本。DDP负责梯度通信以保持模型副本同步,并将其与梯度计算重叠以加快训练速度。'
- en: '[RPC-Based Distributed Training](https://pytorch.org/docs/stable/rpc.html)
(RPC) supports general training structures that cannot fit into data-parallel
training such as distributed pipeline parallelism, parameter server paradigm,
......@@ -59,6 +69,7 @@
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[基于RPC的分布式训练](https://pytorch.org/docs/stable/rpc.html)(RPC)支持那些无法纳入数据并行训练的通用训练结构,例如分布式管道并行、参数服务器范式,以及DDP与其他训练范式的组合。它有助于管理远程对象的生命周期,并将[autograd引擎](https://pytorch.org/docs/stable/autograd.html)扩展到机器边界之外。'
- en: '[Collective Communication](https://pytorch.org/docs/stable/distributed.html)
(c10d) library supports sending tensors across processes within a group. It offers
both collective communication APIs (e.g., [all_reduce](https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_reduce)
......@@ -81,6 +92,9 @@
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[集体通信](https://pytorch.org/docs/stable/distributed.html)(c10d)库支持在组内的进程之间发送张量。它提供了集体通信API(例如,[all_reduce](https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_reduce)和[all_gather](https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_gather))以及P2P通信API(例如,[send](https://pytorch.org/docs/stable/distributed.html#torch.distributed.send)和[isend](https://pytorch.org/docs/stable/distributed.html#torch.distributed.isend))。DDP和RPC([ProcessGroup
Backend](https://pytorch.org/docs/stable/rpc.html#process-group-backend))构建在c10d之上,前者使用集体通信,后者使用P2P通信。通常,开发人员不需要直接使用这个原始通信API,因为DDP和RPC
API可以满足许多分布式训练场景。然而,仍有一些用例可以从这个API中获益。一个例子是分布式参数平均化,应用程序希望在反向传播后计算所有模型参数的平均值,而不是使用DDP来通信梯度。这可以将通信与计算分离,并允许更精细地控制要通信的内容,但另一方面,也放弃了DDP提供的性能优化。[使用PyTorch编写分布式应用程序](../intermediate/dist_tuto.html)展示了使用c10d通信API的示例。'
- en: Data Parallel Training
id: totrans-11
prefs:
......@@ -125,6 +139,7 @@
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 如果应用程序需要跨机器边界扩展,请使用多机器[DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)和[启动脚本](https://github.com/pytorch/examples/blob/master/distributed/ddp/README.md)。
- en: Use multi-GPU [FullyShardedDataParallel](https://pytorch.org/docs/stable/fsdp.html)
training on a single-machine or multi-machine when the data and model cannot fit
on one GPU.
......@@ -132,6 +147,7 @@
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 当数据和模型无法放入一个GPU时,在单机或多机上使用多GPU的[FullyShardedDataParallel](https://pytorch.org/docs/stable/fsdp.html)训练。
- en: Use [torch.distributed.elastic](https://pytorch.org/docs/stable/distributed.elastic.html)
to launch distributed training if errors (e.g., out-of-memory) are expected or
if resources can join and leave dynamically during training.
......@@ -139,19 +155,23 @@
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 如果预期会出现错误(例如内存不足),或者资源在训练过程中可以动态加入和离开,请使用[torch.distributed.elastic](https://pytorch.org/docs/stable/distributed.elastic.html)来启动分布式训练。
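A hedged sketch of the entry point such a launcher expects; `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` are environment variables set by torchrun, and the training loop is omitted:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun / torch.distributed.elastic supply the rendezvous info via env://
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(10, 10).to(local_rank), device_ids=[local_rank])
    # ... training loop; on failure, torchrun can tear down and restart all workers ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # e.g. launched with: torchrun --nproc_per_node=4 train.py
```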
- en: Note
id: totrans-19
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: Data-parallel training also works with [Automatic Mixed Precision (AMP)](https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-multiple-gpus).
id: totrans-20
prefs: []
type: TYPE_NORMAL
zh: 数据并行训练也可以与[Automatic Mixed Precision (AMP)](https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-multiple-gpus)一起使用。
- en: '`torch.nn.DataParallel`'
id: totrans-21
prefs:
- PREF_H3
type: TYPE_NORMAL
zh: '`torch.nn.DataParallel`'
- en: 'The [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html)
package enables single-machine multi-GPU parallelism with the lowest coding hurdle.
It only requires a one-line change to the application code. The tutorial [Optional:
......@@ -163,11 +183,16 @@
id: totrans-22
prefs: []
type: TYPE_NORMAL
zh: '[DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html)
包能够在单机多GPU上实现并行计算,且编码难度最低。只需要在应用代码中进行一行更改。教程 [Optional: Data Parallelism](../beginner/blitz/data_parallel_tutorial.html)
展示了一个例子。虽然 `DataParallel` 很容易使用,但通常性能不是最佳的,因为它在每次前向传播中都会复制模型,并且其单进程多线程并行自然受到 [GIL](https://wiki.python.org/moin/GlobalInterpreterLock)
的影响。为了获得更好的性能,考虑使用 [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)。'
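As a quick illustration of the one-line change mentioned above (a sketch; the toy module and tensor shapes are arbitrary):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 5)                # any existing nn.Module
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)      # the one-line change: replicate across visible GPUs
model = model.to("cuda")

output = model(torch.randn(8, 10, device="cuda"))  # the input batch is scattered across GPUs
```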
- en: '`torch.nn.parallel.DistributedDataParallel`'
id: totrans-23
prefs:
- PREF_H3
type: TYPE_NORMAL
zh: '`torch.nn.parallel.DistributedDataParallel`'
- en: Compared to [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html),
[DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)
requires one more step to set up, i.e., calling [init_process_group](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group).
......@@ -219,6 +244,8 @@
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[Shard Optimizer States With ZeroRedundancyOptimizer](../recipes/zero_redundancy_optimizer.html)
配方演示了如何使用[ZeroRedundancyOptimizer](https://pytorch.org/docs/stable/distributed.optim.html)来减少优化器的内存占用。'
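A minimal single-machine sketch combining the two pieces discussed above: calling init_process_group before wrapping the model in DDP, and sharding optimizer state with ZeroRedundancyOptimizer. The toy model and spawn scaffolding are illustrative only:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank: int, world_size: int):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)  # the extra setup step
    torch.cuda.set_device(rank)

    model = DDP(torch.nn.Linear(10, 10).to(rank), device_ids=[rank])
    optimizer = ZeroRedundancyOptimizer(
        model.parameters(), optimizer_class=torch.optim.Adam, lr=1e-3)

    out = model(torch.randn(20, 10, device=rank))
    out.sum().backward()   # DDP overlaps gradient all-reduce with the backward pass
    optimizer.step()       # optimizer states are sharded across ranks
    dist.destroy_process_group()

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    mp.spawn(worker, args=(n_gpus,), nprocs=n_gpus)
```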
- en: The [Distributed Training with Uneven Inputs Using the Join Context Manager](../advanced/generic_join.html)
tutorial walks through using the generic join context for distributed training
with uneven inputs.
......@@ -226,11 +253,14 @@
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[Distributed Training with Uneven Inputs Using the Join Context Manager](../advanced/generic_join.html)
教程介绍了如何使用通用的连接上下文管理器进行不均匀输入的分布式训练。'
- en: '`torch.distributed.FullyShardedDataParallel`'
id: totrans-31
prefs:
- PREF_H3
type: TYPE_NORMAL
zh: '`torch.distributed.FullyShardedDataParallel`'
- en: The [FullyShardedDataParallel](https://pytorch.org/docs/stable/fsdp.html) (FSDP)
is a type of data parallelism paradigm which maintains a per-GPU copy of a model’s
parameters, gradients and optimizer states, it shards all of these states across
......@@ -240,11 +270,14 @@
id: totrans-32
prefs: []
type: TYPE_NORMAL
zh: '[FullyShardedDataParallel](https://pytorch.org/docs/stable/fsdp.html)(FSDP)是一种数据并行范例,它在每个GPU上维护模型参数、梯度和优化器状态的副本,将所有这些状态分片到数据并行工作器中。对FSDP的支持从PyTorch
v1.11开始添加。教程[Getting Started with FSDP](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html)提供了关于FSDP如何工作的深入解释和示例。'
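A hedged sketch of wrapping a model with FSDP (available since PyTorch v1.11); it assumes the process group is already initialized and each rank owns one GPU:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# assumes dist.init_process_group(...) and torch.cuda.set_device(rank) have already run
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).cuda()

sharded_model = FSDP(model)  # parameters, gradients and optimizer state get sharded across ranks
optimizer = torch.optim.Adam(sharded_model.parameters(), lr=1e-3)
```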
- en: torch.distributed.elastic
id: totrans-33
prefs:
- PREF_H3
type: TYPE_NORMAL
zh: torch.distributed.elastic
- en: With the growth of the application complexity and scale, failure recovery becomes
a requirement. Sometimes it is inevitable to hit errors like out-of-memory (OOM)
when using DDP, but DDP itself cannot recover from those errors, and it is not
......@@ -258,11 +291,14 @@
id: totrans-34
prefs: []
type: TYPE_NORMAL
zh: 随着应用程序复杂性和规模的增长,故障恢复变得必不可少。在使用DDP时,有时会不可避免地遇到诸如内存溢出(OOM)等错误,但DDP本身无法从这些错误中恢复,也无法使用标准的`try-except`结构来处理它们。这是因为DDP要求所有进程以密切同步的方式运行,并且在不同进程中启动的所有`AllReduce`通信必须匹配。如果组中的一个进程抛出异常,很可能会导致不同步(不匹配的`AllReduce`操作),从而导致崩溃或挂起。[torch.distributed.elastic](https://pytorch.org/docs/stable/distributed.elastic.html)
添加了容错性和利用动态机器池(弹性)的能力。
- en: RPC-Based Distributed Training
id: totrans-35
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 基于RPC的分布式训练
- en: Many training paradigms do not fit into data parallelism, e.g., parameter server
paradigm, distributed pipeline parallelism, reinforcement learning applications
with multiple observers or agents, etc. [torch.distributed.rpc](https://pytorch.org/docs/stable/rpc.html)
......@@ -270,17 +306,21 @@
id: totrans-36
prefs: []
type: TYPE_NORMAL
zh: 许多训练范式并不适合数据并行,例如参数服务器范式、分布式管道并行、具有多个观察者或代理的强化学习应用等。[torch.distributed.rpc](https://pytorch.org/docs/stable/rpc.html) 旨在支持一般的分布式训练场景。
- en: '[torch.distributed.rpc](https://pytorch.org/docs/stable/rpc.html) has four
main pillars:'
id: totrans-37
prefs: []
type: TYPE_NORMAL
zh: '[torch.distributed.rpc](https://pytorch.org/docs/stable/rpc.html) 有四个主要支柱:'
- en: '[RPC](https://pytorch.org/docs/stable/rpc.html#rpc) supports running a given
function on a remote worker.'
id: totrans-38
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[RPC](https://pytorch.org/docs/stable/rpc.html#rpc) 支持在远程工作节点上运行给定函数。'
- en: '[RRef](https://pytorch.org/docs/stable/rpc.html#rref) helps to manage the lifetime
of a remote object. The reference counting protocol is presented in the [RRef
notes](https://pytorch.org/docs/stable/rpc/rref.html#remote-reference-protocol).'
......@@ -288,6 +328,7 @@
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[RRef](https://pytorch.org/docs/stable/rpc.html#rref) 帮助管理远程对象的生命周期。引用计数协议在[RRef笔记](https://pytorch.org/docs/stable/rpc/rref.html#remote-reference-protocol)中介绍。'
- en: '[Distributed Autograd](https://pytorch.org/docs/stable/rpc.html#distributed-autograd-framework)
extends the autograd engine beyond machine boundaries. Please refer to [Distributed
Autograd Design](https://pytorch.org/docs/stable/rpc/distributed_autograd.html#distributed-autograd-design)
......@@ -296,6 +337,8 @@
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[Distributed Autograd](https://pytorch.org/docs/stable/rpc.html#distributed-autograd-framework)
将自动求导引擎扩展到机器边界之外。更多细节请参考[Distributed Autograd Design](https://pytorch.org/docs/stable/rpc/distributed_autograd.html#distributed-autograd-design)。'
- en: '[Distributed Optimizer](https://pytorch.org/docs/stable/rpc.html#module-torch.distributed.optim)
automatically reaches out to all participating workers to update parameters using
gradients computed by the distributed autograd engine.'
......@@ -303,10 +346,13 @@
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[Distributed Optimizer](https://pytorch.org/docs/stable/rpc.html#module-torch.distributed.optim)
自动联系所有参与的工作节点,使用分布式自动求导引擎计算的梯度来更新参数。'
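A hedged two-worker sketch touching the first two pillars (`rpc_sync` for remote execution, `remote` for an RRef); `add` and the spawn scaffolding are illustrative, while `init_rpc`, `rpc_sync`, `remote`, and `to_here` are the real torch.distributed.rpc APIs:

```python
import os
import torch
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

def add(x, y):
    return x + y

def run(rank: int, world_size: int):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    if rank == 0:
        # RPC: run a function on a remote worker and wait for the result.
        out = rpc.rpc_sync("worker1", add, args=(torch.ones(2), torch.ones(2)))
        # RRef: keep a reference to the remote result instead of copying it back.
        rref = rpc.remote("worker1", add, args=(torch.ones(2), torch.ones(2)))
        print(out, rref.to_here())
    rpc.shutdown()  # blocks until all workers are done

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
```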
- en: 'RPC Tutorials are listed below:'
id: totrans-42
prefs: []
type: TYPE_NORMAL
zh: RPC教程如下:
- en: The [Getting Started with Distributed RPC Framework](../intermediate/rpc_tutorial.html)
tutorial first uses a simple Reinforcement Learning (RL) example to demonstrate
RPC and RRef. Then, it applies a basic distributed model parallelism to an RNN
......@@ -315,6 +361,7 @@
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[使用分布式RPC框架入门](../intermediate/rpc_tutorial.html) 教程首先使用一个简单的强化学习(RL)示例来演示RPC和RRef。然后,它将基本的分布式模型并行应用到一个RNN示例中,展示如何使用分布式自动求导和分布式优化器。'
- en: The [Implementing a Parameter Server Using Distributed RPC Framework](../intermediate/rpc_param_server_tutorial.html)
tutorial borrows the spirit of [HogWild! training](https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf)
and applies it to an asynchronous parameter server (PS) training application.
......@@ -322,6 +369,7 @@
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[使用分布式RPC框架实现参数服务器](../intermediate/rpc_param_server_tutorial.html) 教程借鉴了[HogWild!训练](https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf)的精神,并将其应用于异步参数服务器(PS)训练应用。'
- en: The [Distributed Pipeline Parallelism Using RPC](../intermediate/dist_pipeline_parallel_tutorial.html)
tutorial extends the single-machine pipeline parallel example (presented in [Single-Machine
Model Parallel Best Practices](../intermediate/model_parallel_tutorial.html))
......@@ -330,6 +378,7 @@
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[使用RPC实现分布式管道并行性](../intermediate/dist_pipeline_parallel_tutorial.html) 教程将单机管道并行示例(在[单机模型并行最佳实践](../intermediate/model_parallel_tutorial.html)中介绍)扩展到分布式环境,并展示如何使用RPC实现它。'
- en: The [Implementing Batch RPC Processing Using Asynchronous Executions](../intermediate/rpc_async_execution.html)
tutorial demonstrates how to implement RPC batch processing using the [@rpc.functions.async_execution](https://pytorch.org/docs/stable/rpc.html#torch.distributed.rpc.functions.async_execution)
decorator, which can help speed up inference and training. It uses RL and PS examples
......@@ -338,6 +387,7 @@
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[使用异步执行实现批量RPC处理的教程](../intermediate/rpc_async_execution.html)演示了如何使用[@rpc.functions.async_execution](https://pytorch.org/docs/stable/rpc.html#torch.distributed.rpc.functions.async_execution)装饰器实现RPC批处理,这可以帮助加速推理和训练。它使用了类似于上述教程1和2中的RL和PS示例。'
- en: The [Combining Distributed DataParallel with Distributed RPC Framework](../advanced/rpc_ddp_tutorial.html)
tutorial demonstrates how to combine DDP with RPC to train a model using distributed
data parallelism combined with distributed model parallelism.
......@@ -345,13 +395,16 @@
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[将分布式数据并行与分布式RPC框架相结合的教程](../advanced/rpc_ddp_tutorial.html)演示了如何将DDP与RPC结合起来,使用分布式数据并行性和分布式模型并行性来训练模型。'
- en: PyTorch Distributed Developers
id: totrans-48
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: PyTorch分布式开发者
- en: If you’d like to contribute to PyTorch Distributed, please refer to our [Developer
Guide](https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md).
id: totrans-49
prefs: []
type: TYPE_NORMAL
zh: 如果您想为PyTorch分布式做出贡献,请参考我们的[开发者指南](https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md)。
- en: Distributed Data Parallel in PyTorch - Video Tutorials
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: PyTorch中的分布式数据并行 - 视频教程
- en: 原文:[https://pytorch.org/tutorials/beginner/ddp_series_intro.html](https://pytorch.org/tutorials/beginner/ddp_series_intro.html)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
zh: 原文:[https://pytorch.org/tutorials/beginner/ddp_series_intro.html](https://pytorch.org/tutorials/beginner/ddp_series_intro.html)
- en: '**Introduction** || [What is DDP](ddp_series_theory.html) || [Single-Node Multi-GPU
Training](ddp_series_multigpu.html) || [Fault Tolerance](ddp_series_fault_tolerance.html)
|| [Multi-Node training](../intermediate/ddp_series_multinode.html) || [minGPT
Training](../intermediate/ddp_series_minGPT.html)'
id: totrans-2
prefs: []
type: TYPE_NORMAL
zh: '**介绍** || [什么是DDP](ddp_series_theory.html) || [单节点多GPU训练](ddp_series_multigpu.html)
|| [容错性](ddp_series_fault_tolerance.html) || [多节点训练](../intermediate/ddp_series_multinode.html)
|| [minGPT训练](../intermediate/ddp_series_minGPT.html)'
- en: 'Authors: [Suraj Subramanian](https://github.com/suraj813)'
id: totrans-3
prefs: []
type: TYPE_NORMAL
zh: 作者:[Suraj Subramanian](https://github.com/suraj813)
- en: Follow along with the video below or on [youtube](https://www.youtube.com/watch/-K3bZYHYHEA).
id: totrans-4
prefs: []
type: TYPE_NORMAL
zh: 跟随下面的视频或在[youtube](https://www.youtube.com/watch/-K3bZYHYHEA)上观看。
- en: '[https://www.youtube.com/embed/-K3bZYHYHEA](https://www.youtube.com/embed/-K3bZYHYHEA)'
id: totrans-5
prefs: []
type: TYPE_NORMAL
zh: '[https://www.youtube.com/embed/-K3bZYHYHEA](https://www.youtube.com/embed/-K3bZYHYHEA)'
- en: This series of video tutorials walks you through distributed training in PyTorch
via DDP.
id: totrans-6
prefs: []
type: TYPE_NORMAL
zh: 这一系列视频教程将带您了解通过DDP在PyTorch中进行分布式训练。
- en: The series starts with a simple non-distributed training job, and ends with
deploying a training job across several machines in a cluster. Along the way,
you will also learn about [torchrun](https://pytorch.org/docs/stable/elastic/run.html)
for fault-tolerant distributed training.
id: totrans-7
prefs: []
type: TYPE_NORMAL
zh: 该系列从一个简单的非分布式训练作业开始,最终扩展为在集群中多台机器上部署训练作业。在此过程中,您还将了解用于容错分布式训练的[torchrun](https://pytorch.org/docs/stable/elastic/run.html)。
- en: The tutorial assumes a basic familiarity with model training in PyTorch.
id: totrans-8
prefs: []
type: TYPE_NORMAL
zh: 本教程假定您对PyTorch中的模型训练有基本的了解。
- en: Running the code
id: totrans-9
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 运行代码
- en: You will need multiple CUDA GPUs to run the tutorial code. Typically, this can
be done on a cloud instance with multiple GPUs (the tutorials use an Amazon EC2
P3 instance with 4 GPUs).
id: totrans-10
prefs: []
type: TYPE_NORMAL
zh: 您需要多个CUDA GPU来运行教程代码。通常可以在具有多个GPU的云实例上完成此操作(教程使用具有4个GPU的Amazon EC2 P3实例)。
- en: The tutorial code is hosted in this [github repo](https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series).
Clone the repository and follow along!
id: totrans-11
prefs: []
type: TYPE_NORMAL
zh: 教程代码托管在这个[github仓库](https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series)。克隆该仓库并跟随教程!
- en: Tutorial sections
id: totrans-12
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 教程部分
- en: Introduction (this page)
id: totrans-13
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 介绍(本页)
- en: '[What is DDP?](ddp_series_theory.html) Gently introduces what DDP is doing
under the hood'
id: totrans-14
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[DDP是什么?](ddp_series_theory.html) 浅显地介绍DDP的底层工作原理'
- en: '[Single-Node Multi-GPU Training](ddp_series_multigpu.html) Training models
using multiple GPUs on a single machine'
id: totrans-15
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[单节点多GPU训练](ddp_series_multigpu.html) 在单台机器上使用多个GPU训练模型'
- en: '[Fault-tolerant distributed training](ddp_series_fault_tolerance.html) Making
your distributed training job robust with torchrun'
id: totrans-16
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[容错分布式训练](ddp_series_fault_tolerance.html) 使用torchrun使您的分布式训练工作更加稳健'
- en: '[Multi-Node training](../intermediate/ddp_series_multinode.html) Training models
using multiple GPUs on multiple machines'
id: totrans-17
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[多节点训练](../intermediate/ddp_series_multinode.html) 使用多台机器上的多个GPU进行模型训练'
- en: '[Training a GPT model with DDP](../intermediate/ddp_series_minGPT.html) “Real-world”
example of training a [minGPT](https://github.com/karpathy/minGPT) model with
DDP'
id: totrans-18
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[使用DDP训练GPT模型](../intermediate/ddp_series_minGPT.html) 使用DDP训练[minGPT](https://github.com/karpathy/minGPT)模型的“真实世界”示例'
- en: Single-Machine Model Parallel Best Practices
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 单机模型并行最佳实践
- en: 原文:[https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html](https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
zh: 原文:[https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html](https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html)
- en: Note
id: totrans-2
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: Click [here](#sphx-glr-download-intermediate-model-parallel-tutorial-py) to
download the full example code
id: totrans-3
prefs: []
type: TYPE_NORMAL
zh: 点击[这里](#sphx-glr-download-intermediate-model-parallel-tutorial-py)下载完整示例代码
- en: '**Author**: [Shen Li](https://mrshenli.github.io/)'
id: totrans-4
prefs: []
type: TYPE_NORMAL
zh: '**作者**:[Shen Li](https://mrshenli.github.io/)'
- en: 'Model parallel is widely-used in distributed training techniques. Previous
posts have explained how to use [DataParallel](https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html)
to train a neural network on multiple GPUs; this feature replicates the same model
......@@ -27,8 +37,10 @@
the entire model on each GPU (to be concrete, say a model `m` contains 10 layers:
when using `DataParallel`, each GPU will have a replica of each of these 10 layers,
whereas when using model parallel on two GPUs, each GPU could host 5 layers).'
id: totrans-5
prefs: []
type: TYPE_NORMAL
zh: 模型并行在分布式训练技术中被广泛使用。先前的文章已经解释了如何使用[DataParallel](https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html)在多个GPU上训练神经网络;该功能将同一模型复制到所有GPU上,每个GPU处理输入数据的不同分区。虽然它可以显著加速训练过程,但当模型太大而无法放入单个GPU时,它就无法工作了。本文展示了如何使用**模型并行**来解决这个问题:与`DataParallel`不同,它将单个模型切分到不同的GPU上,而不是在每个GPU上复制整个模型(具体来说,假设模型`m`包含10层:使用`DataParallel`时,每个GPU都会持有这10层的副本;而在两个GPU上使用模型并行时,每个GPU可以只承载5层)。
- en: The high-level idea of model parallel is to place different sub-networks of
a model onto different devices, and implement the `forward` method accordingly
to move intermediate outputs across devices. As only part of a model operates
......@@ -36,84 +48,118 @@
In this post, we will not try to construct huge models and squeeze them into a
limited number of GPUs. Instead, this post focuses on showing the idea of model
parallel. It is up to the readers to apply the ideas to real-world applications.
id: totrans-6
prefs: []
type: TYPE_NORMAL
zh: 模型并行的高级思想是将模型的不同子网络放置到不同的设备上,并相应地实现`forward`方法,在设备之间传递中间输出。由于任何单个设备上只运行模型的一部分,一组设备可以共同服务于一个更大的模型。在本文中,我们不会尝试构建庞大的模型并将其塞进有限数量的GPU中;相反,本文侧重于展示模型并行的思想,读者可以自行将这些思想应用到实际应用中。
- en: Note
id: totrans-7
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: For distributed model parallel training where a model spans multiple servers,
please refer to [Getting Started With Distributed RPC Framework](rpc_tutorial.html)
for examples and details.
id: totrans-8
prefs: []
type: TYPE_NORMAL
zh: 对于跨多个服务器的分布式模型并行训练,请参考[使用分布式RPC框架入门](rpc_tutorial.html)以获取示例和详细信息。
- en: Basic Usage
id: totrans-9
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 基本用法
- en: Let us start with a toy model that contains two linear layers. To run this model
on two GPUs, simply put each linear layer on a different GPU, and move inputs
and intermediate outputs to match the layer devices accordingly.
id: totrans-10
prefs: []
type: TYPE_NORMAL
zh: 让我们从一个包含两个线性层的玩具模型开始。要在两个GPU上运行这个模型,只需将每个线性层放在不同的GPU上,并将输入和中间输出移动到匹配层设备的位置。
- en: '[PRE0]'
id: totrans-11
prefs: []
type: TYPE_PRE
zh: '[PRE0]'
- en: Note that, the above `ToyModel` looks very similar to how one would implement
it on a single GPU, except the four `to(device)` calls which place linear layers
and tensors on proper devices. That is the only place in the model that requires
changes. The `backward()` and `torch.optim` will automatically take care of gradients
as if the model is on one GPU. You only need to make sure that the labels are
on the same device as the outputs when calling the loss function.
id: totrans-12
prefs: []
type: TYPE_NORMAL
zh: 请注意,上面的`ToyModel`看起来与在单个GPU上实现它的方式非常相似,除了四个`to(device)`调用,这些调用将线性层和张量放置在适当的设备上。这是模型中唯一需要更改的地方。`backward()`和`torch.optim`将自动处理梯度,就好像模型在一个GPU上一样。您只需要确保在调用损失函数时标签与输出在同一设备上。
- en: '[PRE1]'
id: totrans-13
prefs: []
type: TYPE_PRE
zh: '[PRE1]'
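Since the tutorial's own listings are elided above as [PRE0] and [PRE1], here is a hedged reconstruction of the pattern they describe (two linear layers on cuda:0 and cuda:1, labels moved to the output device); treat it as an illustration rather than the exact tutorial code:

```python
import torch
import torch.nn as nn
import torch.optim as optim

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(10, 10).to("cuda:0")   # first layer lives on GPU 0
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5).to("cuda:1")    # second layer lives on GPU 1

    def forward(self, x):
        x = self.relu(self.net1(x.to("cuda:0")))
        return self.net2(x.to("cuda:1"))             # move intermediate output across GPUs

model = ToyModel()
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

optimizer.zero_grad()
outputs = model(torch.randn(20, 10))
labels = torch.randn(20, 5).to("cuda:1")             # labels on the same device as the outputs
loss_fn(outputs, labels).backward()
optimizer.step()
```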
- en: Apply Model Parallel to Existing Modules
id: totrans-14
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 将模型并行应用于现有模块
- en: It is also possible to run an existing single-GPU module on multiple GPUs with
just a few lines of changes. The code below shows how to decompose `torchvision.models.resnet50()`
to two GPUs. The idea is to inherit from the existing `ResNet` module, and split
the layers to two GPUs during construction. Then, override the `forward` method
to stitch two sub-networks by moving the intermediate outputs accordingly.
id: totrans-15
prefs: []
type: TYPE_NORMAL
zh: 也可以通过只更改几行代码在多个GPU上运行现有的单GPU模块。下面的代码显示了如何将`torchvision.models.resnet50()`分解为两个GPU。思路是继承现有的`ResNet`模块,并在构造过程中将层分割到两个GPU上。然后,重写`forward`方法,通过相应地移动中间输出来拼接两个子网络。
- en: '[PRE2]'
id: totrans-16
prefs: []
type: TYPE_PRE
zh: '[PRE2]'
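With the tutorial's listing elided as [PRE2], a hedged sketch of the splitting idea follows: inherit from torchvision's ResNet, put the first half of the layers on cuda:0 and the rest on cuda:1, and stitch them in `forward` (an illustration, not necessarily the exact tutorial code):

```python
import torch.nn as nn
from torchvision.models.resnet import Bottleneck, ResNet

class ModelParallelResNet50(ResNet):
    def __init__(self, num_classes=1000):
        super().__init__(Bottleneck, [3, 4, 6, 3], num_classes=num_classes)

        self.seq1 = nn.Sequential(
            self.conv1, self.bn1, self.relu, self.maxpool,
            self.layer1, self.layer2,
        ).to("cuda:0")                                # first half on GPU 0

        self.seq2 = nn.Sequential(
            self.layer3, self.layer4, self.avgpool,
        ).to("cuda:1")                                # second half on GPU 1
        self.fc.to("cuda:1")

    def forward(self, x):
        x = self.seq2(self.seq1(x).to("cuda:1"))      # copy intermediate output across GPUs
        return self.fc(x.view(x.size(0), -1))
```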
- en: The above implementation solves the problem for cases where the model is too
large to fit into a single GPU. However, you might have already noticed that it
will be slower than running it on a single GPU if your model fits. It is because,
at any point in time, only one of the two GPUs are working, while the other one
is sitting there doing nothing. The performance further deteriorates as the intermediate
outputs need to be copied from `cuda:0` to `cuda:1` between `layer2` and `layer3`.
id: totrans-17
prefs: []
type: TYPE_NORMAL
zh: 上述实现解决了模型过大无法适应单个GPU的情况。然而,您可能已经注意到,如果您的模型适合单个GPU,则运行速度会比在单个GPU上运行要慢。这是因为,在任何时候,只有两个GPU中的一个在工作,而另一个则闲置。性能进一步恶化,因为需要在`layer2`和`layer3`之间将中间输出从`cuda:0`复制到`cuda:1`。
- en: Let us run an experiment to get a more quantitative view of the execution time.
In this experiment, we train `ModelParallelResNet50` and the existing `torchvision.models.resnet50()`
by running random inputs and labels through them. After the training, the models
will not produce any useful predictions, but we can get a reasonable understanding
of the execution times.
id: totrans-18
prefs: []
type: TYPE_NORMAL
zh: 让我们进行一个实验,以更量化地了解执行时间。在这个实验中,我们通过将随机输入和标签传递给它们来训练`ModelParallelResNet50`和现有的`torchvision.models.resnet50()`。训练之后,模型将不会产生任何有用的预测,但我们可以对执行时间有一个合理的了解。
- en: '[PRE3]'
id: totrans-19
prefs: []
type: TYPE_PRE
zh: '[PRE3]'
- en: The `train(model)` method above uses `nn.MSELoss` as the loss function, and
`optim.SGD` as the optimizer. It mimics training on `128 X 128` images which are
organized into 3 batches where each batch contains 120 images. Then, we use `timeit`
to run the `train(model)` method 10 times and plot the execution times with standard
deviations.
id: totrans-20
prefs: []
type: TYPE_NORMAL
zh: '上面的`train(model)`方法使用`nn.MSELoss`作为损失函数,使用`optim.SGD`作为优化器。它模拟对`128 X 128`图像进行训练,这些图像被组织成3个批次,每个批次包含120张图像。然后,我们使用`timeit`运行`train(model)`方法10次,并绘制带有标准偏差的执行时间。'
- en: '[PRE4]'
id: totrans-21
prefs: []
type: TYPE_PRE
zh: '[PRE4]'
- en: '![](../Images/7f2d776cf49fcf3fd44fd84a238a3cc6.png)'
id: totrans-22
prefs: []
type: TYPE_IMG
zh: '![](../Images/7f2d776cf49fcf3fd44fd84a238a3cc6.png)'
- en: The result shows that the execution time of model parallel implementation is
`4.02/3.75-1=7%` longer than the existing single-GPU implementation. So we can
conclude there is roughly 7% overhead in copying tensors back and forth across
......@@ -122,20 +168,28 @@
into a pipeline of splits, such that when one split reaches the second sub-network,
the following split can be fed into the first sub-network. In this way, two consecutive
splits can run concurrently on two GPUs.
id: totrans-23
prefs: []
type: TYPE_NORMAL
zh: 结果显示,模型并行实现的执行时间比现有的单GPU实现长了`4.02/3.75-1=7%`。因此,我们可以得出结论,在跨GPU传输张量时大约有7%的开销。还有改进的空间,因为我们知道两个GPU中的一个在整个执行过程中处于空闲状态。一种选择是将每个批次进一步分成一系列分割的管道,这样当一个分割到达第二个子网络时,接下来的分割可以被送入第一个子网络。这样,两个连续的分割可以在两个GPU上同时运行。
- en: Speed Up by Pipelining Inputs
id: totrans-24
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 通过流水线输入加速
- en: In the following experiments, we further divide each 120-image batch into 20-image
splits. As PyTorch launches CUDA operations asynchronously, the implementation
does not need to spawn multiple threads to achieve concurrency.
id: totrans-25
prefs: []
type: TYPE_NORMAL
zh: 在以下实验中,我们将每个120张图像批次进一步分成20张图像的拆分。由于PyTorch异步启动CUDA操作,实现不需要生成多个线程来实现并发。
- en: '[PRE5]'
id: totrans-26
prefs: []
type: TYPE_PRE
zh: '[PRE5]'
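Since the pipelined listing is elided as [PRE5], here is a hedged sketch of the idea: feed micro-batches so that cuda:1 works on split i while cuda:0 already processes split i+1 (asynchronous CUDA launches provide the overlap); `pipelined_forward`, `stage0`, and `stage1` are illustrative names:

```python
import torch

def pipelined_forward(stage0, stage1, x, split_size):
    """stage0 lives on cuda:0, stage1 on cuda:1; x is split along the batch dimension."""
    splits = iter(x.split(split_size, dim=0))
    s_next = next(splits)
    s_prev = stage0(s_next.to("cuda:0")).to("cuda:1")      # prime the pipeline
    outputs = []
    for s_next in splits:
        outputs.append(stage1(s_prev))                     # runs on cuda:1 ...
        s_prev = stage0(s_next.to("cuda:0")).to("cuda:1")  # ... while cuda:0 handles the next split
    outputs.append(stage1(s_prev))
    return torch.cat(outputs)
```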
- en: Please note, device-to-device tensor copy operations are synchronized on current
streams on the source and the destination devices. If you create multiple streams,
you have to make sure that copy operations are properly synchronized. Writing
......@@ -143,11 +197,15 @@
copy operation can lead to undefined behavior. The above implementation only uses
default streams on both source and destination devices, hence it is not necessary
to enforce additional synchronizations.
id: totrans-27
prefs: []
type: TYPE_NORMAL
zh: 请注意,设备之间的张量复制操作在源设备和目标设备上的当前流上是同步的。如果您创建多个流,您必须确保复制操作得到适当的同步。在完成复制操作之前写入源张量或读取/写入目标张量可能导致未定义的行为。上述实现仅在源设备和目标设备上使用默认流,因此不需要强制执行额外的同步。
- en: '![](../Images/48d2e67f025b05eeb9259e249566add3.png)'
id: totrans-28
prefs: []
type: TYPE_IMG
zh: '![](../Images/48d2e67f025b05eeb9259e249566add3.png)'
- en: The experiment result shows that, pipelining inputs to model parallel ResNet50
speeds up the training process by roughly `3.75/2.51-1=49%`. It is still quite
far away from the ideal 100% speedup. As we have introduced a new parameter `split_sizes`
......@@ -157,12 +215,16 @@
long idle times during the first and last splits. Neither are optimal. There might
be an optimal `split_size` configuration for this specific experiment. Let us
try to find it by running experiments using several different `split_size` values.
id: totrans-29
prefs: []
type: TYPE_NORMAL
- en: '[PRE6]'
id: totrans-30
prefs: []
type: TYPE_PRE
zh: '[PRE6]'
- en: '![](../Images/9d53a7aba4b9016ea39aa794905ee059.png)'
id: totrans-31
prefs: []
type: TYPE_IMG
zh: '![](../Images/9d53a7aba4b9016ea39aa794905ee059.png)'
- en: The result shows that setting `split_size` to 12 achieves the fastest training
......@@ -175,27 +237,41 @@
both GPUs, and different sub-network structures require different stream management
strategies. As no general multi-stream solution works for all model parallel use
cases, we will not discuss it in this tutorial.
id: totrans-32
prefs: []
type: TYPE_NORMAL
zh: 结果显示,将`split_size`设置为12可以实现最快的训练速度,从而导致`3.75/2.43-1=54%`的加速。仍然有机会进一步加快训练过程。例如,所有在`cuda:0`上的操作都放在其默认流中。这意味着下一个分割的计算不能与`prev`分割的复制操作重叠。然而,由于`prev`和下一个分割是不同的张量,因此可以将一个的计算与另一个的复制重叠。实现需要在两个GPU上使用多个流,不同的子网络结构需要不同的流管理策略。由于没有通用的多流解决方案适用于所有模型并行使用情况,我们在本教程中不会讨论这个问题。
- en: '**Note:**'
id: totrans-33
prefs: []
type: TYPE_NORMAL
zh: '**注意:**'
- en: This post shows several performance measurements. You might see different numbers
when running the same code on your own machine, because the result depends on
the underlying hardware and software. To get the best performance for your environment,
a proper approach is to first generate the curve to figure out the best split
size, and then use that split size to pipeline inputs.
id: totrans-34
prefs: []
type: TYPE_NORMAL
zh: 本文展示了几个性能测量。当在您自己的机器上运行相同的代码时,您可能会看到不同的数字,因为结果取决于底层硬件和软件。为了在您的环境中获得最佳性能,一个正确的方法是首先生成曲线以找出最佳的拆分大小,然后使用该拆分大小来流水线输入。
- en: '**Total running time of the script:** ( 5 minutes 48.653 seconds)'
id: totrans-35
prefs: []
type: TYPE_NORMAL
zh: '**脚本的总运行时间:**(5分钟48.653秒)'
- en: '[`Download Python source code: model_parallel_tutorial.py`](../_downloads/84ab670fda2216116ac8e3ecd5805f0b/model_parallel_tutorial.py)'
id: totrans-36
prefs: []
type: TYPE_NORMAL
zh: '[`下载Python源代码:model_parallel_tutorial.py`](../_downloads/84ab670fda2216116ac8e3ecd5805f0b/model_parallel_tutorial.py)'
- en: '[`Download Jupyter notebook: model_parallel_tutorial.ipynb`](../_downloads/03a48646520c277662581e858e680809/model_parallel_tutorial.ipynb)'
id: totrans-37
prefs: []
type: TYPE_NORMAL
zh: '[`下载Jupyter笔记本:model_parallel_tutorial.ipynb`](../_downloads/03a48646520c277662581e858e680809/model_parallel_tutorial.ipynb)'
- en: '[Gallery generated by Sphinx-Gallery](https://sphinx-gallery.github.io)'
id: totrans-38
prefs: []
type: TYPE_NORMAL
zh: '[由Sphinx-Gallery生成的图库](https://sphinx-gallery.github.io)'