Commit 8d4a382e authored by 绝不原创的飞龙

2024-02-04 14:59:22

Parent: a60b5fd8
......@@ -218,6 +218,7 @@
id: totrans-29
prefs: []
type: TYPE_NORMAL
zh: 实验结果显示,将输入流水线传输到模型并行的ResNet50可以将训练过程加速大约`3.75/2.51-1=49%`。这仍然远远落后于理想的100%加速。由于我们在管道并行实现中引入了一个新参数`split_sizes`,目前还不清楚这个新参数如何影响整体训练时间。直觉上,使用较小的`split_size`会导致许多小的CUDA内核启动,而使用较大的`split_size`会导致在第一个和最后一个分割期间相对较长的空闲时间。两者都不是最佳选择。对于这个特定实验,可能存在一个最佳的`split_size`配置。让我们通过运行使用几个不同`split_size`值的实验来尝试找到它。
- en: '[PRE6]'
id: totrans-30
prefs: []
......@@ -227,6 +228,7 @@
id: totrans-31
prefs: []
type: TYPE_IMG
zh: 把这个文件夹拖到另一个文件夹中。
- en: The result shows that setting `split_size` to 12 achieves the fastest training
speed, which leads to `3.75/2.43-1=54%` speedup. There are still opportunities
to further accelerate the training process. For example, all operations on `cuda:0`
......
- en: Getting Started with Distributed Data Parallel
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 开始使用分布式数据并行
- en: 原文:[https://pytorch.org/tutorials/intermediate/ddp_tutorial.html](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
zh: '[https://pytorch.org/tutorials/intermediate/ddp_tutorial.html](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)'
- en: '**Author**: [Shen Li](https://mrshenli.github.io/)'
id: totrans-2
prefs: []
type: TYPE_NORMAL
zh: '**作者**:[Shen Li](https://mrshenli.github.io/)'
- en: '**Edited by**: [Joe Zhu](https://github.com/gunandrose4u)'
id: totrans-3
prefs: []
type: TYPE_NORMAL
zh: '**编辑者**:[Joe Zhu](https://github.com/gunandrose4u)'
- en: Note
  id: totrans-4
  prefs: []
  type: TYPE_NORMAL
  zh: 注意
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png) View and edit this
  tutorial in [github](https://github.com/pytorch/tutorials/blob/main/intermediate_source/ddp_tutorial.rst).'
  id: totrans-5
  prefs: []
  type: TYPE_NORMAL
  zh: 在[github](https://github.com/pytorch/tutorials/blob/main/intermediate_source/ddp_tutorial.rst)中查看并编辑此教程。
- en: 'Prerequisites:'
id: totrans-6
prefs: []
type: TYPE_NORMAL
zh: '先决条件:'
- en: '[PyTorch Distributed Overview](../beginner/dist_overview.html)'
id: totrans-7
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[PyTorch 分布式概述](../beginner/dist_overview.html)'
- en: '[DistributedDataParallel API documents](https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html)'
id: totrans-8
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[DistributedDataParallel API 文档](https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html)'
- en: '[DistributedDataParallel notes](https://pytorch.org/docs/master/notes/ddp.html)'
id: totrans-9
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[DistributedDataParallel 笔记](https://pytorch.org/docs/master/notes/ddp.html)'
- en: '[DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#module-torch.nn.parallel)
(DDP) implements data parallelism at the module level which can run across multiple
machines. Applications using DDP should spawn multiple processes and create a
......@@ -44,73 +64,103 @@
DDP uses that signal to trigger gradient synchronization across processes. Please
refer to [DDP design note](https://pytorch.org/docs/master/notes/ddp.html) for
more details.'
id: totrans-10
prefs: []
type: TYPE_NORMAL
zh: '[DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#module-torch.nn.parallel)(DDP)在模块级别实现了数据并行,可以在多台机器上运行。使用DDP的应用程序应该生成多个进程,并为每个进程创建一个单独的DDP实例。DDP使用[torch.distributed](https://pytorch.org/tutorials/intermediate/dist_tuto.html)包中的集体通信来同步梯度和缓冲区。更具体地说,DDP为`model.parameters()`给定的每个参数注册一个自动求导钩子,当在反向传播中计算相应的梯度时,该钩子将触发。然后DDP使用该信号来触发跨进程的梯度同步。更多详细信息请参考[DDP设计说明](https://pytorch.org/docs/master/notes/ddp.html)。'
- en: The recommended way to use DDP is to spawn one process for each model replica,
where a model replica can span multiple devices. DDP processes can be placed on
the same machine or across machines, but GPU devices cannot be shared across processes.
This tutorial starts from a basic DDP use case and then demonstrates more advanced
use cases including checkpointing models and combining DDP with model parallel.
id: totrans-11
prefs: []
type: TYPE_NORMAL
zh: 使用DDP的推荐方式是为每个模型副本生成一个进程,其中一个模型副本可以跨多个设备。DDP进程可以放置在同一台机器上或跨多台机器,但GPU设备不能在进程之间共享。本教程从基本的DDP用例开始,然后演示更高级的用例,包括模型检查点和将DDP与模型并行结合使用。
- en: Note
id: totrans-12
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: The code in this tutorial runs on an 8-GPU server, but it can be easily generalized
to other environments.
id: totrans-13
prefs: []
type: TYPE_NORMAL
zh: 本教程中的代码在一个8-GPU服务器上运行,但可以很容易地推广到其他环境。
- en: Comparison between `DataParallel` and `DistributedDataParallel`
id: totrans-14
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: '`DataParallel`和`DistributedDataParallel`之间的比较'
- en: 'Before we dive in, let’s clarify why, despite the added complexity, you would
consider using `DistributedDataParallel` over `DataParallel`:'
id: totrans-15
prefs: []
type: TYPE_NORMAL
zh: 在我们深入讨论之前,让我们澄清一下为什么尽管增加了复杂性,你会考虑使用`DistributedDataParallel`而不是`DataParallel`:
- en: First, `DataParallel` is single-process, multi-thread, and only works on a single
machine, while `DistributedDataParallel` is multi-process and works for both single-
and multi- machine training. `DataParallel` is usually slower than `DistributedDataParallel`
even on a single machine due to GIL contention across threads, per-iteration replicated
model, and additional overhead introduced by scattering inputs and gathering outputs.
id: totrans-16
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: 首先,`DataParallel` 是单进程、多线程的,仅适用于单台机器,而 `DistributedDataParallel` 是多进程的,适用于单机和多机训练。由于线程之间的
GIL 冲突、每次迭代复制模型以及输入散布和输出聚集引入的额外开销,即使在单台机器上,`DataParallel` 通常比 `DistributedDataParallel`
慢。
- en: Recall from the [prior tutorial](https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html)
that if your model is too large to fit on a single GPU, you must use **model parallel**
to split it across multiple GPUs. `DistributedDataParallel` works with **model
parallel**; `DataParallel` does not at this time. When DDP is combined with model
parallel, each DDP process would use model parallel, and all processes collectively
would use data parallel.
id: totrans-17
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: 回想一下从[之前的教程](https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html)中得知,如果你的模型太大无法放入单个GPU中,你必须使用**模型并行**将其分割到多个GPU上。`DistributedDataParallel`与**模型并行**一起工作;`DataParallel`目前不支持。当DDP与模型并行结合时,每个DDP进程都会使用模型并行,所有进程共同使用数据并行。
- en: If your model needs to span multiple machines or if your use case does not fit
into data parallelism paradigm, please see [the RPC API](https://pytorch.org/docs/stable/rpc.html)
for more generic distributed training support.
id: totrans-18
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: 如果您的模型需要跨多台机器,或者您的用例不适合数据并行范式,请参阅[RPC API](https://pytorch.org/docs/stable/rpc.html)以获取更通用的分布式训练支持。
- en: Basic Use Case
id: totrans-19
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 基本用例
- en: To create a DDP module, you must first set up process groups properly. More
details can be found in [Writing Distributed Applications with PyTorch](https://pytorch.org/tutorials/intermediate/dist_tuto.html).
id: totrans-20
prefs: []
type: TYPE_NORMAL
zh: 要创建一个DDP模块,你必须首先正确设置进程组。更多细节可以在[使用PyTorch编写分布式应用程序](https://pytorch.org/tutorials/intermediate/dist_tuto.html)中找到。
- en: '[PRE0]'
id: totrans-21
prefs: []
type: TYPE_PRE
zh: '[PRE0]'
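The process-group setup code itself is elided above as `[PRE0]`. The following is a minimal sketch of what such setup typically looks like, assuming a single-node run; the `localhost`/`12355` rendezvous values are placeholders, not taken from the tutorial.

```python
import os
import torch.distributed as dist

def setup(rank, world_size):
    # Placeholder rendezvous address/port for a single-node run.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    # nccl is the usual backend for GPU training; gloo also works on CPU.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()
```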
- en: Now, let’s create a toy module, wrap it with DDP, and feed it some dummy input
data. Please note, as DDP broadcasts model states from rank 0 process to all other
processes in the DDP constructor, you do not need to worry about different DDP
processes starting from different initial model parameter values.
id: totrans-22
prefs: []
type: TYPE_NORMAL
zh: 现在,让我们创建一个玩具模块,用DDP包装它,并提供一些虚拟输入数据。请注意,由于DDP在构造函数中从rank 0进程向所有其他进程广播模型状态,您不需要担心不同的DDP进程从不同的初始模型参数值开始。
- en: '[PRE1]'
id: totrans-23
prefs: []
type: TYPE_PRE
zh: '[PRE1]'
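As a rough illustration of the paragraph above (not the tutorial's elided `[PRE1]` code), a toy model can be wrapped in DDP and trained for one step as follows; `setup`/`cleanup` are the helpers sketched earlier, and the layer sizes are made up.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

def demo_basic(rank, world_size):
    setup(rank, world_size)                  # helpers from the sketch above
    model = ToyModel().to(rank)
    # DDP broadcasts rank 0's parameters to every process in its constructor.
    ddp_model = DDP(model, device_ids=[rank])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10).to(rank))
    labels = torch.randn(20, 5).to(rank)
    loss_fn(outputs, labels).backward()      # gradients are all-reduced here
    optimizer.step()
    cleanup()
```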
- en: As you can see, DDP wraps lower-level distributed communication details and
provides a clean API as if it were a local model. Gradient synchronization communications
take place during the backward pass and overlap with the backward computation.
......@@ -118,12 +168,16 @@
gradient tensor. For basic use cases, DDP only requires a few more LoCs to set
up the process group. When applying DDP to more advanced use cases, some caveats
require caution.
id: totrans-24
prefs: []
type: TYPE_NORMAL
zh: 正如您所看到的,DDP封装了较低级别的分布式通信细节,并提供了一个干净的API,就像它是一个本地模型一样。梯度同步通信发生在反向传播过程中,并与反向计算重叠。当`backward()`返回时,`param.grad`已经包含了同步的梯度张量。对于基本用例,DDP只需要几行额外的代码来设置进程组。当将DDP应用于更高级的用例时,一些注意事项需要谨慎处理。
- en: Skewed Processing Speeds
id: totrans-25
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 处理速度不均衡
- en: In DDP, the constructor, the forward pass, and the backward pass are distributed
synchronization points. Different processes are expected to launch the same number
of synchronizations and reach these synchronization points in the same order and
......@@ -133,12 +187,16 @@
skewed processing speeds are inevitable due to, e.g., network delays, resource
contentions, or unpredictable workload spikes. To avoid timeouts in these situations,
make sure that you pass a sufficiently large `timeout` value when calling [init_process_group](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group).
id: totrans-26
prefs: []
type: TYPE_NORMAL
zh: 在DDP中,构造函数、前向传递和后向传递是分布式同步点。预期不同的进程将启动相同数量的同步,并按相同顺序到达这些同步点,并在大致相同的时间进入每个同步点。否则,快速进程可能会提前到达并在等待滞后者时超时。因此,用户负责在进程之间平衡工作负载分布。有时,由于网络延迟、资源竞争或不可预测的工作负载波动等原因,不可避免地会出现处理速度不均衡的情况。为了避免在这些情况下超时,请确保在调用[init_process_group](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group)时传递一个足够大的`timeout`值。
- en: Save and Load Checkpoints
id: totrans-27
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 保存和加载检查点
- en: It’s common to use `torch.save` and `torch.load` to checkpoint modules during
training and recover from checkpoints. See [SAVING AND LOADING MODELS](https://pytorch.org/tutorials/beginner/saving_loading_models.html)
for more details. When using DDP, one optimization is to save the model in only
......@@ -153,72 +211,110 @@
saved, which would result in all processes on the same machine using the same
set of devices. For more advanced failure recovery and elasticity support, please
refer to [TorchElastic](https://pytorch.org/elastic).
id: totrans-28
prefs: []
type: TYPE_NORMAL
zh: 在训练过程中,通常使用`torch.save`和`torch.load`来对模块进行检查点,并从检查点中恢复。有关更多详细信息,请参阅[SAVING AND
LOADING MODELS](https://pytorch.org/tutorials/beginner/saving_loading_models.html)。在使用DDP时,一种优化是在一个进程中保存模型,然后加载到所有进程中,减少写入开销。这是正确的,因为所有进程都从相同的参数开始,并且在反向传递中梯度是同步的,因此优化器应该保持将参数设置为相同的值。如果使用此优化,请确保在保存完成之前没有进程开始加载。此外,在加载模块时,您需要提供一个适当的`map_location`参数,以防止一个进程进入其他设备。如果缺少`map_location`,`torch.load`将首先将模块加载到CPU,然后将每个参数复制到保存的位置,这将导致同一台机器上的所有进程使用相同的设备集。有关更高级的故障恢复和弹性支持,请参阅[TorchElastic](https://pytorch.org/elastic)。
- en: '[PRE2]'
id: totrans-29
prefs: []
type: TYPE_PRE
zh: '[PRE2]'
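A hedged sketch of the save/load pattern described above; the checkpoint path and the `cuda:0 -> cuda:rank` device mapping are illustrative assumptions.

```python
import torch
import torch.distributed as dist

CHECKPOINT_PATH = "/tmp/ddp_checkpoint.pt"        # illustrative path

def checkpoint_demo(ddp_model, rank):
    if rank == 0:
        # One writer is enough: every replica holds identical parameters.
        torch.save(ddp_model.state_dict(), CHECKPOINT_PATH)
    # Make sure saving has finished before any process starts loading.
    dist.barrier()
    # Remap tensors saved from cuda:0 onto this process's own GPU.
    map_location = {"cuda:0": f"cuda:{rank}"}
    ddp_model.load_state_dict(
        torch.load(CHECKPOINT_PATH, map_location=map_location))
```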
- en: Combining DDP with Model Parallelism
id: totrans-30
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 将DDP与模型并行结合起来
- en: DDP also works with multi-GPU models. DDP wrapping multi-GPU models is especially
helpful when training large models with a huge amount of data.
id: totrans-31
prefs: []
type: TYPE_NORMAL
zh: DDP也适用于多GPU模型。在训练大型模型和大量数据时,DDP包装多GPU模型尤其有帮助。
- en: '[PRE3]'
id: totrans-32
prefs: []
type: TYPE_PRE
zh: '[PRE3]'
- en: When passing a multi-GPU model to DDP, `device_ids` and `output_device` must
NOT be set. Input and output data will be placed in proper devices by either the
application or the model `forward()` method.
id: totrans-33
prefs: []
type: TYPE_NORMAL
zh: 当将多GPU模型传递给DDP时,不能设置`device_ids`和`output_device`。输入和输出数据将由应用程序或模型的`forward()`方法放置在适当的设备上。
- en: '[PRE4]'
id: totrans-34
prefs: []
type: TYPE_PRE
zh: '[PRE4]'
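A combined sketch of the two points above: a toy model split across two GPUs, then wrapped in DDP without `device_ids`/`output_device`. The model layout and sizes are made up for illustration; `setup`/`cleanup` come from the earlier sketch.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class ToyMpModel(nn.Module):
    """A toy model whose two halves live on different GPUs."""
    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.net1 = nn.Linear(10, 10).to(dev0)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5).to(dev1)

    def forward(self, x):
        x = self.relu(self.net1(x.to(self.dev0)))
        return self.net2(x.to(self.dev1))

def demo_model_parallel(rank, world_size):
    setup(rank, world_size)
    # Each DDP process owns two GPUs; device_ids/output_device stay unset.
    dev0, dev1 = rank * 2, rank * 2 + 1
    ddp_mp_model = DDP(ToyMpModel(dev0, dev1))
    out = ddp_mp_model(torch.randn(20, 10))   # output ends up on dev1
    out.sum().backward()
    cleanup()
```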
- en: Initialize DDP with torch.distributed.run/torchrun
id: totrans-35
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 使用torch.distributed.run/torchrun初始化DDP
- en: We can leverage PyTorch Elastic to simplify the DDP code and initialize the
job more easily. Let’s still use the Toymodel example and create a file named
`elastic_ddp.py`.
id: totrans-36
prefs: []
type: TYPE_NORMAL
zh: 我们可以利用PyTorch Elastic来简化DDP代码并更轻松地初始化作业。让我们仍然使用Toymodel示例并创建一个名为`elastic_ddp.py`的文件。
- en: '[PRE5]'
id: totrans-37
prefs: []
type: TYPE_PRE
zh: '[PRE5]'
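A sketch of what an `elastic_ddp.py` driven by torchrun might look like; the key point is that `init_process_group` picks up rank and world size from the environment variables torchrun exports (`RANK`, `WORLD_SIZE`, `LOCAL_RANK`, `MASTER_ADDR`, `MASTER_PORT`). The model and sizes are illustrative.

```python
# elastic_ddp.py (sketch)
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def demo_basic():
    # env:// rendezvous: rank/world size come from torchrun's environment.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    model = nn.Linear(10, 5).to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])
    loss = ddp_model(torch.randn(20, 10).to(local_rank)).sum()
    loss.backward()
    dist.destroy_process_group()

if __name__ == "__main__":
    demo_basic()
```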
- en: 'One can then run a [torch elastic/torchrun](https://pytorch.org/docs/stable/elastic/quickstart.html)
command on all nodes to initialize the DDP job created above:'
id: totrans-38
prefs: []
type: TYPE_NORMAL
zh: 然后可以在所有节点上运行 [torch elastic/torchrun](https://pytorch.org/docs/stable/elastic/quickstart.html)
命令来初始化上面创建的DDP作业:
- en: '[PRE6]'
id: totrans-39
prefs: []
type: TYPE_PRE
zh: '[PRE6]'
- en: We are running the DDP script on two hosts, and each host we run with 8 processes,
aka, we are running it on 16 GPUs. Note that `$MASTER_ADDR` must be the same across
all nodes.
id: totrans-40
prefs: []
type: TYPE_NORMAL
zh: 我们在两台主机上运行DDP脚本,每台主机运行8个进程,也就是说我们在16个GPU上运行它。请注意,`$MASTER_ADDR`在所有节点上必须相同。
- en: Here torchrun will launch 8 process and invoke `elastic_ddp.py` on each process
on the node it is launched on, but user also needs to apply cluster management
tools like slurm to actually run this command on 2 nodes.
id: totrans-41
prefs: []
type: TYPE_NORMAL
zh: torchrun将启动8个进程,并在启动它的节点上的每个进程上调用`elastic_ddp.py`,但用户还需要应用类似slurm的集群管理工具来实际在2个节点上运行此命令。
- en: 'For example, on a SLURM enabled cluster, we can write a script to run the command
above and set `MASTER_ADDR` as:'
id: totrans-42
prefs: []
type: TYPE_NORMAL
zh: 例如,在启用了SLURM的集群上,我们可以编写一个脚本来运行上面的命令,并将`MASTER_ADDR`设置为:
- en: '[PRE7]'
id: totrans-43
prefs: []
type: TYPE_PRE
zh: '[PRE7]'
- en: 'Then we can just run this script using the SLURM command: `srun --nodes=2 ./torchrun_script.sh`.
Of course, this is just an example; you can choose your own cluster scheduling
tools to initiate the torchrun job.'
id: totrans-44
prefs: []
type: TYPE_NORMAL
zh: 然后我们可以使用SLURM命令运行此脚本:`srun --nodes=2 ./torchrun_script.sh`。当然,这只是一个例子;您可以选择自己的集群调度工具来启动torchrun作业。
- en: For more information about Elastic run, one can check this [quick start document](https://pytorch.org/docs/stable/elastic/quickstart.html)
to learn more.
id: totrans-45
prefs: []
type: TYPE_NORMAL
zh: 关于Elastic run 的更多信息,可以查看这个[快速入门文档](https://pytorch.org/docs/stable/elastic/quickstart.html)以了解更多。
- en: Customize Process Group Backends Using Cpp Extensions
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 使用Cpp扩展自定义进程组后端
- en: 原文:[https://pytorch.org/tutorials/intermediate/process_group_cpp_extension_tutorial.html](https://pytorch.org/tutorials/intermediate/process_group_cpp_extension_tutorial.html)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
zh: '[https://pytorch.org/tutorials/intermediate/process_group_cpp_extension_tutorial.html](https://pytorch.org/tutorials/intermediate/process_group_cpp_extension_tutorial.html)'
- en: '**Author**: Howard Huang <https://github.com/H-Huang>, [Feng Tian](https://github.com/ftian1),
[Shen Li](https://mrshenli.github.io/), [Min Si](https://minsii.github.io/)'
id: totrans-2
prefs: []
type: TYPE_NORMAL
zh: '**作者**:Howard Huang <https://github.com/H-Huang>,[Feng Tian](https://github.com/ftian1),[Shen
Li](https://mrshenli.github.io/),[Min Si](https://minsii.github.io/)'
- en: Note
  id: totrans-3
  prefs: []
  type: TYPE_NORMAL
  zh: 注意
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png) View and edit this
  tutorial in [github](https://github.com/pytorch/tutorials/blob/main/intermediate_source/process_group_cpp_extension_tutorial.rst).'
  id: totrans-4
  prefs: []
  type: TYPE_NORMAL
  zh: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png) 在[github](https://github.com/pytorch/tutorials/blob/main/intermediate_source/process_group_cpp_extension_tutorial.rst)上查看并编辑本教程。'
- en: 'Prerequisites:'
id: totrans-5
prefs: []
type: TYPE_NORMAL
zh: '先决条件:'
- en: '[PyTorch Distributed Overview](../beginner/dist_overview.html)'
id: totrans-6
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[PyTorch 分布式概述](../beginner/dist_overview.html)'
- en: '[PyTorch Collective Communication Package](https://pytorch.org/docs/stable/distributed.html)'
id: totrans-7
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[PyTorch 集体通信包](https://pytorch.org/docs/stable/distributed.html)'
- en: '[PyTorch Cpp Extension](https://pytorch.org/docs/stable/cpp_extension.html)'
id: totrans-8
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[PyTorch Cpp 扩展](https://pytorch.org/docs/stable/cpp_extension.html)'
- en: '[Writing Distributed Applications with PyTorch](https://pytorch.org/tutorials/intermediate/dist_tuto.html)'
id: totrans-9
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[使用 PyTorch 编写分布式应用程序](https://pytorch.org/tutorials/intermediate/dist_tuto.html)'
- en: This tutorial demonstrates how to implement a custom `Backend` and plug that
into [PyTorch distributed package](https://pytorch.org/docs/stable/distributed.html)
using [cpp extensions](https://pytorch.org/docs/stable/cpp_extension.html). This
is helpful when you need a specialized software stack for your hardware, or when
you would like to experiment with new collective communication algorithms.
id: totrans-10
prefs: []
type: TYPE_NORMAL
zh: 本教程演示了如何实现一个自定义的`Backend`并将其插入[PyTorch分布式包](https://pytorch.org/docs/stable/distributed.html),使用[cpp扩展](https://pytorch.org/docs/stable/cpp_extension.html)。当您需要为硬件定制专门的软件堆栈,或者想要尝试新的集体通信算法时,这将非常有帮助。
- en: Basics
id: totrans-11
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 基础知识
- en: PyTorch collective communications power several widely adopted distributed training
features, including [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html),
[ZeroRedundancyOptimizer](https://pytorch.org/docs/stable/distributed.optim.html#torch.distributed.optim.ZeroRedundancyOptimizer),
......@@ -64,19 +90,26 @@
[Reduction Server](https://cloud.google.com/blog/topics/developers-practitioners/optimize-training-performance-reduction-server-vertex-ai)).
Therefore, the distributed package exposes extension APIs to allow customizing
collective communication backends.
id: totrans-12
prefs: []
type: TYPE_NORMAL
zh: PyTorch集体通信支持多种广泛采用的分布式训练功能,包括[DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html),[ZeroRedundancyOptimizer](https://pytorch.org/docs/stable/distributed.optim.html#torch.distributed.optim.ZeroRedundancyOptimizer),[FullyShardedDataParallel](https://github.com/pytorch/pytorch/blob/master/torch/distributed/_fsdp/fully_sharded_data_parallel.py)。为了使相同的集体通信API能够与不同的通信后端一起工作,分布式包将集体通信操作抽象为[Backend](https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/Backend.hpp)类。不同的后端可以作为`Backend`的子类使用首选的第三方库来实现。PyTorch分布式带有三个默认后端,`ProcessGroupNCCL`,`ProcessGroupGloo`和`ProcessGroupMPI`。然而,除了这三个后端之外,还有其他通信库(例如[UCC](https://github.com/openucx/ucc),[OneCCL](https://github.com/oneapi-src/oneCCL)),不同类型的硬件(例如[TPU](https://cloud.google.com/tpu),[Trainum](https://aws.amazon.com/machine-learning/trainium/))和新兴的通信算法(例如[Herring](https://www.amazon.science/publications/herring-rethinking-the-parameter-server-at-scale-for-the-cloud),[Reduction
Server](https://cloud.google.com/blog/topics/developers-practitioners/optimize-training-performance-reduction-server-vertex-ai))。因此,分布式包提供了扩展API来允许定制集体通信后端。
- en: The 4 steps below show how to implement a dummy `Backend` backend and use that
in Python application code. Please note that this tutorial focuses on demonstrating
the extension APIs, instead of developing a functioning communication backend.
Hence, the `dummy` backend just covers a subset of the APIs (`all_reduce` and
`all_gather`), and simply sets the values of tensors to 0.
id: totrans-13
prefs: []
type: TYPE_NORMAL
zh: 以下4个步骤展示了如何在Python应用程序代码中实现一个虚拟的`Backend`后端并使用它。请注意,本教程侧重于演示扩展API,而不是开发一个功能完善的通信后端。因此,`dummy`后端只涵盖了API的一个子集(`all_reduce`和`all_gather`),并且只是将张量的值设置为0。
- en: 'Step 1: Implement a Subclass of `Backend`'
id: totrans-14
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 步骤1:实现`Backend`的子类
- en: This first step is to implement a `Backend` subclass that overrides target collective
communication APIs and runs the custom communication algorithm. The extension
also needs to implement a `Work` subclass, which serves as a future of communication
......@@ -85,68 +118,102 @@
APIs from the `BackendDummy` subclass. The two code snippets below present the
implementation of `dummy.h` and `dummy.cpp`. See the [dummy collectives](https://github.com/H-Huang/torch_collective_extension)
repository for the full implementation.
id: totrans-15
prefs: []
type: TYPE_NORMAL
zh: 第一步是实现一个`Backend`子类,覆盖目标集体通信API,并运行自定义通信算法。扩展还需要实现一个`Work`子类,作为通信结果的future,并允许在应用代码中异步执行。如果扩展使用第三方库,可以在`BackendDummy`子类中包含头文件并调用库API。下面的两个代码片段展示了`dummy.h`和`dummy.cpp`的实现。请查看[dummy
collectives](https://github.com/H-Huang/torch_collective_extension)存储库以获取完整的实现。
- en: '[PRE0]'
id: totrans-16
prefs: []
type: TYPE_PRE
zh: '[PRE0]'
- en: '[PRE1]'
id: totrans-17
prefs: []
type: TYPE_PRE
zh: '[PRE1]'
- en: 'Step 2: Expose The Extension Python APIs'
id: totrans-18
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 步骤2:暴露扩展Python API
- en: The backend constructors are called [from Python side](https://github.com/pytorch/pytorch/blob/v1.9.0/torch/distributed/distributed_c10d.py#L643-L650),
so the extension also needs to expose the constructor APIs to Python. This can
be done by adding the following methods. In this example, `store` and `timeout`
are ignored by the `BackendDummy` instantiation method, as those are not used
in this dummy implementation. However, real-world extensions should consider using
the `store` to perform rendezvous and supporting the `timeout` argument.
id: totrans-19
prefs: []
type: TYPE_NORMAL
zh: 后端构造函数是从Python端调用的,因此扩展还需要向Python公开构造函数API。这可以通过添加以下方法来实现。在这个例子中,`store`和`timeout`被`BackendDummy`实例化方法忽略,因为在这个虚拟实现中没有使用它们。然而,真实世界的扩展应该考虑使用`store`来执行会合(rendezvous),并支持`timeout`参数。
- en: '[PRE2]'
id: totrans-20
prefs: []
type: TYPE_PRE
zh: '[PRE2]'
- en: '[PRE3]'
id: totrans-21
prefs: []
type: TYPE_PRE
zh: '[PRE3]'
- en: 'Step 3: Build The Custom Extension'
id: totrans-22
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 步骤3:构建自定义扩展
- en: Now, the extension source code files are ready. We can then use [cpp extensions](https://pytorch.org/docs/stable/cpp_extension.html)
to build it. To do that, create a `setup.py` file that prepares the paths and
commands. Then call `python setup.py develop` to install the extension.
id: totrans-23
prefs: []
type: TYPE_NORMAL
zh: 现在,扩展源代码文件已经准备好。我们可以使用[cpp extensions](https://pytorch.org/docs/stable/cpp_extension.html)来构建它。为此,创建一个`setup.py`文件,准备路径和命令。然后调用`python
setup.py develop`来安装扩展。
- en: If the extension depends on third-party libraries, you can also specify `libraries_dirs`
and `libraries` to the cpp extension APIs. See the [torch ucc](https://github.com/openucx/torch-ucc)
project as a real-world example.
id: totrans-24
prefs: []
type: TYPE_NORMAL
zh: 如果扩展依赖于第三方库,您还可以在cpp扩展API中指定`libraries_dirs`和`libraries`。请参考[torch ucc](https://github.com/openucx/torch-ucc)项目作为一个真实的例子。
- en: '[PRE4]'
id: totrans-25
prefs: []
type: TYPE_PRE
zh: '[PRE4]'
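A minimal `setup.py` sketch using `torch.utils.cpp_extension`; the source file layout and the extension/module names are assumptions, not the tutorial's exact values.

```python
# setup.py (sketch)
import os
from setuptools import setup
from torch.utils import cpp_extension

sources = ["src/dummy.cpp"]                        # assumed source layout
include_dirs = [os.path.join(os.getcwd(), "include")]

module = cpp_extension.CppExtension(
    name="dummy_collectives",                      # assumed module name
    sources=sources,
    include_dirs=include_dirs,
)

setup(
    name="dummy-collectives",
    ext_modules=[module],
    # BuildExtension adds the compiler/linker flags PyTorch extensions need.
    cmdclass={"build_ext": cpp_extension.BuildExtension},
)
```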
- en: 'Step 4: Use The Extension in Application'
id: totrans-26
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 步骤4:在应用程序中使用扩展。
- en: After installation, you can conveniently use the `dummy` backend when calling
[init_process_group](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group)
as if it is an builtin backend.
id: totrans-27
prefs: []
type: TYPE_NORMAL
zh: 安装完成后,您可以在调用[init_process_group](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group)时方便地使用`dummy`后端,就像它是一个内置后端一样。
- en: We can specify dispatching based on backend by changing the `backend` argument
of `init_process_group`. We can dispatch collective with CPU tensor to `gloo`
backend and dispatch collective with CUDA tensor to `dummy` backend by specifying
`cpu:gloo,cuda:dummy` as the backend argument.
id: totrans-28
prefs: []
type: TYPE_NORMAL
zh: 我们可以根据后端来指定调度,方法是改变`init_process_group`的`backend`参数。我们可以通过将后端参数指定为`cpu:gloo,cuda:dummy`,将CPU张量的集体分发到`gloo`后端,将CUDA张量的集体分发到`dummy`后端。
- en: To send all tensors to `dummy` backend, we can simply specify `dummy` as the
backend argument.
id: totrans-29
prefs: []
type: TYPE_NORMAL
zh: 要将所有张量发送到`dummy`后端,我们可以简单地将`dummy`指定为后端参数。
- en: '[PRE5]'
id: totrans-30
prefs: []
type: TYPE_PRE
zh: '[PRE5]'
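A hedged usage sketch for Step 4: importing the extension registers the custom backend, after which `init_process_group` accepts it like a built-in one. The module name `dummy_collectives`, the address/port, and the single-process world size are assumptions.

```python
import os
import torch
import dummy_collectives   # assumed extension module; importing it registers "dummy"
import torch.distributed as dist

os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "29500"

# Route CPU collectives to gloo and CUDA collectives to the custom backend.
dist.init_process_group("cpu:gloo,cuda:dummy", rank=0, world_size=1)

x = torch.ones(6)
dist.all_reduce(x)          # handled by gloo
print(x)

y = x.cuda()
dist.all_reduce(y)          # handled by the dummy backend (sets values to 0)
print(y)
```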
- en: Distributed Pipeline Parallelism Using RPC
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 使用RPC进行分布式管道并行
- en: 原文:[https://pytorch.org/tutorials/intermediate/dist_pipeline_parallel_tutorial.html](https://pytorch.org/tutorials/intermediate/dist_pipeline_parallel_tutorial.html)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
zh: 原文:[https://pytorch.org/tutorials/intermediate/dist_pipeline_parallel_tutorial.html](https://pytorch.org/tutorials/intermediate/dist_pipeline_parallel_tutorial.html)
- en: '**Author**: [Shen Li](https://mrshenli.github.io/)'
id: totrans-2
prefs: []
type: TYPE_NORMAL
zh: '**作者**:[Shen Li](https://mrshenli.github.io/)'
- en: Note
  id: totrans-3
  prefs: []
  type: TYPE_NORMAL
  zh: 注意
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png) View and edit this
  tutorial in [github](https://github.com/pytorch/tutorials/blob/main/intermediate_source/dist_pipeline_parallel_tutorial.rst).'
  id: totrans-4
  prefs: []
  type: TYPE_NORMAL
  zh: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png) 在[github](https://github.com/pytorch/tutorials/blob/main/intermediate_source/dist_pipeline_parallel_tutorial.rst)中查看并编辑本教程。'
- en: 'Prerequisites:'
id: totrans-5
prefs: []
type: TYPE_NORMAL
zh: 先决条件:
- en: '[PyTorch Distributed Overview](../beginner/dist_overview.html)'
id: totrans-6
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[PyTorch分布式概述](../beginner/dist_overview.html)'
- en: '[Single-Machine Model Parallel Best Practices](https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html)'
id: totrans-7
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[单机模型并行最佳实践](https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html)'
- en: '[Getting started with Distributed RPC Framework](https://pytorch.org/tutorials/intermediate/rpc_tutorial.html)'
id: totrans-8
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[开始使用分布式RPC框架](https://pytorch.org/tutorials/intermediate/rpc_tutorial.html)'
- en: 'RRef helper functions: [RRef.rpc_sync()](https://pytorch.org/docs/master/rpc.html#torch.distributed.rpc.RRef.rpc_sync),
[RRef.rpc_async()](https://pytorch.org/docs/master/rpc.html#torch.distributed.rpc.RRef.rpc_async),
and [RRef.remote()](https://pytorch.org/docs/master/rpc.html#torch.distributed.rpc.RRef.remote)'
id: totrans-9
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: RRef辅助函数:[RRef.rpc_sync()](https://pytorch.org/docs/master/rpc.html#torch.distributed.rpc.RRef.rpc_sync)、[RRef.rpc_async()](https://pytorch.org/docs/master/rpc.html#torch.distributed.rpc.RRef.rpc_async)和[RRef.remote()](https://pytorch.org/docs/master/rpc.html#torch.distributed.rpc.RRef.remote)
- en: This tutorial uses a Resnet50 model to demonstrate implementing distributed
pipeline parallelism with [torch.distributed.rpc](https://pytorch.org/docs/master/rpc.html)
APIs. This can be viewed as the distributed counterpart of the multi-GPU pipeline
parallelism discussed in [Single-Machine Model Parallel Best Practices](model_parallel_tutorial.html).
id: totrans-10
prefs: []
type: TYPE_NORMAL
zh: 本教程使用Resnet50模型演示了如何使用[torch.distributed.rpc](https://pytorch.org/docs/master/rpc.html)
API实现分布式管道并行。这可以看作是[单机模型并行最佳实践](model_parallel_tutorial.html)中讨论的多GPU管道并行的分布式对应。
- en: Note
id: totrans-11
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: This tutorial requires PyTorch v1.6.0 or above.
id: totrans-12
prefs: []
type: TYPE_NORMAL
zh: 本教程要求使用PyTorch v1.6.0或更高版本。
- en: Note
id: totrans-13
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: Full source code of this tutorial can be found at [pytorch/examples](https://github.com/pytorch/examples/tree/master/distributed/rpc/pipeline).
id: totrans-14
prefs: []
type: TYPE_NORMAL
zh: 本教程的完整源代码可以在[pytorch/examples](https://github.com/pytorch/examples/tree/master/distributed/rpc/pipeline)找到。
- en: Basics
id: totrans-15
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 基础知识
- en: The previous tutorial, [Getting Started with Distributed RPC Framework](rpc_tutorial.html)
shows how to use [torch.distributed.rpc](https://pytorch.org/docs/master/rpc.html)
to implement distributed model parallelism for an RNN model. That tutorial uses
......@@ -66,8 +99,10 @@
if a model lives on multiple GPUs, it would require some extra steps to increase
the amortized utilization of all GPUs. Pipeline parallelism is one type of paradigm
that can help in this case.
id: totrans-16
prefs: []
type: TYPE_NORMAL
zh: 之前的教程[开始使用分布式RPC框架](rpc_tutorial.html)展示了如何使用[torch.distributed.rpc](https://pytorch.org/docs/master/rpc.html)为RNN模型实现分布式模型并行。该教程使用一个GPU来托管`EmbeddingTable`,提供的代码可以正常工作。但是,如果一个模型存在于多个GPU上,就需要一些额外的步骤来增加所有GPU的摊销利用率。管道并行是一种可以在这种情况下有所帮助的范式之一。
- en: In this tutorial, we use `ResNet50` as an example model which is also used by
the [Single-Machine Model Parallel Best Practices](model_parallel_tutorial.html)
tutorial. Similarly, the `ResNet50` model is divided into two shards and the input
......@@ -76,21 +111,29 @@
using CUDA streams, this tutorial invokes asynchronous RPCs. So, the solution
presented in this tutorial also works across machine boundaries. The remainder
of this tutorial presents the implementation in four steps.
id: totrans-17
prefs: []
type: TYPE_NORMAL
zh: 在本教程中,我们以`ResNet50`作为示例模型,该模型也被[单机模型并行最佳实践](model_parallel_tutorial.html)教程使用。类似地,`ResNet50`模型被分成两个分片,并且输入批次被分成多个部分并以流水线方式馈送到两个模型分片中。不同之处在于,本教程使用异步RPC调用来并行执行,而不是使用CUDA流来并行执行。因此,本教程中提出的解决方案也适用于跨机器边界。本教程的其余部分将以四个步骤呈现实现。
- en: 'Step 1: Partition ResNet50 Model'
id: totrans-18
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 步骤1:对ResNet50模型进行分区
- en: This is the preparation step which implements `ResNet50` in two model shards.
The code below is borrowed from the [ResNet implementation in torchvision](https://github.com/pytorch/vision/blob/7c077f6a986f05383bcb86b535aedb5a63dd5c4b/torchvision/models/resnet.py#L124).
The `ResNetBase` module contains the common building blocks and attributes for
the two ResNet shards.
id: totrans-19
prefs: []
type: TYPE_NORMAL
zh: 这是准备步骤,实现了在两个模型分片中的`ResNet50`。下面的代码是从[torchvision中的ResNet实现](https://github.com/pytorch/vision/blob/7c077f6a986f05383bcb86b535aedb5a63dd5c4b/torchvision/models/resnet.py#L124)借用的。`ResNetBase`模块包含了两个ResNet分片的共同构建块和属性。
- en: '[PRE0]'
id: totrans-20
prefs: []
type: TYPE_PRE
zh: '[PRE0]'
- en: Now, we are ready to define the two model shards. For the constructor, we simply
split all ResNet50 layers into two parts and move each part into the provided
device. The `forward` functions of both shards take an `RRef` of the input data,
......@@ -98,15 +141,22 @@
all layers to the input, it moves the output to CPU and returns. It is because
the RPC API requires tensors to reside on CPU to avoid invalid device errors when
the numbers of devices in the caller and the callee do not match.
id: totrans-21
prefs: []
type: TYPE_NORMAL
zh: 现在,我们准备定义两个模型分片。对于构造函数,我们简单地将所有ResNet50层分成两部分,并将每部分移动到提供的设备上。这两个分片的`forward`函数接受输入数据的`RRef`,在本地获取数据,然后将其移动到预期的设备上。在将所有层应用于输入后,将输出移动到CPU并返回。这是因为RPC
API要求张量驻留在CPU上,以避免在调用方和被调用方的设备数量不匹配时出现无效设备错误。
- en: '[PRE1]'
id: totrans-22
prefs: []
type: TYPE_PRE
zh: '[PRE1]'
- en: 'Step 2: Stitch ResNet50 Model Shards Into One Module'
id: totrans-23
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 步骤2:将ResNet50模型分片拼接成一个模块
- en: Then, we create a `DistResNet50` module to assemble the two shards and implement
the pipeline parallel logic. In the constructor, we use two `rpc.remote` calls
to put the two shards on two different RPC workers respectively and hold on to
......@@ -124,15 +174,21 @@
of all micro-batches into one single output tensor and returns. The `parameter_rrefs`
function is a helper to simplify distributed optimizer construction, which will
be used later.
id: totrans-24
prefs: []
type: TYPE_NORMAL
zh: 然后,我们创建一个`DistResNet50`模块来组装两个分片并实现管道并行逻辑。在构造函数中,我们使用两个`rpc.remote`调用分别将两个分片放在两个不同的RPC工作进程上,并保留两个模型部分的`RRef`,以便它们可以在前向传递中引用。`forward`函数将输入批次分成多个微批次,并以管道方式将这些微批次馈送到两个模型部分。它首先使用`rpc.remote`调用将第一个分片应用于微批次,然后将返回的中间输出`RRef`转发到第二个模型分片。之后,它收集所有微输出的`Future`,并在循环后等待所有微输出。请注意,`remote()`和`rpc_async()`都会立即返回并异步运行。因此,整个循环是非阻塞的,并且将同时启动多个RPC。通过中间输出`y_rref`保留了两个模型部分上一个微批次的执行顺序。跨微批次的执行顺序并不重要。最后,forward函数将所有微批次的输出连接成一个单一的输出张量并返回。`parameter_rrefs`函数是一个辅助函数,用于简化分布式优化器的构建,稍后将使用它。
- en: '[PRE2]'
id: totrans-25
prefs: []
type: TYPE_PRE
zh: '[PRE2]'
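A condensed sketch of the pipeline logic described above, with a stand-in `Shard` module instead of the real ResNet halves; class names, layer sizes, and devices are illustrative, and the RRef helpers used are the ones listed under Prerequisites.

```python
import torch
import torch.nn as nn
import torch.distributed.rpc as rpc
from torch.distributed.rpc import RRef

class Shard(nn.Module):
    """Stand-in for one ResNet shard (hypothetical, for illustration)."""
    def __init__(self, device):
        super().__init__()
        self.device = device
        self.net = nn.Linear(16, 16).to(device)

    def forward(self, x_rref):
        x = x_rref.to_here().to(self.device)   # fetch the micro-batch locally
        return self.net(x).cpu()               # RPC expects CPU tensors back

    def parameter_rrefs(self):
        return [RRef(p) for p in self.parameters()]

class DistModel(nn.Module):
    """Pipeline two shards over RPC, one micro-batch at a time."""
    def __init__(self, split_size, workers):
        super().__init__()
        self.split_size = split_size
        self.p1_rref = rpc.remote(workers[0], Shard, args=("cuda:0",))
        self.p2_rref = rpc.remote(workers[1], Shard, args=("cuda:0",))

    def forward(self, xs):
        out_futures = []
        for x in xs.split(self.split_size, dim=0):
            x_rref = RRef(x)
            y_rref = self.p1_rref.remote().forward(x_rref)      # stage 1 (async)
            z_fut = self.p2_rref.rpc_async().forward(y_rref)    # stage 2 (async)
            out_futures.append(z_fut)
        # Wait for every micro-batch, then concatenate the outputs.
        return torch.cat(torch.futures.wait_all(out_futures))

    def parameter_rrefs(self):
        remote_params = []
        remote_params.extend(self.p1_rref.remote().parameter_rrefs().to_here())
        remote_params.extend(self.p2_rref.remote().parameter_rrefs().to_here())
        return remote_params
```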
- en: 'Step 3: Define The Training Loop'
id: totrans-26
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 步骤3:定义训练循环
- en: After defining the model, let us implement the training loop. We use a dedicated
“master” worker to prepare random inputs and labels, and control the distributed
backward pass and distributed optimizer step. It first creates an instance of
......@@ -143,21 +199,31 @@
training loop is very similar to regular local training, except that it uses `dist_autograd`
to launch backward and provides the `context_id` for both backward and optimizer
`step()`.
id: totrans-27
prefs: []
type: TYPE_NORMAL
zh: 在定义模型之后,让我们实现训练循环。我们使用一个专用的“主”工作进程来准备随机输入和标签,并控制分布式反向传递和分布式优化器步骤。首先创建一个`DistResNet50`模块的实例。它指定每个批次的微批次数量,并提供两个RPC工作进程的名称(即“worker1”和“worker2”)。然后定义损失函数,并使用`parameter_rrefs()`助手创建一个`DistributedOptimizer`来获取参数`RRefs`的列表。然后,主要训练循环与常规本地训练非常相似,只是它使用`dist_autograd`来启动反向传递,并为反向传递和优化器`step()`提供`context_id`。
- en: '[PRE3]'
id: totrans-28
prefs: []
type: TYPE_PRE
zh: '[PRE3]'
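A sketch of the master-side training loop under the assumptions above, reusing `DistModel` from the previous sketch; random data and an MSE loss keep the example self-contained.

```python
import torch
import torch.distributed.autograd as dist_autograd
import torch.optim as optim
from torch.distributed.optim import DistributedOptimizer

def run_master(split_size):
    model = DistModel(split_size, ["worker1", "worker2"])
    loss_fn = torch.nn.MSELoss()
    opt = DistributedOptimizer(optim.SGD, model.parameter_rrefs(), lr=0.05)

    for _ in range(10):                            # a few dummy iterations
        inputs = torch.randn(64, 16)
        labels = torch.randn(64, 16)
        with dist_autograd.context() as context_id:
            outputs = model(inputs)
            dist_autograd.backward(context_id, [loss_fn(outputs, labels)])
            opt.step(context_id)                   # same context for backward + step
```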
- en: 'Step 4: Launch RPC Processes'
id: totrans-29
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 步骤4:启动RPC进程
- en: Finally, the code below shows the target function for all processes. The main
logic is defined in `run_master`. The workers passively waiting for commands from
the master, and hence simply runs `init_rpc` and `shutdown`, where the `shutdown`
by default will block until all RPC participants finish.
id: totrans-30
prefs: []
type: TYPE_NORMAL
zh: 最后,下面的代码展示了所有进程的目标函数。主要逻辑在`run_master`中定义。工作进程被动地等待来自主进程的命令,因此只需运行`init_rpc`和`shutdown`,其中`shutdown`默认情况下将阻塞,直到所有RPC参与者完成。
- en: '[PRE4]'
id: totrans-31
prefs: []
type: TYPE_PRE
zh: '[PRE4]'
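A sketch of the per-process target function: rank 0 becomes the master and drives `run_master` (from the previous sketch), while the other ranks only serve RPCs. The address, port, and process counts are illustrative.

```python
import os
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

def run_worker(rank, world_size, split_size):
    os.environ["MASTER_ADDR"] = "localhost"        # placeholder rendezvous address
    os.environ["MASTER_PORT"] = "29500"
    if rank == 0:
        rpc.init_rpc("master", rank=rank, world_size=world_size)
        run_master(split_size)                     # drives the whole job
    else:
        # Workers passively wait for commands from the master.
        rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    rpc.shutdown()                                 # blocks until all RPC ranks finish

if __name__ == "__main__":
    world_size = 3                                 # master + two shard workers
    mp.spawn(run_worker, args=(world_size, 8), nprocs=world_size, join=True)
```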
- en: Combining Distributed DataParallel with Distributed RPC Framework
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 将分布式DataParallel与分布式RPC框架结合起来
- en: 原文:[https://pytorch.org/tutorials/advanced/rpc_ddp_tutorial.html](https://pytorch.org/tutorials/advanced/rpc_ddp_tutorial.html)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
zh: 原文:[https://pytorch.org/tutorials/advanced/rpc_ddp_tutorial.html](https://pytorch.org/tutorials/advanced/rpc_ddp_tutorial.html)
- en: '**Authors**: [Pritam Damania](https://github.com/pritamdamania87) and [Yi Wang](https://github.com/wayi1)'
id: totrans-2
prefs: []
type: TYPE_NORMAL
zh: '**作者**:[Pritam Damania](https://github.com/pritamdamania87) 和 [Yi Wang](https://github.com/wayi1)'
- en: Note
  id: totrans-3
  prefs: []
  type: TYPE_NORMAL
  zh: 注意
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png) View and edit this
  tutorial in [github](https://github.com/pytorch/tutorials/blob/main/advanced_source/rpc_ddp_tutorial.rst).'
  id: totrans-4
  prefs: []
  type: TYPE_NORMAL
  zh: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png) 在[github](https://github.com/pytorch/tutorials/blob/main/advanced_source/rpc_ddp_tutorial.rst)中查看和编辑本教程。'
- en: This tutorial uses a simple example to demonstrate how you can combine [DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel)
(DDP) with the [Distributed RPC framework](https://pytorch.org/docs/master/rpc.html)
to combine distributed data parallelism with distributed model parallelism to
train a simple model. Source code of the example can be found [here](https://github.com/pytorch/examples/tree/master/distributed/rpc/ddp_rpc).
id: totrans-5
prefs: []
type: TYPE_NORMAL
zh: 本教程使用一个简单的示例来演示如何将[DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel)(DDP)与[Distributed
RPC framework](https://pytorch.org/docs/master/rpc.html)结合起来,以将分布式数据并行与分布式模型并行结合起来训练一个简单的模型。示例的源代码可以在[这里](https://github.com/pytorch/examples/tree/master/distributed/rpc/ddp_rpc)找到。
- en: 'Previous tutorials, [Getting Started With Distributed Data Parallel](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)
and [Getting Started with Distributed RPC Framework](https://pytorch.org/tutorials/intermediate/rpc_tutorial.html),
described how to perform distributed data parallel and distributed model parallel
training respectively. Although, there are several training paradigms where you
might want to combine these two techniques. For example:'
id: totrans-6
prefs: []
type: TYPE_NORMAL
zh: 之前的教程,[Getting Started With Distributed Data Parallel](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)和[Getting
Started with Distributed RPC Framework](https://pytorch.org/tutorials/intermediate/rpc_tutorial.html),分别描述了如何执行分布式数据并行和分布式模型并行训练。尽管如此,有几种训练范式可能需要结合这两种技术。例如:
- en: If we have a model with a sparse part (large embedding table) and a dense part
(FC layers), we might want to put the embedding table on a parameter server and
replicate the FC layer across multiple trainers using [DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel).
The [Distributed RPC framework](https://pytorch.org/docs/master/rpc.html) can
be used to perform embedding lookups on the parameter server.
id: totrans-7
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 如果我们的模型有一个稀疏部分(大型嵌入表)和一个稠密部分(FC层),我们可能希望将嵌入表放在参数服务器上,并使用[DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel)将FC层复制到多个训练器上。[Distributed
RPC framework](https://pytorch.org/docs/master/rpc.html)可用于在参数服务器上执行嵌入查找。
- en: Enable hybrid parallelism as described in the [PipeDream](https://arxiv.org/abs/1806.03377)
paper. We can use the [Distributed RPC framework](https://pytorch.org/docs/master/rpc.html)
to pipeline stages of the model across multiple workers and replicate each stage
(if needed) using [DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel).
id: totrans-8
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 启用混合并行,如[PipeDream](https://arxiv.org/abs/1806.03377)论文中所述。我们可以使用[Distributed
RPC framework](https://pytorch.org/docs/master/rpc.html)将模型的阶段在多个工作节点上进行流水线处理,并使用[DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel)复制每个阶段(如果需要)。
- en: 'In this tutorial we will cover case 1 mentioned above. We have a total of 4
workers in our setup as follows:'
id: totrans-9
prefs: []
type: TYPE_NORMAL
zh: 在本教程中,我们将涵盖上述第1种情况。在我们的设置中,总共有4个工作节点:
- en: 1 Master, which is responsible for creating an embedding table (nn.EmbeddingBag)
on the parameter server. The master also drives the training loop on the two trainers.
id: totrans-10
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 1个主节点,负责在参数服务器上创建一个嵌入表(nn.EmbeddingBag)。主节点还驱动两个训练器的训练循环。
- en: 1 Parameter Server, which basically holds the embedding table in memory and
responds to RPCs from the Master and Trainers.
id: totrans-11
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 1个参数服务器,基本上在内存中保存嵌入表,并响应来自主节点和训练器的RPC。
- en: 2 Trainers, which store an FC layer (nn.Linear) which is replicated amongst
themselves using [DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel).
The trainers are also responsible for executing the forward pass, backward pass
and optimizer step.
id: totrans-12
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 2个训练器,它们存储一个在它们之间复制的FC层(nn.Linear),使用[DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel)。这些训练器还负责执行前向传播、反向传播和优化器步骤。
- en: 'The entire training process is executed as follows:'
id: totrans-13
prefs: []
type: TYPE_NORMAL
zh: 整个训练过程如下执行:
- en: The master creates a [RemoteModule](https://pytorch.org/docs/master/rpc.html#remotemodule)
that holds an embedding table on the Parameter Server.
id: totrans-14
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 主节点创建一个[RemoteModule](https://pytorch.org/docs/master/rpc.html#remotemodule),在参数服务器上保存一个嵌入表。
- en: The master, then kicks off the training loop on the trainers and passes the
remote module to the trainers.
id: totrans-15
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 然后主节点启动训练循环,并将远程模块传递给训练器。
- en: The trainers create a `HybridModel` which first performs an embedding lookup
using the remote module provided by the master and then executes the FC layer
which is wrapped inside DDP.
id: totrans-16
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 训练器创建一个`HybridModel`,首先使用主节点提供的远程模块进行嵌入查找,然后执行包含在DDP中的FC层。
- en: The trainer executes the forward pass of the model and uses the loss to execute
the backward pass using [Distributed Autograd](https://pytorch.org/docs/master/rpc.html#distributed-autograd-framework).
id: totrans-17
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 训练器执行模型的前向传播,并使用损失执行反向传播,使用[Distributed Autograd](https://pytorch.org/docs/master/rpc.html#distributed-autograd-framework)。
- en: As part of the backward pass, the gradients for the FC layer are computed first
and synced to all trainers via allreduce in DDP.
id: totrans-18
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 在反向传播的过程中,首先计算FC层的梯度,然后通过DDP中的allreduce同步到所有训练器。
- en: Next, Distributed Autograd propagates the gradients to the parameter server,
where the gradients for the embedding table are updated.
id: totrans-19
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 接下来,Distributed Autograd将梯度传播到参数服务器,更新嵌入表的梯度。
- en: Finally, the [Distributed Optimizer](https://pytorch.org/docs/master/rpc.html#module-torch.distributed.optim)
is used to update all the parameters.
id: totrans-20
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 最后,使用[Distributed Optimizer](https://pytorch.org/docs/master/rpc.html#module-torch.distributed.optim)来更新所有参数。
- en: Attention
id: totrans-21
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: You should always use [Distributed Autograd](https://pytorch.org/docs/master/rpc.html#distributed-autograd-framework)
for the backward pass if you’re combining DDP and RPC.
id: totrans-22
prefs: []
type: TYPE_NORMAL
zh: 如果结合DDP和RPC,应始终使用[Distributed Autograd](https://pytorch.org/docs/master/rpc.html#distributed-autograd-framework)进行反向传播。
- en: Now, let’s go through each part in detail. Firstly, we need to setup all of
our workers before we can perform any training. We create 4 processes such that
ranks 0 and 1 are our trainers, rank 2 is the master and rank 3 is the parameter
server.
id: totrans-23
prefs: []
type: TYPE_NORMAL
zh: 现在,让我们逐个详细介绍每个部分。首先,我们需要在进行任何训练之前设置所有的worker。我们创建4个进程,其中rank 0和1是我们的Trainer,rank
2是主节点,rank 3是参数服务器。
- en: We initialize the RPC framework on all 4 workers using the TCP init_method.
Once RPC initialization is done, the master creates a remote module that holds
an [EmbeddingBag](https://pytorch.org/docs/master/generated/torch.nn.EmbeddingBag.html)
......@@ -124,8 +177,10 @@
The master then loops through each trainer and kicks off the training loop by
calling `_run_trainer` on each trainer using [rpc_async](https://pytorch.org/docs/master/rpc.html#torch.distributed.rpc.rpc_async).
Finally, the master waits for all training to finish before exiting.
id: totrans-24
prefs: []
type: TYPE_NORMAL
zh: 我们使用TCP init_method在所有4个worker上初始化RPC框架。一旦RPC初始化完成,主节点会创建一个远程模块,该模块在参数服务器上保存了一个[EmbeddingBag](https://pytorch.org/docs/master/generated/torch.nn.EmbeddingBag.html)层,使用[RemoteModule](https://pytorch.org/docs/master/rpc.html#torch.distributed.nn.api.remote_module.RemoteModule)。然后主节点循环遍历每个Trainer,并通过调用[rpc_async](https://pytorch.org/docs/master/rpc.html#torch.distributed.rpc.rpc_async)在每个Trainer上调用`_run_trainer`来启动训练循环。最后,主节点在退出之前等待所有训练完成。
- en: The trainers first initialize a `ProcessGroup` for DDP with world_size=2 (for
two trainers) using [init_process_group](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group).
Next, they initialize the RPC framework using the TCP init_method. Note that the
......@@ -133,36 +188,52 @@
is to avoid port conflicts between initialization of both frameworks. Once the
initialization is done, the trainers just wait for the `_run_trainer` RPC from
the master.
id: totrans-25
prefs: []
type: TYPE_NORMAL
zh: Trainer首先为DDP初始化一个world_size=2(两个Trainer)的`ProcessGroup`,使用[init_process_group](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group)。接下来,他们使用TCP
init_method初始化RPC框架。请注意,RPC初始化和ProcessGroup初始化中的端口是不同的。这是为了避免两个框架初始化之间的端口冲突。初始化完成后,Trainer只需等待来自主节点的`_run_trainer`
RPC。
- en: The parameter server just initializes the RPC framework and waits for RPCs from
the trainers and master.
id: totrans-26
prefs: []
type: TYPE_NORMAL
zh: 参数服务器只是初始化RPC框架并等待来自Trainer和主节点的RPC。
- en: '[PRE0]'
id: totrans-27
prefs: []
type: TYPE_PRE
zh: '[PRE0]'
- en: Before we discuss details of the Trainer, let’s introduce the `HybridModel`
that the trainer uses. As described below, the `HybridModel` is initialized using
a remote module that holds an embedding table (`remote_emb_module`) on the parameter
server and the `device` to use for DDP. The initialization of the model wraps
an [nn.Linear](https://pytorch.org/docs/master/generated/torch.nn.Linear.html)
layer inside DDP to replicate and synchronize this layer across all trainers.
id: totrans-28
prefs: []
type: TYPE_NORMAL
zh: 在讨论Trainer的细节之前,让我们先介绍一下Trainer使用的`HybridModel`。如下所述,`HybridModel`是使用一个远程模块进行初始化的,该远程模块在参数服务器上保存了一个嵌入表(`remote_emb_module`)和用于DDP的`device`。模型的初始化将一个[nn.Linear](https://pytorch.org/docs/master/generated/torch.nn.Linear.html)层包装在DDP中,以便在所有Trainer之间复制和同步这个层。
- en: The forward method of the model is pretty straightforward. It performs an embedding
lookup on the parameter server using RemoteModule’s `forward` and passes its output
onto the FC layer.
id: totrans-29
prefs: []
type: TYPE_NORMAL
zh: 模型的前向方法非常简单。它使用RemoteModule的`forward`在参数服务器上进行嵌入查找,并将其输出传递给FC层。
- en: '[PRE1]'
id: totrans-30
prefs: []
type: TYPE_PRE
zh: '[PRE1]'
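A sketch of the `HybridModel` idea: the embedding lookup goes through the `RemoteModule` on the parameter server, while the FC layer is replicated with DDP; the layer sizes are assumptions.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class HybridModel(nn.Module):
    """Remote EmbeddingBag on the parameter server + a local FC layer in DDP."""
    def __init__(self, remote_emb_module, device):
        super().__init__()
        self.remote_emb_module = remote_emb_module     # RemoteModule held by the PS
        self.fc = DDP(nn.Linear(16, 8).cuda(device), device_ids=[device])
        self.device = device

    def forward(self, indices, offsets):
        # Embedding lookup runs on the parameter server via RPC.
        emb_lookup = self.remote_emb_module.forward(indices, offsets)
        # The FC layer runs locally and is synchronized by DDP.
        return self.fc(emb_lookup.cuda(self.device))
```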
- en: Next, let’s look at the setup on the Trainer. The trainer first creates the
`HybridModel` described above using a remote module that holds the embedding table
on the parameter server and its own rank.
id: totrans-31
prefs: []
type: TYPE_NORMAL
zh: 接下来,让我们看一下Trainer的设置。Trainer首先使用一个远程模块创建上述`HybridModel`,该远程模块在参数服务器上保存了嵌入表和自己的rank。
- en: Now, we need to retrieve a list of RRefs to all the parameters that we would
like to optimize with [DistributedOptimizer](https://pytorch.org/docs/master/rpc.html#module-torch.distributed.optim).
To retrieve the parameters for the embedding table from the parameter server,
......@@ -176,44 +247,66 @@
it to the list returned from `remote_parameters()`. Note that we cannot use `model.parameters()`,
because it will recursively call `model.remote_emb_module.parameters()`, which
is not supported by `RemoteModule`.
id: totrans-32
prefs: []
type: TYPE_NORMAL
zh: 现在,我们需要获取一个RRefs列表,其中包含我们想要使用[DistributedOptimizer](https://pytorch.org/docs/master/rpc.html#module-torch.distributed.optim)进行优化的所有参数。为了从参数服务器检索嵌入表的参数,我们可以调用RemoteModule的[remote_parameters](https://pytorch.org/docs/master/rpc.html#torch.distributed.nn.api.remote_module.RemoteModule.remote_parameters),这个方法基本上遍历了嵌入表的所有参数,并返回一个RRefs列表。Trainer通过RPC在参数服务器上调用这个方法,以接收到所需参数的RRefs列表。由于DistributedOptimizer始终需要一个要优化的参数的RRefs列表,我们需要为FC层的本地参数创建RRefs。这是通过遍历`model.fc.parameters()`,为每个参数创建一个RRef,并将其附加到从`remote_parameters()`返回的列表中完成的。请注意,我们不能使用`model.parameters()`,因为它会递归调用`model.remote_emb_module.parameters()`,这是`RemoteModule`不支持的。
- en: Finally, we create our DistributedOptimizer using all the RRefs and define a
CrossEntropyLoss function.
id: totrans-33
prefs: []
type: TYPE_NORMAL
zh: 最后,我们使用所有的RRefs创建我们的DistributedOptimizer,并定义一个CrossEntropyLoss函数。
- en: '[PRE2]'
id: totrans-34
prefs: []
type: TYPE_PRE
zh: '[PRE2]'
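A sketch of the parameter-RRef gathering described above, building on the `HybridModel` sketch; the optimizer choice and learning rate are assumptions.

```python
import torch
import torch.distributed.rpc as rpc
from torch import optim
from torch.distributed.optim import DistributedOptimizer

def _run_trainer(remote_emb_module, rank):
    model = HybridModel(remote_emb_module, rank)

    # RRefs to the embedding-table parameters, fetched from the parameter server.
    model_parameter_rrefs = model.remote_emb_module.remote_parameters()

    # model.parameters() would recurse into the RemoteModule (unsupported),
    # so wrap the local FC parameters in RRefs explicitly.
    for param in model.fc.parameters():
        model_parameter_rrefs.append(rpc.RRef(param))

    opt = DistributedOptimizer(optim.SGD, model_parameter_rrefs, lr=0.05)
    criterion = torch.nn.CrossEntropyLoss()
    return model, opt, criterion
```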
- en: 'Now we’re ready to introduce the main training loop that is run on each trainer.
`get_next_batch` is just a helper function to generate random inputs and targets
for training. We run the training loop for multiple epochs and for each batch:'
id: totrans-35
prefs: []
type: TYPE_NORMAL
zh: 现在我们准备介绍在每个训练器上运行的主要训练循环。`get_next_batch`只是一个辅助函数,用于生成训练的随机输入和目标。我们对多个epochs和每个batch运行训练循环:
- en: Setup a [Distributed Autograd Context](https://pytorch.org/docs/master/rpc.html#torch.distributed.autograd.context)
for Distributed Autograd.
id: totrans-36
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 为分布式自动求导设置[Distributed Autograd Context](https://pytorch.org/docs/master/rpc.html#torch.distributed.autograd.context)。
- en: Run the forward pass of the model and retrieve its output.
id: totrans-37
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 运行模型的前向传播并检索其输出。
- en: Compute the loss based on our outputs and targets using the loss function.
id: totrans-38
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 使用损失函数基于我们的输出和目标计算损失。
- en: Use Distributed Autograd to execute a distributed backward pass using the loss.
id: totrans-39
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 使用分布式自动求导来执行使用损失函数的分布式反向传播。
- en: Finally, run a Distributed Optimizer step to optimize all the parameters.
id: totrans-40
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 最后,运行一个分布式优化器步骤来优化所有参数。
- en: '[PRE3]'
id: totrans-41
prefs: []
type: TYPE_PRE
zh: '[PRE3]'
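A sketch mapping the five steps above onto code; `get_next_batch` is the helper mentioned in the text, and its exact signature here is an assumption.

```python
import torch.distributed.autograd as dist_autograd

def train(model, opt, criterion, rank, num_epochs=10):
    for _ in range(num_epochs):
        for indices, offsets, target in get_next_batch(rank):   # assumed helper
            with dist_autograd.context() as context_id:         # 1. autograd context
                output = model(indices, offsets)                 # 2. forward pass
                loss = criterion(output, target)                 # 3. loss
                dist_autograd.backward(context_id, [loss])       # 4. distributed backward
                opt.step(context_id)                             # 5. distributed optimizer step
                # No zero_grad(): each distributed autograd context
                # accumulates its gradients independently.
```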
- en: Source code for the entire example can be found [here](https://github.com/pytorch/examples/tree/master/distributed/rpc/ddp_rpc).
id: totrans-42
prefs: []
type: TYPE_NORMAL
zh: 整个示例的源代码可以在[这里](https://github.com/pytorch/examples/tree/master/distributed/rpc/ddp_rpc)找到。
- en: Training Transformer models using Pipeline Parallelism
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 使用管道并行性训练Transformer模型
- en: 原文:[https://pytorch.org/tutorials/intermediate/pipeline_tutorial.html](https://pytorch.org/tutorials/intermediate/pipeline_tutorial.html)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
zh: 原文:[https://pytorch.org/tutorials/intermediate/pipeline_tutorial.html](https://pytorch.org/tutorials/intermediate/pipeline_tutorial.html)
- en: Note
id: totrans-2
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: Click [here](#sphx-glr-download-intermediate-pipeline-tutorial-py) to download
the full example code
id: totrans-3
prefs: []
type: TYPE_NORMAL
zh: 点击[这里](#sphx-glr-download-intermediate-pipeline-tutorial-py)下载完整示例代码
- en: '**Author**: [Pritam Damania](https://github.com/pritamdamania87)'
id: totrans-4
prefs: []
type: TYPE_NORMAL
zh: '**作者**:[Pritam Damania](https://github.com/pritamdamania87)'
- en: This tutorial demonstrates how to train a large Transformer model across multiple
GPUs using pipeline parallelism. This tutorial is an extension of the [Sequence-to-Sequence
Modeling with nn.Transformer and TorchText](https://pytorch.org/tutorials/beginner/transformer_tutorial.html)
tutorial and scales up the same model to demonstrate how pipeline parallelism
can be used to train Transformer models.
id: totrans-5
prefs: []
type: TYPE_NORMAL
zh: 本教程演示了如何使用管道并行性在多个GPU上训练大型Transformer模型。本教程是[使用nn.Transformer和TorchText进行序列到序列建模](https://pytorch.org/tutorials/beginner/transformer_tutorial.html)教程的延伸,并扩展了相同的模型,以演示如何使用管道并行性来训练Transformer模型。
- en: 'Prerequisites:'
id: totrans-6
prefs: []
type: TYPE_NORMAL
zh: 先决条件:
- en: '[Pipeline Parallelism](https://pytorch.org/docs/stable/pipeline.html)'
id: totrans-7
prefs:
- PREF_BQ
- PREF_UL
type: TYPE_NORMAL
zh: '[管道并行性](https://pytorch.org/docs/stable/pipeline.html)'
- en: ''
id: totrans-8
prefs:
- PREF_BQ
- PREF_IND
type: TYPE_NORMAL
- en: ''
id: totrans-9
prefs:
- PREF_BQ
- PREF_IND
type: TYPE_NORMAL
- en: '[Sequence-to-Sequence Modeling with nn.Transformer and TorchText](https://pytorch.org/tutorials/beginner/transformer_tutorial.html)'
id: totrans-10
prefs:
- PREF_BQ
- PREF_UL
type: TYPE_NORMAL
zh: '[使用nn.Transformer和TorchText进行序列到序列建模](https://pytorch.org/tutorials/beginner/transformer_tutorial.html)'
- en: Define the model
id: totrans-11
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 定义模型
- en: In this tutorial, we will split a Transformer model across two GPUs and use
pipeline parallelism to train the model. The model is exactly the same model used
in the [Sequence-to-Sequence Modeling with nn.Transformer and TorchText](https://pytorch.org/tutorials/beginner/transformer_tutorial.html)
......@@ -62,112 +84,166 @@
are on another. To do this, we pull out the `Encoder` and `Decoder` sections into
separate modules and then build an `nn.Sequential` representing the original Transformer
module.
id: totrans-12
prefs: []
type: TYPE_NORMAL
zh: 在本教程中,我们将把一个Transformer模型分成两个GPU,并使用管道并行性来训练模型。该模型与[使用nn.Transformer和TorchText进行序列到序列建模](https://pytorch.org/tutorials/beginner/transformer_tutorial.html)教程中使用的模型完全相同,但被分成两个阶段。最大数量的参数属于[nn.TransformerEncoder](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html)层。[nn.TransformerEncoder](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html)本身由`nlayers`个[nn.TransformerEncoderLayer](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html)组成。因此,我们的重点是`nn.TransformerEncoder`,我们将模型分成一半的`nn.TransformerEncoderLayer`在一个GPU上,另一半在另一个GPU上。为此,我们将`Encoder`和`Decoder`部分提取到单独的模块中,然后构建一个代表原始Transformer模块的`nn.Sequential`。
- en: '[PRE0]'
id: totrans-13
prefs: []
type: TYPE_PRE
zh: '[PRE0]'
- en: '`PositionalEncoding` module injects some information about the relative or
absolute position of the tokens in the sequence. The positional encodings have
the same dimension as the embeddings so that the two can be summed. Here, we use
`sine` and `cosine` functions of different frequencies.'
id: totrans-14
prefs: []
type: TYPE_NORMAL
zh: '`PositionalEncoding`模块注入了关于序列中标记的相对或绝对位置的一些信息。位置编码与嵌入具有相同的维度,因此可以将两者相加。在这里,我们使用不同频率的`sine`和`cosine`函数。'
- en: '[PRE1]'
id: totrans-15
prefs: []
type: TYPE_PRE
zh: '[PRE1]'
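The sine/cosine positional encoding the paragraph describes can be written roughly as follows; this is a common reference implementation, not necessarily identical to the elided `[PRE1]` block.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Standard sine/cosine positional encoding (sketch)."""
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)   # even dims: sine
        pe[:, 1::2] = torch.cos(position * div_term)   # odd dims: cosine
        self.register_buffer("pe", pe.unsqueeze(0).transpose(0, 1))

    def forward(self, x):
        # x: (seq_len, batch, d_model); add the matching slice of encodings.
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)
```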
- en: Load and batch data
id: totrans-16
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 加载和批处理数据
- en: The training process uses Wikitext-2 dataset from `torchtext`. To access torchtext
datasets, please install torchdata following instructions at [https://github.com/pytorch/data](https://github.com/pytorch/data).
id: totrans-17
prefs: []
type: TYPE_NORMAL
zh: 训练过程使用了来自`torchtext`的Wikitext-2数据集。要访问torchtext数据集,请按照[https://github.com/pytorch/data](https://github.com/pytorch/data)上的说明安装torchdata。
- en: 'The vocab object is built based on the train dataset and is used to numericalize
tokens into tensors. Starting from sequential data, the `batchify()` function
arranges the dataset into columns, trimming off any tokens remaining after the
data has been divided into batches of size `batch_size`. For instance, with the
alphabet as the sequence (total length of 26) and a batch size of 4, we would
divide the alphabet into 4 sequences of length 6:'
id: totrans-18
prefs: []
type: TYPE_NORMAL
zh: vocab对象是基于训练数据集构建的,并用于将标记数值化为张量。从顺序数据开始,`batchify()`函数将数据集排列成列,将数据分成大小为`batch_size`的批次后,修剪掉任何剩余的标记。例如,以字母表作为序列(总长度为26)和批次大小为4,我们将字母表分成长度为6的4个序列:
- en: \[\begin{bmatrix} \text{A} & \text{B} & \text{C} & \ldots & \text{X} & \text{Y}
& \text{Z} \end{bmatrix} \Rightarrow \begin{bmatrix} \begin{bmatrix}\text{A} \\
\text{B} \\ \text{C} \\ \text{D} \\ \text{E} \\ \text{F}\end{bmatrix} & \begin{bmatrix}\text{G}
\\ \text{H} \\ \text{I} \\ \text{J} \\ \text{K} \\ \text{L}\end{bmatrix} & \begin{bmatrix}\text{M}
\\ \text{N} \\ \text{O} \\ \text{P} \\ \text{Q} \\ \text{R}\end{bmatrix} & \begin{bmatrix}\text{S}
\\ \text{T} \\ \text{U} \\ \text{V} \\ \text{W} \\ \text{X}\end{bmatrix} \end{bmatrix}\]
id: totrans-19
prefs: []
type: TYPE_NORMAL
zh: \[\begin{bmatrix} \text{A} & \text{B} & \text{C} & \ldots & \text{X} & \text{Y}
& \text{Z} \end{bmatrix} \Rightarrow \begin{bmatrix} \begin{bmatrix}\text{A} \\
\text{B} \\ \text{C} \\ \text{D} \\ \text{E} \\ \text{F}\end{bmatrix} & \begin{bmatrix}\text{G}
\\ \text{H} \\ \text{I} \\ \text{J} \\ \text{K} \\ \text{L}\end{bmatrix} & \begin{bmatrix}\text{M}
\\ \text{N} \\ \text{O} \\ \text{P} \\ \text{Q} \\ \text{R}\end{bmatrix} & \begin{bmatrix}\text{S}
\\ \text{T} \\ \text{U} \\ \text{V} \\ \text{W} \\ \text{X}\end{bmatrix} \end{bmatrix}\]
- en: These columns are treated as independent by the model, which means that the
dependence of `G` and `F` can not be learned, but allows more efficient batch
processing.
id: totrans-20
prefs: []
type: TYPE_NORMAL
zh: 模型将这些列视为独立的,这意味着无法学习`G`和`F`之间的依赖关系,但可以实现更高效的批处理。
- en: '[PRE2]'
id: totrans-21
prefs: []
type: TYPE_PRE
zh: '[PRE2]'
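The loading code is elided as `[PRE2]`; the core of `batchify()` is a trim-and-reshape, sketched here under the assumption that the caller passes the target device explicitly. With the 26-letter alphabet and `bsz=4`, this yields the 6-by-4 arrangement shown above.

```python
import torch

def batchify(data, bsz, device):
    # data: 1-D tensor of token ids for the whole corpus.
    nbatch = data.size(0) // bsz
    # Trim off any tokens that would not cleanly fill bsz columns.
    data = data.narrow(0, 0, nbatch * bsz)
    # Arrange into columns: final shape is [nbatch, bsz].
    data = data.view(bsz, -1).t().contiguous()
    return data.to(device)
```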
- en: Functions to generate input and target sequence
id: totrans-22
prefs:
- PREF_H3
type: TYPE_NORMAL
zh: 生成输入和目标序列的函数
- en: '`get_batch()` function generates the input and target sequence for the transformer
model. It subdivides the source data into chunks of length `bptt`. For the language
modeling task, the model needs the following words as `Target`. For example, with
a `bptt` value of 2, we’d get the following two Variables for `i` = 0:'
id: totrans-23
prefs: []
type: TYPE_NORMAL
zh: '`get_batch()`函数为transformer模型生成输入和目标序列。它将源数据细分为长度为`bptt`的块。对于语言建模任务,模型需要以下单词作为`Target`。例如,对于`bptt`值为2,我们会得到`i`
= 0时的以下两个变量:'
- en: '![../_images/transformer_input_target.png](../Images/20ef8681366b44461cf49d1ab98ab8f2.png)'
id: totrans-24
prefs: []
type: TYPE_IMG
zh: '![../_images/transformer_input_target.png](../Images/20ef8681366b44461cf49d1ab98ab8f2.png)'
- en: It should be noted that the chunks are along dimension 0, consistent with the
`S` dimension in the Transformer model. The batch dimension `N` is along dimension
1.
id: totrans-25
prefs: []
type: TYPE_NORMAL
zh: 应该注意到,块沿着维度0,与Transformer模型中的`S`维度一致。批量维度`N`沿着维度1。
- en: '[PRE3]'
id: totrans-26
prefs: []
type: TYPE_PRE
zh: '[PRE3]'
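A sketch of `get_batch()` consistent with the description above; the concrete `bptt` value of 35 is an assumption (the text only uses 2 in its toy example).

```python
bptt = 35  # chunk length; assumed value

def get_batch(source, i):
    # source: [full_seq_len, batch_size]; take a chunk of at most bptt rows.
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i + seq_len]
    # Target is the same chunk shifted by one token, flattened for the loss.
    target = source[i + 1:i + 1 + seq_len].reshape(-1)
    return data, target
```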
- en: Model scale and Pipe initialization
id: totrans-27
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 模型规模和Pipe初始化
- en: To demonstrate training large Transformer models using pipeline parallelism,
we scale up the Transformer layers appropriately. We use an embedding dimension
of 4096, hidden size of 4096, 16 attention heads and 12 total transformer layers
(`nn.TransformerEncoderLayer`). This creates a model with **~1.4 billion** parameters.
id: totrans-28
prefs: []
type: TYPE_NORMAL
zh: 为了展示使用管道并行性训练大型Transformer模型,我们适当地扩展了Transformer层。我们使用了4096的嵌入维度,4096的隐藏大小,16个注意力头和12个总的Transformer层(`nn.TransformerEncoderLayer`)。这创建了一个拥有**~14亿**参数的模型。
- en: We need to initialize the [RPC Framework](https://pytorch.org/docs/stable/rpc.html)
since Pipe depends on the RPC framework via [RRef](https://pytorch.org/docs/stable/rpc.html#rref)
which allows for future expansion to cross host pipelining. We need to initialize
the RPC framework with only a single worker since we’re using a single process
to drive multiple GPUs.
id: totrans-29
prefs: []
type: TYPE_NORMAL
zh: 我们需要初始化[RPC框架](https://pytorch.org/docs/stable/rpc.html),因为Pipe依赖于RPC框架通过[RRef](https://pytorch.org/docs/stable/rpc.html#rref)进行跨主机流水线扩展。我们需要仅使用单个worker初始化RPC框架,因为我们使用单个进程来驱动多个GPU。
- en: The pipeline is then initialized with 6 transformer layers on one GPU and 6
transformer layers on the other GPU.
id: totrans-30
prefs: []
type: TYPE_NORMAL
zh: 然后,在一个GPU上初始化6个Transformer层,在另一个GPU上初始化6个Transformer层。
- en: Note
id: totrans-31
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: For efficiency purposes we ensure that the `nn.Sequential` passed to `Pipe`
only consists of two elements (corresponding to two GPUs), this allows the Pipe
to work with only two partitions and avoid any cross-partition overheads.
id: totrans-32
prefs: []
type: TYPE_NORMAL
zh: 为了提高效率,我们确保传递给`Pipe`的`nn.Sequential`只包含两个元素(对应两个GPU),这允许Pipe仅使用两个分区并避免任何跨分区的开销。
- en: '[PRE4]'
id: totrans-33
prefs: []
type: TYPE_PRE
zh: '[PRE4]'
- en: '[PRE5]'
id: totrans-34
prefs: []
type: TYPE_PRE
zh: '[PRE5]'
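The setup elided as `[PRE4]`/`[PRE5]` can be condensed to three steps: initialize RPC with a single worker, build two `nn.Sequential` partitions placed on `cuda:0` and `cuda:1`, and wrap them in `Pipe`. The sketch below assumes `Encoder`, `Decoder` and `ntokens` from the elided model and data code; the dropout value and `chunks=8` are assumptions.

```python
import os
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe relies on the RPC framework; a single local worker is enough for one process.
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "29500"
rpc.init_rpc("worker", rank=0, world_size=1)

emsize, nhid, nlayers, nhead, dropout = 4096, 4096, 12, 16, 0.2
half = nlayers // 2

# Stage 0: embedding/positional encoding plus the first half of the encoder layers on cuda:0.
stage0 = nn.Sequential(
    Encoder(ntokens, emsize, dropout),  # Encoder/Decoder are the modules defined earlier
    *[nn.TransformerEncoderLayer(emsize, nhead, nhid, dropout) for _ in range(half)],
).cuda(0)

# Stage 1: the remaining encoder layers plus the decoder on cuda:1.
stage1 = nn.Sequential(
    *[nn.TransformerEncoderLayer(emsize, nhead, nhid, dropout) for _ in range(nlayers - half)],
    Decoder(ntokens, emsize),
).cuda(1)

# Two partitions map onto two GPUs; chunks is the number of micro-batches per input.
model = Pipe(nn.Sequential(stage0, stage1), chunks=8)
```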
- en: Run the model
id: totrans-35
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 运行模型
- en: '[CrossEntropyLoss](https://pytorch.org/docs/master/nn.html?highlight=crossentropyloss#torch.nn.CrossEntropyLoss)
is applied to track the loss and [SGD](https://pytorch.org/docs/master/optim.html?highlight=sgd#torch.optim.SGD)
implements stochastic gradient descent method as the optimizer. The initial learning
rate is set to 5.0\. [StepLR](https://pytorch.org/docs/master/optim.html?highlight=steplr#torch.optim.lr_scheduler.StepLR)
is applied to adjust the learning rate through epochs. During the training, we use
[nn.utils.clip_grad_norm_](https://pytorch.org/docs/master/nn.html?highlight=nn%20utils%20clip_grad_norm#torch.nn.utils.clip_grad_norm_)
function to scale all the gradient together to prevent exploding.'
id: totrans-36
prefs: []
type: TYPE_NORMAL
zh: '[CrossEntropyLoss](https://pytorch.org/docs/master/nn.html?highlight=crossentropyloss#torch.nn.CrossEntropyLoss)用于跟踪损失,[SGD](https://pytorch.org/docs/master/optim.html?highlight=sgd#torch.optim.SGD)实现随机梯度下降方法作为优化器。初始学习率设置为5.0。[StepLR](https://pytorch.org/docs/master/optim.html?highlight=steplr#torch.optim.lr_scheduler.StepLR)用于通过epoch调整学习率。在训练期间,我们使用[nn.utils.clip_grad_norm_](https://pytorch.org/docs/master/nn.html?highlight=nn%20utils%20clip_grad_norm#torch.nn.utils.clip_grad_norm_)函数将所有梯度一起缩放,以防止梯度爆炸。'
- en: '[PRE6]'
id: totrans-37
prefs: []
type: TYPE_PRE
zh: '[PRE6]'
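The training code is elided as `[PRE6]`; its core pieces follow directly from the description: a `CrossEntropyLoss` criterion, SGD with an initial learning rate of 5.0, a `StepLR` scheduler, and gradient-norm clipping in the step. In the sketch below, the scheduler's `gamma`, the clipping threshold of 0.5, and placing the targets on `cuda:1` (the device of the last pipeline stage) are assumptions.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
lr = 5.0  # initial learning rate, as stated above
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.95)

def train_step(data, targets):
    # data is expected on cuda:0, the device of the first pipeline stage.
    optimizer.zero_grad()
    # Pipe returns an RRef to the output of the last stage; fetch its local value.
    output = model(data).local_value()
    # The last stage lives on cuda:1, so the targets are moved there for the loss.
    loss = criterion(output.view(-1, ntokens), targets.cuda(1))
    loss.backward()
    # Scale all gradients together to prevent them from exploding.
    torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
    optimizer.step()
    return loss.item()
```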
- en: Loop over epochs. Save the model if the validation loss is the best we’ve seen
so far. Adjust the learning rate after each epoch.
id: totrans-38
prefs: []
type: TYPE_NORMAL
zh: 循环遍历各个epoch。如果验证损失是迄今为止最好的,则保存模型。每个epoch后调整学习率。
- en: '[PRE7]'
id: totrans-39
prefs: []
type: TYPE_PRE
zh: '[PRE7]'
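A sketch of this epoch loop, assuming `train()`, `evaluate()` and the validation split come from the elided code and the `scheduler` from the sketch above; the number of epochs is an assumption.

```python
import copy
import math

best_val_loss = float("inf")
best_model = None
epochs = 3  # assumed number of epochs

for epoch in range(1, epochs + 1):
    train()                        # one full pass over the training data
    val_loss = evaluate(val_data)  # evaluation helper from the elided code
    print(f"| end of epoch {epoch} | valid loss {val_loss:5.2f} "
          f"| valid ppl {math.exp(val_loss):8.2f}")
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_model = copy.deepcopy(model)  # keep the best model seen so far
    scheduler.step()  # decay the learning rate after each epoch
```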
- en: '[PRE8]'
id: totrans-40
prefs: []
type: TYPE_PRE
zh: '[PRE8]'
- en: Evaluate the model with the test dataset
id: totrans-41
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 用测试数据集评估模型
- en: Apply the best model to check the result with the test dataset.
id: totrans-42
prefs: []
type: TYPE_NORMAL
zh: 应用最佳模型来检查与测试数据集的结果。
- en: '[PRE9]'
id: totrans-43
prefs: []
type: TYPE_PRE
zh: '[PRE9]'
- en: '[PRE10]'
id: totrans-44
prefs: []
type: TYPE_PRE
zh: '[PRE10]'
- en: '**Total running time of the script:** ( 8 minutes 5.064 seconds)'
id: totrans-45
prefs: []
type: TYPE_NORMAL
zh: '**脚本的总运行时间:**(8分钟5.064秒)'
- en: '[`Download Python source code: pipeline_tutorial.py`](../_downloads/b4afbcfb1c1ac5f5cd7da108c2236f09/pipeline_tutorial.py)'
id: totrans-46
prefs: []
type: TYPE_NORMAL
zh: '[`下载Python源代码:pipeline_tutorial.py`](../_downloads/b4afbcfb1c1ac5f5cd7da108c2236f09/pipeline_tutorial.py)'
- en: '[`Download Jupyter notebook: pipeline_tutorial.ipynb`](../_downloads/4cefa4723023eb5d85ed047dadc7f491/pipeline_tutorial.ipynb)'
id: totrans-47
prefs: []
type: TYPE_NORMAL
zh: '[`下载Jupyter笔记本:pipeline_tutorial.ipynb`](../_downloads/4cefa4723023eb5d85ed047dadc7f491/pipeline_tutorial.ipynb)'
- en: '[Gallery generated by Sphinx-Gallery](https://sphinx-gallery.github.io)'
id: totrans-48
prefs: []
type: TYPE_NORMAL
zh: '[由Sphinx-Gallery生成的图库](https://sphinx-gallery.github.io)'
- en: Training Transformer models using Distributed Data Parallel and Pipeline Parallelism
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 使用Distributed Data Parallel和Pipeline Parallelism训练Transformer模型
- en: 原文:[https://pytorch.org/tutorials/advanced/ddp_pipeline.html](https://pytorch.org/tutorials/advanced/ddp_pipeline.html)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
zh: 原文:[https://pytorch.org/tutorials/advanced/ddp_pipeline.html](https://pytorch.org/tutorials/advanced/ddp_pipeline.html)
- en: Note
id: totrans-2
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: Click [here](#sphx-glr-download-advanced-ddp-pipeline-py) to download the full
example code
id: totrans-3
prefs: []
type: TYPE_NORMAL
zh: 点击[这里](#sphx-glr-download-advanced-ddp-pipeline-py)下载完整示例代码
- en: '**Author**: [Pritam Damania](https://github.com/pritamdamania87)'
id: totrans-4
prefs: []
type: TYPE_NORMAL
zh: '**作者**:[Pritam Damania](https://github.com/pritamdamania87)'
- en: This tutorial demonstrates how to train a large Transformer model across multiple
GPUs using [Distributed Data Parallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)
and [Pipeline Parallelism](https://pytorch.org/docs/stable/pipeline.html). This
tutorial is an extension of the [Sequence-to-Sequence Modeling with nn.Transformer
and TorchText](https://pytorch.org/tutorials/beginner/transformer_tutorial.html)
tutorial and scales up the same model to demonstrate how Distributed Data Parallel
and Pipeline Parallelism can be used to train Transformer models.
id: totrans-5
prefs: []
type: TYPE_NORMAL
zh: 本教程演示了如何使用[Distributed Data Parallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)和[Pipeline
Parallelism](https://pytorch.org/docs/stable/pipeline.html)在多个GPU上训练大型Transformer模型。本教程是[使用nn.Transformer和TorchText进行序列到序列建模](https://pytorch.org/tutorials/beginner/transformer_tutorial.html)教程的延伸,扩展了相同的模型以演示如何使用Distributed
Data Parallel和Pipeline Parallelism来训练Transformer模型。
- en: 'Prerequisites:'
id: totrans-6
prefs: []
type: TYPE_NORMAL
zh: 先决条件:
- en: '[Pipeline Parallelism](https://pytorch.org/docs/stable/pipeline.html)'
id: totrans-7
prefs:
- PREF_BQ
- PREF_UL
type: TYPE_NORMAL
zh: '[管道并行](https://pytorch.org/docs/stable/pipeline.html)'
- en: ''
id: totrans-8
prefs:
- PREF_BQ
- PREF_IND
type: TYPE_NORMAL
- en: ''
id: totrans-9
prefs:
- PREF_BQ
- PREF_IND
type: TYPE_NORMAL
- en: '[Sequence-to-Sequence Modeling with nn.Transformer and TorchText](https://pytorch.org/tutorials/beginner/transformer_tutorial.html)'
id: totrans-10
prefs:
- PREF_BQ
- PREF_UL
type: TYPE_NORMAL
zh: '[使用nn.Transformer和TorchText进行序列到序列建模](https://pytorch.org/tutorials/beginner/transformer_tutorial.html)'
- en: ''
id: totrans-11
prefs:
- PREF_BQ
- PREF_IND
type: TYPE_NORMAL
- en: ''
id: totrans-12
prefs:
- PREF_BQ
- PREF_IND
type: TYPE_NORMAL
- en: '[Getting Started with Distributed Data Parallel](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)'
id: totrans-13
prefs:
- PREF_BQ
- PREF_UL
type: TYPE_NORMAL
zh: '[开始使用分布式数据并行](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)'
- en: Define the model
id: totrans-14
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 定义模型
- en: '`PositionalEncoding` module injects some information about the relative or
absolute position of the tokens in the sequence. The positional encodings have
the same dimension as the embeddings so that the two can be summed. Here, we use
`sine` and `cosine` functions of different frequencies.'
id: totrans-15
prefs: []
type: TYPE_NORMAL
zh: '`PositionalEncoding` 模块向序列中的令牌注入了一些关于相对或绝对位置的信息。位置编码与嵌入的维度相同,因此可以将两者相加。在这里,我们使用不同频率的
`sine` `cosine` 函数。'
- en: '[PRE0]'
id: totrans-16
prefs: []
type: TYPE_PRE
zh: '[PRE0]'
- en: In this tutorial, we will split a Transformer model across two GPUs and use
pipeline parallelism to train the model. In addition to this, we use [Distributed
Data Parallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)
to train two replicas of this pipeline. One process drives a pipe across GPUs
0 and 1 and another process drives a pipe across GPUs 2 and 3\. Both these processes
then use Distributed Data Parallel to train the two replicas. The model is exactly
the same model used in the [Sequence-to-Sequence Modeling with nn.Transformer
and TorchText](https://pytorch.org/tutorials/beginner/transformer_tutorial.html)
tutorial, but is split into two stages. The largest number of parameters belong
to the [nn.TransformerEncoder](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html)
layer. The [nn.TransformerEncoder](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html)
itself consists of `nlayers` of [nn.TransformerEncoderLayer](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html).
As a result, our focus is on `nn.TransformerEncoder` and we split the model such
that half of the `nn.TransformerEncoderLayer` are on one GPU and the other half
are on another. To do this, we pull out the `Encoder` and `Decoder` sections into
separate modules and then build an `nn.Sequential` representing the original Transformer
module.
id: totrans-17
prefs: []
type: TYPE_NORMAL
zh: 在本教程中,我们将一个Transformer模型分割到两个GPU上,并使用管道并行来训练模型。除此之外,我们使用[Distributed Data Parallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)来训练这个管道的两个副本。我们有一个进程在GPU
0和1之间驱动一个管道,另一个进程在GPU 2和3之间驱动一个管道。然后,这两个进程使用Distributed Data Parallel来训练这两个副本。模型与[使用nn.Transformer和TorchText进行序列到序列建模](https://pytorch.org/tutorials/beginner/transformer_tutorial.html)教程中使用的模型完全相同,但被分成了两个阶段。最多的参数属于[nn.TransformerEncoder](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html)层。[nn.TransformerEncoder](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html)本身由`nlayers`个[nn.TransformerEncoderLayer](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html)组成。因此,我们的重点是`nn.TransformerEncoder`,我们将模型分割成一半的`nn.TransformerEncoderLayer`在一个GPU上,另一半在另一个GPU上。为此,我们将`Encoder`和`Decoder`部分提取到单独的模块中,然后构建一个代表原始Transformer模块的`nn.Sequential`。
- en: '[PRE1]'
id: totrans-18
prefs: []
type: TYPE_PRE
zh: '[PRE1]'
- en: Start multiple processes for training
id: totrans-19
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 启动多个进程进行训练
- en: We start two processes where each process drives its own pipeline across two
GPUs. `run_worker` is executed for each process.
id: totrans-20
prefs: []
type: TYPE_NORMAL
zh: 我们启动两个进程,每个进程在两个GPU上驱动自己的管道。对于每个进程,都会执行`run_worker`。
- en: '[PRE2]'
id: totrans-21
prefs: []
type: TYPE_PRE
zh: '[PRE2]'
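The spawning code is elided as `[PRE2]`; it essentially amounts to a `torch.multiprocessing.spawn` call that hands each worker its rank. A minimal sketch (the body of `run_worker` is the per-process setup described in the following sections):

```python
import torch.multiprocessing as mp

def run_worker(rank, world_size):
    # Each worker drives one pipeline replica across its own pair of GPUs:
    # rank 0 uses cuda:0/1, rank 1 uses cuda:2/3 (see the Pipe setup below).
    ...

if __name__ == "__main__":
    world_size = 2
    # One process per pipeline replica; each receives its rank as the first argument.
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size, join=True)
```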
- en: Load and batch data
id: totrans-22
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 加载和批处理数据
- en: The training process uses Wikitext-2 dataset from `torchtext`. To access torchtext
datasets, please install torchdata following instructions at [https://github.com/pytorch/data](https://github.com/pytorch/data).
id: totrans-23
prefs: []
type: TYPE_NORMAL
zh: 训练过程使用了来自`torchtext`的Wikitext-2数据集。要访问torchtext数据集,请按照[https://github.com/pytorch/data](https://github.com/pytorch/data)上的说明安装torchdata。
- en: 'The vocab object is built based on the train dataset and is used to numericalize
tokens into tensors. Starting from sequential data, the `batchify()` function
arranges the dataset into columns, trimming off any tokens remaining after the
data has been divided into batches of size `batch_size`. For instance, with the
alphabet as the sequence (total length of 26) and a batch size of 4, we would
divide the alphabet into 4 sequences of length 6:'
id: totrans-24
prefs: []
type: TYPE_NORMAL
zh: vocab 对象是基于训练数据集构建的,并用于将令牌数值化为张量。从顺序数据开始,`batchify()` 函数将数据集排列成列,将数据分成大小为 `batch_size`
的批次后,修剪掉任何剩余的令牌。例如,对于字母表作为序列(总长度为26)和批次大小为4,我们将字母表分成长度为6的4个序列:
- en: \[ \begin{bmatrix} \text{A} & \text{B} & \text{C} & \ldots & \text{X} & \text{Y}
& \text{Z} \end{bmatrix} \Rightarrow \begin{bmatrix} \begin{bmatrix}\text{A} \\
\text{B} \\ \text{C} \\ \text{D} \\ \text{E} \\ \text{F}\end{bmatrix} & \begin{bmatrix}\text{G}
\\ \text{H} \\ \text{I} \\ \text{J} \\ \text{K} \\ \text{L}\end{bmatrix} & \begin{bmatrix}\text{M}
\\ \text{N} \\ \text{O} \\ \text{P} \\ \text{Q} \\ \text{R}\end{bmatrix} & \begin{bmatrix}\text{S}
\\ \text{T} \\ \text{U} \\ \text{V} \\ \text{W} \\ \text{X}\end{bmatrix} \end{bmatrix}\]
id: totrans-25
prefs: []
type: TYPE_NORMAL
zh: \[ \begin{bmatrix} \text{A} & \text{B} & \text{C} & \ldots & \text{X} & \text{Y}
& \text{Z} \end{bmatrix} \Rightarrow \begin{bmatrix} \begin{bmatrix}\text{A} \\
\text{B} \\ \text{C} \\ \text{D} \\ \text{E} \\ \text{F}\end{bmatrix} & \begin{bmatrix}\text{G}
\\ \text{H} \\ \text{I} \\ \text{J} \\ \text{K} \\ \text{L}\end{bmatrix} & \begin{bmatrix}\text{M}
\\ \text{N} \\ \text{O} \\ \text{P} \\ \text{Q} \\ \text{R}\end{bmatrix} & \begin{bmatrix}\text{S}
\\ \text{T} \\ \text{U} \\ \text{V} \\ \text{W} \\ \text{X}\end{bmatrix} \end{bmatrix}\]
- en: These columns are treated as independent by the model, which means that the
dependence of `G` and `F` can not be learned, but allows more efficient batch
processing.
id: totrans-26
prefs: []
type: TYPE_NORMAL
zh: 这些列被模型视为独立的,这意味着`G`和`F`之间的依赖关系无法被学习,但可以实现更高效的批处理。
- en: '[PRE3]'
id: totrans-27
prefs: []
type: TYPE_PRE
zh: '[PRE3]'
- en: Functions to generate input and target sequence
id: totrans-28
prefs:
- PREF_H3
type: TYPE_NORMAL
zh: 生成输入和目标序列的函数
- en: '`get_batch()` function generates the input and target sequence for the transformer
model. It subdivides the source data into chunks of length `bptt`. For the language
modeling task, the model needs the following words as `Target`. For example, with
a `bptt` value of 2, we’d get the following two Variables for `i` = 0:'
id: totrans-29
prefs: []
type: TYPE_NORMAL
zh: '`get_batch()`函数为transformer模型生成输入和目标序列。它将源数据细分为长度为`bptt`的块。对于语言建模任务,模型需要以下单词作为`Target`。例如,当`bptt`值为2时,对于`i` = 0,我们会得到以下两个变量:'
- en: '![../_images/transformer_input_target.png](../Images/20ef8681366b44461cf49d1ab98ab8f2.png)'
id: totrans-30
prefs: []
type: TYPE_IMG
zh: '![../_images/transformer_input_target.png](../Images/20ef8681366b44461cf49d1ab98ab8f2.png)'
- en: It should be noted that the chunks are along dimension 0, consistent with the
`S` dimension in the Transformer model. The batch dimension `N` is along dimension
1.
id: totrans-31
prefs: []
type: TYPE_NORMAL
zh: 值得注意的是,这些块沿着维度0,与Transformer模型中的`S`维度一致。批处理维度`N`沿着维度1。
- en: '[PRE4]'
id: totrans-32
prefs: []
type: TYPE_PRE
zh: '[PRE4]'
- en: Model scale and Pipe initialization
id: totrans-33
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 模型规模和Pipe初始化
- en: To demonstrate training large Transformer models using pipeline parallelism,
we scale up the Transformer layers appropriately. We use an embedding dimension
of 4096, hidden size of 4096, 16 attention heads and 8 total transformer layers
(`nn.TransformerEncoderLayer`). This creates a model with **~1 billion** parameters.
id: totrans-34
prefs: []
type: TYPE_NORMAL
zh: 为了演示使用管道并行性训练大型Transformer模型,我们适当地扩展了Transformer层。我们使用4096的嵌入维度、4096的隐藏大小、16个注意力头以及总共8个Transformer层(`nn.TransformerEncoderLayer`)。这创建了一个具有**~10亿**参数的模型。
- en: We need to initialize the [RPC Framework](https://pytorch.org/docs/stable/rpc.html)
since Pipe depends on the RPC framework via [RRef](https://pytorch.org/docs/stable/rpc.html#rref)
which allows for future expansion to cross host pipelining. We need to initialize
the RPC framework with only a single worker since we’re using a single process
to drive multiple GPUs.
id: totrans-35
prefs: []
type: TYPE_NORMAL
zh: 我们需要初始化[RPC框架](https://pytorch.org/docs/stable/rpc.html),因为Pipe依赖于RPC框架通过[RRef](https://pytorch.org/docs/stable/rpc.html#rref)允许未来扩展到跨主机流水线。我们需要使用单个worker初始化RPC框架,因为我们使用单个进程驱动多个GPU。
- en: The pipeline is then initialized with 4 transformer layers on one GPU and 4
transformer layers on the other GPU. One pipe is set up across GPUs 0 and 1 and
another across GPUs 2 and 3\. Both pipes are then replicated using `DistributedDataParallel`.
id: totrans-36
prefs: []
type: TYPE_NORMAL
zh: 然后,在一个GPU上初始化4个Transformer层,在另一个GPU上初始化4个Transformer层。一个管道设置在GPU 0和1之间,另一个设置在GPU 2和3之间。然后使用`DistributedDataParallel`复制这两个管道。
- en: '[PRE5]'
id: totrans-37
prefs: []
type: TYPE_PRE
zh: '[PRE5]'
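The block elided as `[PRE5]` combines the pieces described above: a per-process NCCL process group for DDP, a single-worker RPC instance for Pipe, and the pipeline wrapped in `DistributedDataParallel`. The helper below is a hypothetical sketch; it assumes `stages` is an `nn.Sequential` of two partitions already placed on this worker's pair of GPUs, and the file-based RPC init and `chunks=8` are assumptions.

```python
import os
import tempfile
import torch.distributed as dist
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_pipe_and_ddp(rank, world_size, stages):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"

    # Process group used by DistributedDataParallel to average gradients across replicas.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # Pipe needs the RPC framework; one local worker per process is enough here.
    tmpfile = tempfile.NamedTemporaryFile()
    rpc.init_rpc(
        "worker", rank=0, world_size=1,
        rpc_backend_options=rpc.TensorPipeRpcBackendOptions(
            init_method=f"file://{tmpfile.name}"),
    )

    # checkpoint="never": Pipe's activation checkpointing does not combine with DDP.
    model = Pipe(stages, chunks=8, checkpoint="never")
    # DDP keeps the two pipeline replicas (rank 0 and rank 1) in sync.
    return DDP(model)
```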
- en: Run the model
id: totrans-38
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 运行模型
- en: '[CrossEntropyLoss](https://pytorch.org/docs/master/nn.html?highlight=crossentropyloss#torch.nn.CrossEntropyLoss)
is applied to track the loss and [SGD](https://pytorch.org/docs/master/optim.html?highlight=sgd#torch.optim.SGD)
implements stochastic gradient descent method as the optimizer. The initial learning
rate is set to 5.0\. [StepLR](https://pytorch.org/docs/master/optim.html?highlight=steplr#torch.optim.lr_scheduler.StepLR)
is applied to adjust the learning rate through epochs. During the training, we use
[nn.utils.clip_grad_norm_](https://pytorch.org/docs/master/nn.html?highlight=nn%20utils%20clip_grad_norm#torch.nn.utils.clip_grad_norm_)
function to scale all the gradient together to prevent exploding.'
id: totrans-39
prefs: []
type: TYPE_NORMAL
zh: '[交叉熵损失](https://pytorch.org/docs/master/nn.html?highlight=crossentropyloss#torch.nn.CrossEntropyLoss)用于跟踪损失,[SGD](https://pytorch.org/docs/master/optim.html?highlight=sgd#torch.optim.SGD)实现随机梯度下降方法作为优化器。初始学习率设置为5.0。[StepLR](https://pytorch.org/docs/master/optim.html?highlight=steplr#torch.optim.lr_scheduler.StepLR)用于通过epochs调整学习率。在训练过程中,我们使用[nn.utils.clip_grad_norm_](https://pytorch.org/docs/master/nn.html?highlight=nn%20utils%20clip_grad_norm#torch.nn.utils.clip_grad_norm_)函数将所有梯度一起缩放,以防止梯度爆炸。'
- en: '[PRE6]'
id: totrans-40
prefs: []
type: TYPE_PRE
zh: '[PRE6]'
- en: Loop over epochs. Save the model if the validation loss is the best we’ve seen
so far. Adjust the learning rate after each epoch.
id: totrans-41
prefs: []
type: TYPE_NORMAL
zh: 循环遍历epochs。如果验证损失是迄今为止看到的最佳损失,则保存模型。每个epoch后调整学习率。
- en: '[PRE7]'
id: totrans-42
prefs: []
type: TYPE_PRE
zh: '[PRE7]'
- en: Evaluate the model with the test dataset
id: totrans-43
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 用测试数据集评估模型
- en: Apply the best model to check the result with the test dataset.
id: totrans-44
prefs: []
type: TYPE_NORMAL
zh: 将最佳模型应用于测试数据集以检查结果。
- en: '[PRE8]'
id: totrans-45
prefs: []
type: TYPE_PRE
zh: '[PRE8]'
- en: Output
id: totrans-46
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 输出
- en: '[PRE9]'
id: totrans-47
prefs: []
type: TYPE_PRE
zh: '[PRE9]'
- en: '**Total running time of the script:** ( 0 minutes 0.000 seconds)'
id: totrans-48
prefs: []
type: TYPE_NORMAL
zh: '**脚本的总运行时间:**(0分钟0.000秒)'
- en: '[`Download Python source code: ddp_pipeline.py`](../_downloads/a4d9c51b5b801ca67ec48cde53047460/ddp_pipeline.py)'
id: totrans-49
prefs: []
type: TYPE_NORMAL
zh: '[`下载Python源代码:ddp_pipeline.py`](../_downloads/a4d9c51b5b801ca67ec48cde53047460/ddp_pipeline.py)'
- en: '[`Download Jupyter notebook: ddp_pipeline.ipynb`](../_downloads/9c42ef95b5e306580f45ed7f652191bf/ddp_pipeline.ipynb)'
id: totrans-50
prefs: []
type: TYPE_NORMAL
zh: '[`下载Jupyter笔记本:ddp_pipeline.ipynb`](../_downloads/9c42ef95b5e306580f45ed7f652191bf/ddp_pipeline.ipynb)'
- en: '[Gallery generated by Sphinx-Gallery](https://sphinx-gallery.github.io)'
id: totrans-51
prefs: []
type: TYPE_NORMAL
zh: '[Sphinx-Gallery生成的图库](https://sphinx-gallery.github.io)'
- en: Mobile
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 移动设备
- en: Recommendation Systems
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 推荐系统
- en: Introduction to TorchRec
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: TorchRec简介
- en: 原文:[https://pytorch.org/tutorials/intermediate/torchrec_tutorial.html](https://pytorch.org/tutorials/intermediate/torchrec_tutorial.html)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
zh: 原文:[https://pytorch.org/tutorials/intermediate/torchrec_tutorial.html](https://pytorch.org/tutorials/intermediate/torchrec_tutorial.html)
- en: Tip
id: totrans-2
prefs: []
type: TYPE_NORMAL
zh: 提示
- en: To get the most of this tutorial, we suggest using this [Colab Version](https://colab.research.google.com/github/pytorch/torchrec/blob/main/Torchrec_Introduction.ipynb).
This will allow you to experiment with the information presented below.
id: totrans-3
prefs: []
type: TYPE_NORMAL
zh: 为了充分利用本教程,我们建议使用这个[Colab版本](https://colab.research.google.com/github/pytorch/torchrec/blob/main/Torchrec_Introduction.ipynb)。这将使您能够尝试下面提供的信息。
- en: Follow along with the video below or on [youtube](https://www.youtube.com/watch?v=cjgj41dvSeQ).
id: totrans-4
prefs: []
type: TYPE_NORMAL
zh: 请跟随下面的视频或在[youtube](https://www.youtube.com/watch?v=cjgj41dvSeQ)上观看。
- en: '[https://www.youtube.com/embed/cjgj41dvSeQ](https://www.youtube.com/embed/cjgj41dvSeQ)'
id: totrans-5
prefs: []
type: TYPE_NORMAL
zh: '[https://www.youtube.com/embed/cjgj41dvSeQ](https://www.youtube.com/embed/cjgj41dvSeQ)'
- en: When building recommendation systems, we frequently want to represent entities
like products or pages with embeddings. For example, see Meta AI’s [Deep learning
recommendation model](https://arxiv.org/abs/1906.00091), or DLRM. As the number
of entities grows, the size of the embedding tables can exceed a single GPU’s memory.
A common practice is to shard the embedding table across devices, a type of model
parallelism. To that end, TorchRec introduces its primary API called [`DistributedModelParallel`](https://pytorch.org/torchrec/torchrec.distributed.html#torchrec.distributed.model_parallel.DistributedModelParallel),
or DMP. Like PyTorch’s DistributedDataParallel, DMP wraps a model to enable distributed
training.
id: totrans-6
prefs: []
type: TYPE_NORMAL
zh: 在构建推荐系统时,我们经常希望用嵌入来表示产品或页面等实体。例如,参见Meta AI的[深度学习推荐模型](https://arxiv.org/abs/1906.00091),或DLRM。随着实体数量的增长,嵌入表的大小可能超过单个GPU的内存。一种常见做法是将嵌入表分片到不同设备上,这是一种模型并行的类型。为此,TorchRec引入了其主要API称为[`DistributedModelParallel`](https://pytorch.org/torchrec/torchrec.distributed.html#torchrec.distributed.model_parallel.DistributedModelParallel),或DMP。与PyTorch的DistributedDataParallel类似,DMP包装了一个模型以实现分布式训练。
- en: Installation
id: totrans-7
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 安装
- en: 'Requirements: python >= 3.7'
id: totrans-8
prefs: []
type: TYPE_NORMAL
zh: 要求:python >= 3.7
- en: 'We highly recommend CUDA when using TorchRec (If using CUDA: cuda >= 11.0).'
id: totrans-9
prefs: []
type: TYPE_NORMAL
zh: 在使用TorchRec时,我们强烈建议使用CUDA(如果使用CUDA:cuda >= 11.0)。
- en: '[PRE0]'
id: totrans-10
prefs: []
type: TYPE_PRE
zh: '[PRE0]'
- en: Overview
id: totrans-11
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 概述
- en: 'This tutorial will cover three pieces of TorchRec: the `nn.module` [`EmbeddingBagCollection`](https://pytorch.org/torchrec/torchrec.modules.html#torchrec.modules.embedding_modules.EmbeddingBagCollection),
the [`DistributedModelParallel`](https://pytorch.org/torchrec/torchrec.distributed.html#torchrec.distributed.model_parallel.DistributedModelParallel)
API, and the datastructure [`KeyedJaggedTensor`](https://pytorch.org/torchrec/torchrec.sparse.html#torchrec.sparse.jagged_tensor.JaggedTensor).'
id: totrans-12
prefs: []
type: TYPE_NORMAL
zh: 本教程将涵盖TorchRec的三个部分:`nn.module` [`EmbeddingBagCollection`](https://pytorch.org/torchrec/torchrec.modules.html#torchrec.modules.embedding_modules.EmbeddingBagCollection),[`DistributedModelParallel`](https://pytorch.org/torchrec/torchrec.distributed.html#torchrec.distributed.model_parallel.DistributedModelParallel)
API和数据结构[`KeyedJaggedTensor`](https://pytorch.org/torchrec/torchrec.sparse.html#torchrec.sparse.jagged_tensor.JaggedTensor)。
- en: Distributed Setup
id: totrans-13
prefs:
- PREF_H3
type: TYPE_NORMAL
zh: 分布式设置
- en: We setup our environment with torch.distributed. For more info on distributed,
see this [tutorial](https://pytorch.org/tutorials/beginner/dist_overview.html).
id: totrans-14
prefs: []
type: TYPE_NORMAL
zh: 我们使用torch.distributed设置我们的环境。有关分布式的更多信息,请参见此[tutorial](https://pytorch.org/tutorials/beginner/dist_overview.html)。
- en: Here, we use one rank (the colab process) corresponding to our 1 colab GPU.
id: totrans-15
prefs: []
type: TYPE_NORMAL
zh: 在这里,我们使用一个rank(colab进程)对应于我们的1个colab GPU。
- en: '[PRE1]'
id: totrans-16
prefs: []
type: TYPE_PRE
zh: '[PRE1]'
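The setup elided as `[PRE1]` roughly amounts to initializing a one-rank process group; a sketch (the port number is an assumption):

```python
import os
import torch
import torch.distributed as dist

# Single-rank setup matching the one-GPU environment described above.
os.environ["RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "29500"

# nccl is the backend to use on GPU; gloo would be the CPU fallback.
dist.init_process_group(backend="nccl")
device = torch.device("cuda")
```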
- en: From EmbeddingBag to EmbeddingBagCollection
id: totrans-17
prefs:
- PREF_H3
type: TYPE_NORMAL
zh: 从EmbeddingBag到EmbeddingBagCollection
- en: PyTorch represents embeddings through [`torch.nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html)
and [`torch.nn.EmbeddingBag`](https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html).
EmbeddingBag is a pooled version of Embedding.
id: totrans-18
prefs: []
type: TYPE_NORMAL
zh: PyTorch通过[`torch.nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html)和[`torch.nn.EmbeddingBag`](https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html)来表示嵌入。EmbeddingBag是Embedding的池化版本。
- en: TorchRec extends these modules by creating collections of embeddings. We will
use [`EmbeddingBagCollection`](https://pytorch.org/torchrec/torchrec.modules.html#torchrec.modules.embedding_modules.EmbeddingBagCollection)
to represent a group of EmbeddingBags.
id: totrans-19
prefs: []
type: TYPE_NORMAL
zh: TorchRec通过创建嵌入的集合来扩展这些模块。我们将使用[`EmbeddingBagCollection`](https://pytorch.org/torchrec/torchrec.modules.html#torchrec.modules.embedding_modules.EmbeddingBagCollection)来表示一组EmbeddingBags。
- en: Here, we create an EmbeddingBagCollection (EBC) with two embedding bags. Each
table, `product_table` and `user_table`, is represented by a 64 dimension embedding
of size 4096\. Note how we initially allocate the EBC on device “meta”. This will
tell EBC to not allocate memory yet.
id: totrans-20
prefs: []
type: TYPE_NORMAL
zh: 在这里,我们创建了一个包含两个嵌入包的EmbeddingBagCollection(EBC)。每个表,`product_table`和`user_table`,由大小为4096的64维嵌入表示。请注意,我们最初将EBC分配到设备“meta”。这将告诉EBC暂时不要分配内存。
- en: '[PRE2]'
id: totrans-21
prefs: []
type: TYPE_PRE
zh: '[PRE2]'
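A sketch of the `EmbeddingBagCollection` described above. The table names, the 4096-by-64 shape and the "meta" device follow the text; the `SUM` pooling type and the feature names are assumptions consistent with the later examples.

```python
import torchrec

# Two 4096 x 64 tables on the "meta" device, so no real memory is allocated yet.
ebc = torchrec.EmbeddingBagCollection(
    device="meta",
    tables=[
        torchrec.EmbeddingBagConfig(
            name="product_table",
            embedding_dim=64,
            num_embeddings=4096,
            feature_names=["product"],
            pooling=torchrec.PoolingType.SUM,
        ),
        torchrec.EmbeddingBagConfig(
            name="user_table",
            embedding_dim=64,
            num_embeddings=4096,
            feature_names=["user"],
            pooling=torchrec.PoolingType.SUM,
        ),
    ],
)
```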
- en: DistributedModelParallel
id: totrans-22
prefs:
- PREF_H3
type: TYPE_NORMAL
zh: DistributedModelParallel
- en: 'Now, we’re ready to wrap our model with [`DistributedModelParallel`](https://pytorch.org/torchrec/torchrec.distributed.html#torchrec.distributed.model_parallel.DistributedModelParallel)
(DMP). Instantiating DMP will:'
id: totrans-23
prefs: []
type: TYPE_NORMAL
zh: 现在,我们准备用[`DistributedModelParallel`](https://pytorch.org/torchrec/torchrec.distributed.html#torchrec.distributed.model_parallel.DistributedModelParallel)
(DMP)包装我们的模型。实例化DMP将:
- en: Decide how to shard the model. DMP will collect the available ‘sharders’ and
come up with a ‘plan’ of the optimal way to shard the embedding table(s) (i.e.,
the EmbeddingBagCollection).
id: totrans-24
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 决定如何分片模型。DMP将收集可用的“分片器”并提出一种最佳方式来分片嵌入表(即EmbeddingBagCollection)的“计划”。
- en: Actually shard the model. This includes allocating memory for each embedding
table on the appropriate device(s).
id: totrans-25
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 实际分片模型。这包括为每个嵌入表在适当设备上分配内存。
- en: In this toy example, since we have two EmbeddingTables and one GPU, TorchRec
will place both on the single GPU.
id: totrans-26
prefs: []
type: TYPE_NORMAL
zh: 在这个示例中,由于我们有两个EmbeddingTables和一个GPU,TorchRec将两者都放在单个GPU上。
- en: '[PRE3]'
id: totrans-27
prefs: []
type: TYPE_PRE
zh: '[PRE3]'
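The wrap itself is a single call; a sketch (the explicit import path is one of several ways to reach `DistributedModelParallel`):

```python
import torch
from torchrec.distributed.model_parallel import DistributedModelParallel

# Wrapping the meta-allocated EBC generates a sharding plan and materializes each
# shard on the target device; with a single GPU, both tables land on that GPU.
model = DistributedModelParallel(ebc, device=torch.device("cuda"))
```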
- en: Query vanilla nn.EmbeddingBag with input and offsets
id: totrans-28
prefs:
- PREF_H3
type: TYPE_NORMAL
zh: 使用输入和偏移查询普通的nn.EmbeddingBag
- en: We query [`nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html)
and [`nn.EmbeddingBag`](https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html)
with `input` and `offsets`. Input is a 1-D tensor containing the lookup values.
Offsets is a 1-D tensor where the sequence is a cumulative sum of the number of
values to pool per example.
id: totrans-29
prefs: []
type: TYPE_NORMAL
zh: 我们使用`input`和`offsets`查询[`nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html)和[`nn.EmbeddingBag`](https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html)。Input是包含查找值的1-D张量。Offsets是一个1-D张量,其中序列是每个示例要汇总的值的累积和。
- en: 'Let’s look at an example, recreating the product EmbeddingBag above:'
id: totrans-30
prefs: []
type: TYPE_NORMAL
zh: 让我们看一个例子,重新创建上面的产品EmbeddingBag:
- en: '[PRE4]'
id: totrans-31
prefs: []
type: TYPE_PRE
zh: '[PRE4]'
- en: '[PRE5]'
id: totrans-32
prefs: []
type: TYPE_PRE
zh: '[PRE5]'
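A sketch of the plain `nn.EmbeddingBag` lookup described above, using the same 4096-by-64 shape as the product table; the concrete ids are arbitrary examples.

```python
import torch
import torch.nn as nn

# Same shape as the product table: 4096 rows of 64-dimensional embeddings.
product_eb = nn.EmbeddingBag(num_embeddings=4096, embedding_dim=64)

# Three examples: the first pools ids [1, 3], the second pools nothing, the third pools [2].
input = torch.tensor([1, 3, 2])
offsets = torch.tensor([0, 2, 2])  # cumulative start offset of each example within `input`

pooled = product_eb(input, offsets)  # shape: [3, 64]
```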
- en: Representing minibatches with KeyedJaggedTensor
id: totrans-33
prefs:
- PREF_H3
type: TYPE_NORMAL
zh: 使用KeyedJaggedTensor表示小批量
- en: We need an efficient representation of multiple examples of an arbitrary number
of entity IDs per feature per example. In order to enable this “jagged” representation,
we use the TorchRec datastructure [`KeyedJaggedTensor`](https://pytorch.org/torchrec/torchrec.sparse.html#torchrec.sparse.jagged_tensor.JaggedTensor)
(KJT).
id: totrans-34
prefs: []
type: TYPE_NORMAL
zh: 我们需要一个有效的表示,每个示例的每个特征中有任意数量的实体ID的多个示例。为了实现这种“不规则”表示,我们使用TorchRec数据结构[`KeyedJaggedTensor`](https://pytorch.org/torchrec/torchrec.sparse.html#torchrec.sparse.jagged_tensor.JaggedTensor)(KJT)。
- en: Let’s take a look at how to lookup a collection of two embedding bags, “product”
and “user”. Assume the minibatch is made up of three examples for three users.
The first of which has two product IDs, the second with none, and the third with
one product ID.
id: totrans-35
prefs: []
type: TYPE_NORMAL
zh: 让我们看看如何查找两个嵌入包“product”和“user”的集合。假设小批量由三个用户的三个示例组成。第一个示例有两个产品ID,第二个没有,第三个有一个产品ID。
- en: '[PRE6]'
id: totrans-36
prefs: []
type: TYPE_PRE
zh: '[PRE6]'
- en: 'The query should be:'
id: totrans-37
prefs: []
type: TYPE_NORMAL
zh: 查询应该是:
- en: '[PRE7]'
id: totrans-38
prefs: []
type: TYPE_PRE
zh: '[PRE7]'
- en: Note that the KJT batch size is `batch_size = len(lengths)//len(keys)`. In the
above example, batch_size is 3.
id: totrans-39
prefs: []
type: TYPE_NORMAL
zh: 请注意,KJT批量大小为`batch_size = len(lengths)//len(keys)`。在上面的例子中,batch_size为3。
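A sketch of that minibatch as a `KeyedJaggedTensor`: `lengths` concatenates the per-example counts for "product" ([2, 0, 1]) and "user" ([1, 1, 1]), and `values` concatenates the ids in the same order; the id values themselves are arbitrary examples. With 6 lengths and 2 keys, the batch size works out to 3, as noted above.

```python
import torch
import torchrec

mb = torchrec.KeyedJaggedTensor(
    keys=["product", "user"],
    # Product ids for examples 0..2, followed by user ids for examples 0..2.
    values=torch.tensor([101, 202, 303, 404, 505, 606]).cuda(),
    # Per-key, per-example counts: product -> [2, 0, 1], user -> [1, 1, 1].
    lengths=torch.tensor([2, 0, 1, 1, 1, 1], dtype=torch.int64).cuda(),
)
```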
- en: Putting it all together, querying our distributed model with a KJT minibatch
id: totrans-40
prefs:
- PREF_H3
type: TYPE_NORMAL
zh: 将所有内容整合在一起,使用KJT小批量查询我们的分布式模型
- en: Finally, we can query our model using our minibatch of products and users.
id: totrans-41
prefs: []
type: TYPE_NORMAL
zh: 最后,我们可以使用我们的产品和用户的小批量查询我们的模型。
- en: The resulting lookup will contain a KeyedTensor, where each key (or feature)
contains a 2D tensor of size 3x64 (batch_size x embedding_dim).
id: totrans-42
prefs: []
type: TYPE_NORMAL
zh: 结果查找将包含一个KeyedTensor,其中每个键(或特征)包含一个大小为3x64(batch_size x embedding_dim)的2D张量。
- en: '[PRE8]'
id: totrans-43
prefs: []
type: TYPE_PRE
zh: '[PRE8]'
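A sketch of the forward pass with that minibatch, assuming the output behaves like a regular `KeyedTensor` whose `to_dict()` yields one `[batch_size, embedding_dim]` tensor per feature:

```python
# Forward pass through the sharded model with the KJT minibatch built above.
pooled = model(mb)

# One [3, 64] tensor per feature key.
per_feature = pooled.to_dict()
print(per_feature["product"].shape)  # torch.Size([3, 64])
print(per_feature["user"].shape)     # torch.Size([3, 64])
```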
- en: More resources
id: totrans-44
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 更多资源
- en: For more information, please see our [dlrm](https://github.com/pytorch/torchrec/tree/main/examples/dlrm)
example, which includes multinode training on the criteo terabyte dataset, using
Meta’s [DLRM](https://arxiv.org/abs/1906.00091).
id: totrans-45
prefs: []
type: TYPE_NORMAL
zh: 有关更多信息,请参阅我们的[dlrm](https://github.com/pytorch/torchrec/tree/main/examples/dlrm)示例,其中包括在criteo
terabyte数据集上进行多节点训练,使用Meta的[DLRM](https://arxiv.org/abs/1906.00091)。
- en: Multimodality
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 多模态