-en:In DDP, the constructor, the forward pass, and the backward pass are distributed
synchronization points. Different processes are expected to launch the same number
of synchronizations and reach these synchronization points in the same order and
...
...
@@ -133,12 +187,16 @@
skewed processing speeds are inevitable due to, e.g., network delays, resource
contention, or unpredictable workload spikes. To avoid timeouts in these situations,
make sure that you pass a sufficiently large `timeout` value when calling [init_process_group](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group).
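For instance, a minimal sketch of passing a generous `timeout` (the value is workload-dependent; `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` are assumed to be set by the launcher):

```python
import datetime
import os

import torch.distributed as dist

# Rank and world size come from the launcher's environment variables here;
# the timeout value below is illustrative, not a recommendation.
dist.init_process_group(
    backend="gloo",
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
    timeout=datetime.timedelta(minutes=30),  # generous value for skewed workers
)
```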
-en:The backend constructors are called [from the Python side](https://github.com/pytorch/pytorch/blob/v1.9.0/torch/distributed/distributed_c10d.py#L643-L650),
so the extension also needs to expose the constructor APIs to Python. This can
be done by adding the following methods. In this example, `store` and `timeout`
are ignored by the `BackendDummy` instantiation method, as those are not used
in this dummy implementation. However, real-world extensions should consider using
the `store` to perform rendezvous and supporting the `timeout` argument.
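Once the constructor is exposed, the Python side can select the new backend by name when initializing the process group. A minimal usage sketch, assuming the extension is built as a module named `dummy_collectives` (a hypothetical name here) whose import registers the `dummy` backend:

```python
import os

import torch
import dummy_collectives  # hypothetical extension module; importing it registers "dummy"
import torch.distributed as dist

os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "29500"

# Select the custom backend by the name it registered.
dist.init_process_group("dummy", rank=0, world_size=1)

x = torch.ones(6)
dist.all_reduce(x)  # dispatched to BackendDummy's allreduce
print(x)
```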
-en:This is the preparation step, which implements `ResNet50` as two model shards.
The code below is borrowed from the [ResNet implementation in torchvision](https://github.com/pytorch/vision/blob/7c077f6a986f05383bcb86b535aedb5a63dd5c4b/torchvision/models/resnet.py#L124).
The `ResNetBase` module contains the common building blocks and attributes for
the two ResNet shards.
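As a hedged, simplified sketch of the sharding idea (the tutorial's real code subclasses `ResNetBase`; here torchvision's stock `resnet50` is sliced instead, and `"cpu"` stands in for the per-shard GPUs):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ResNetShard(nn.Module):
    """One contiguous slice of ResNet50, pinned to its own device."""
    def __init__(self, layers, device):
        super().__init__()
        self.device = device
        self.seq = nn.Sequential(*layers).to(device)

    def forward(self, x):
        return self.seq(x.to(self.device))

children = list(resnet50().children())       # conv1 ... layer4, avgpool, fc
shard1 = ResNetShard(children[:6], "cpu")    # stem + layer1 + layer2
shard2 = ResNetShard(children[6:9], "cpu")   # layer3 + layer4 + avgpool

out = shard2(shard1(torch.randn(1, 3, 224, 224)))  # shape [1, 2048, 1, 1]
```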
-en:This tutorial uses a simple example to demonstrate how you can combine [DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel)
(DDP) with the [Distributed RPC framework](https://pytorch.org/docs/master/rpc.html),
pairing distributed data parallelism with distributed model parallelism to
train a simple model. Source code of the example can be found [here](https://github.com/pytorch/examples/tree/master/distributed/rpc/ddp_rpc).
zh:The previous tutorials, [Getting Started With Distributed Data Parallel](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) and [Getting
Started with Distributed RPC Framework](https://pytorch.org/tutorials/intermediate/rpc_tutorial.html), describe how to perform distributed data parallel and distributed model parallel training, respectively. That said, there are several training paradigms where you might want to combine the two techniques. For example:
-en:If we have a model with a sparse part (large embedding table) and a dense part
(FC layers), we might want to put the embedding table on a parameter server and
replicate the FC layers across multiple trainers using [DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel).
The [Distributed RPC framework](https://pytorch.org/docs/master/rpc.html) can
be used to perform embedding lookups on the parameter server.
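A hedged sketch of this pattern (names and sizes are illustrative, not the example's actual code): the trainer holds an RRef to an `nn.EmbeddingBag` owned by the parameter server and wraps only the dense part in DDP. It assumes the RPC framework and a process group have already been initialized.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class HybridModel(nn.Module):
    """Sparse lookup over RPC, dense compute replicated with DDP."""
    def __init__(self, emb_rref, device):
        super().__init__()
        self.emb_rref = emb_rref  # RRef to an nn.EmbeddingBag on the parameter server
        self.device = device
        # Illustrative sizes: embedding_dim=16 feeding an 8-unit dense layer.
        self.fc = DDP(nn.Linear(16, 8).to(device), device_ids=[device])

    def forward(self, indices, offsets):
        # Remote embedding lookup on the parameter server, then local dense compute.
        emb = self.emb_rref.rpc_sync().forward(indices, offsets)
        return self.fc(emb.to(self.device))
```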
-en:In this tutorial, we will split a Transformer model across two GPUs and use
pipeline parallelism to train the model. In addition to this, we use [Distributed
Data Parallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)
...
...
@@ -93,104 +126,156 @@
are on another. To do this, we pull out the `Encoder` and `Decoder` sections into
separate modules and then build an `nn.Sequential` representing the original Transformer
module.
id:totrans-17
prefs:[]
type:TYPE_NORMAL
zh:In this tutorial, we split a Transformer model across two GPUs and use pipeline parallelism to train the model. On top of that, we use [Distributed Data Parallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) to train two replicas of this pipeline. One process drives a pipeline across GPUs
0 and 1, and another process drives a pipeline across GPUs 2 and 3. The two processes then use Distributed Data Parallel to train the two replicas. The model is exactly the same as the one used in the [Sequence-to-Sequence Modeling with nn.Transformer and TorchText](https://pytorch.org/tutorials/beginner/transformer_tutorial.html) tutorial, but it is split into two stages. Most of the parameters belong to the [nn.TransformerEncoder](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html) layer, which itself consists of `nlayers` [nn.TransformerEncoderLayer](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html) modules. Hence, we focus on `nn.TransformerEncoder` and split the model so that half of the `nn.TransformerEncoderLayer` modules live on one GPU and the other half on the other. To do this, we pull the `Encoder` and `Decoder` sections out into separate modules and then build an `nn.Sequential` representing the original Transformer module.
-en:'[PRE1]'
id:totrans-18
prefs:[]
type:TYPE_PRE
zh:'[PRE1]'
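Condensed, as a hedged sketch of the split described above (the `[PRE1]` block carries the tutorial's full code; sizes here are illustrative and positional encoding is omitted):

```python
import torch.nn as nn

# Illustrative hyperparameters, not the tutorial's actual values.
ntokens, emsize, nhead, nhid, nlayers = 1000, 200, 2, 200, 4

class Encoder(nn.Module):
    """Embedding plus the first half of the nn.TransformerEncoderLayer stack."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(ntokens, emsize)
        layer = nn.TransformerEncoderLayer(emsize, nhead, nhid)
        self.half = nn.TransformerEncoder(layer, nlayers // 2)

    def forward(self, src):
        return self.half(self.embed(src))

class Decoder(nn.Module):
    """Second half of the stack plus the final linear decoder."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(emsize, nhead, nhid)
        self.half = nn.TransformerEncoder(layer, nlayers // 2)
        self.out = nn.Linear(emsize, ntokens)

    def forward(self, x):
        return self.out(self.half(x))

# Each stage is then placed on its own GPU and the two are chained:
model = nn.Sequential(Encoder(), Decoder())
```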
-en:Start multiple processes for training
id:totrans-19
prefs:
-PREF_H2
type:TYPE_NORMAL
zh:Start multiple processes for training
-en:We start two processes where each process drives its own pipeline across two
GPUs.
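A minimal launch sketch, assuming a `run_worker` that builds the pipeline on GPUs `2 * rank` and `2 * rank + 1`, wraps it in DDP, and trains (its body is elided here):

```python
import torch.multiprocessing as mp

def run_worker(rank, world_size):
    # Assumed: init_process_group, build the pipeline on GPUs
    # 2 * rank and 2 * rank + 1, wrap it in DDP, then train.
    ...

if __name__ == "__main__":
    world_size = 2  # two pipeline replicas, four GPUs total
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size, join=True)
```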
-en:To get the most out of this tutorial, we suggest using this [Colab Version](https://colab.research.google.com/github/pytorch/torchrec/blob/main/Torchrec_Introduction.ipynb).
This will allow you to experiment with the information presented below.
-en:When building recommendation systems, we frequently want to represent entities
like products or pages with embeddings. For example, see Meta AI’s [Deep learning
recommendation model](https://arxiv.org/abs/1906.00091), or DLRM. As the number
...
...
@@ -27,162 +39,245 @@
parallelism. To that end, TorchRec introduces its primary API called [`DistributedModelParallel`](https://pytorch.org/torchrec/torchrec.distributed.html#torchrec.distributed.model_parallel.DistributedModelParallel),
or DMP. Like PyTorch’s DistributedDataParallel, DMP wraps a model to enable distributed
training.
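A hedged sketch of wrapping an embedding module with DMP (table sizes are illustrative; a process group is assumed to be initialized already):

```python
import torch
import torchrec

# A single small embedding table, built on the meta device so DMP can
# decide placement when it shards the model.
ebc = torchrec.EmbeddingBagCollection(
    device="meta",
    tables=[
        torchrec.EmbeddingBagConfig(
            name="product_table",
            embedding_dim=64,
            num_embeddings=4096,
            feature_names=["product"],
            pooling=torchrec.PoolingType.SUM,
        )
    ],
)

# DMP shards the tables across ranks and handles the distributed lookups.
model = torchrec.distributed.DistributedModelParallel(
    ebc, device=torch.device("cuda")
)
```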