提交 a60b5fd8 编写于 2024-02-04 14:21:34 作者:绝不原创的飞龙

上级 26c2d4ae
......@@ -3,24 +3,30 @@
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: PyTorch 分布式概述
- en: 原文:[https://pytorch.org/tutorials/beginner/dist_overview.html](https://pytorch.org/tutorials/beginner/dist_overview.html)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
zh: 原文:[https://pytorch.org/tutorials/beginner/dist_overview.html](https://pytorch.org/tutorials/beginner/dist_overview.html)
- en: '**Author**: [Shen Li](https://mrshenli.github.io/)'
id: totrans-2
prefs: []
type: TYPE_NORMAL
zh: '**作者**:[Shen Li](https://mrshenli.github.io/)'
- en: Note
id: totrans-3
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png) View and edit this
tutorial in [github](https://github.com/pytorch/tutorials/blob/main/beginner_source/dist_overview.rst).'
id: totrans-4
prefs: []
type: TYPE_NORMAL
zh: '![edit](../Images/a8aa37bcc5edbf2ba5fcf18dba1e55f9.png) 在 [github](https://github.com/pytorch/tutorials/blob/main/beginner_source/dist_overview.rst) 中查看并编辑本教程。'
- en: This is the overview page for the `torch.distributed` package. The goal of this
page is to categorize documents into different topics and briefly describe each
of them. If this is your first time building distributed training applications
......@@ -29,16 +35,19 @@
id: totrans-5
prefs: []
type: TYPE_NORMAL
zh: 这是 `torch.distributed` 包的概述页面。本页面的目标是将文档分类为不同主题,并简要描述每个主题。如果这是您第一次使用PyTorch构建分布式训练应用程序,建议使用本文档导航到最适合您用例的技术。
- en: Introduction
id: totrans-6
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 介绍
- en: 'As of PyTorch v1.6.0, features in `torch.distributed` can be categorized into
three main components:'
id: totrans-7
prefs: []
type: TYPE_NORMAL
zh: 截至PyTorch v1.6.0,`torch.distributed` 中的功能可以分为三个主要组件:
- en: '[Distributed Data-Parallel Training](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)
(DDP) is a widely adopted single-program multiple-data training paradigm. With
DDP, the model is replicated on every process, and every model replica will be
......@@ -49,6 +58,7 @@
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[分布式数据并行训练](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)(DDP)是一种广泛采用的单程序多数据训练范式。使用DDP,模型在每个进程上被复制,并且每个模型副本将被提供不同的输入数据样本。DDP负责梯度通信以保持模型副本同步,并将其与梯度计算重叠以加快训练速度。'
- en: '[RPC-Based Distributed Training](https://pytorch.org/docs/stable/rpc.html)
(RPC) supports general training structures that cannot fit into data-parallel
training such as distributed pipeline parallelism, parameter server paradigm,
......@@ -59,6 +69,7 @@
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[基于RPC的分布式训练](https://pytorch.org/docs/stable/rpc.html)(RPC)支持那些无法纳入数据并行训练的通用训练结构,例如分布式管道并行、参数服务器范式,以及DDP与其他训练范式的组合。它有助于管理远程对象的生命周期,并将[autograd引擎](https://pytorch.org/docs/stable/autograd.html)扩展到机器边界之外。'
- en: '[Collective Communication](https://pytorch.org/docs/stable/distributed.html)
(c10d) library supports sending tensors across processes within a group. It offers
both collective communication APIs (e.g., [all_reduce](https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_reduce)
......@@ -81,6 +92,9 @@
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[集体通信](https://pytorch.org/docs/stable/distributed.html)(c10d)库支持在组内的进程之间发送张量。它提供了集体通信API(例如,[all_reduce](https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_reduce)和[all_gather](https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_gather))以及P2P通信API(例如,[send](https://pytorch.org/docs/stable/distributed.html#torch.distributed.send)和[isend](https://pytorch.org/docs/stable/distributed.html#torch.distributed.isend))。DDP和RPC([ProcessGroup
Backend](https://pytorch.org/docs/stable/rpc.html#process-group-backend))构建在c10d之上,前者使用集体通信,后者使用P2P通信。通常,开发人员不需要直接使用这个原始通信API,因为DDP和RPC
API可以满足许多分布式训练场景。然而,仍有一些用例可以从这个API中获益。一个例子是分布式参数平均化,应用程序希望在反向传播后计算所有模型参数的平均值,而不是使用DDP来通信梯度。这可以将通信与计算分离,并允许更精细地控制要通信的内容,但另一方面,也放弃了DDP提供的性能优化。[使用PyTorch编写分布式应用程序](../intermediate/dist_tuto.html)展示了使用c10d通信API的示例。'
- en: Data Parallel Training
id: totrans-11
prefs:
......@@ -125,6 +139,7 @@
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 如果应用程序需要跨机器边界扩展,请使用多机器[DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)和[启动脚本](https://github.com/pytorch/examples/blob/master/distributed/ddp/README.md)。
- en: Use multi-GPU [FullyShardedDataParallel](https://pytorch.org/docs/stable/fsdp.html)
training on a single-machine or multi-machine when the data and model cannot fit
on one GPU.
......@@ -132,6 +147,7 @@
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 当数据和模型无法放入一个GPU时,在单机或多机上使用多GPU的[FullyShardedDataParallel](https://pytorch.org/docs/stable/fsdp.html)训练。
- en: Use [torch.distributed.elastic](https://pytorch.org/docs/stable/distributed.elastic.html)
to launch distributed training if errors (e.g., out-of-memory) are expected or
if resources can join and leave dynamically during training.
......@@ -139,19 +155,23 @@
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 如果预期会出现错误(例如内存不足),或者资源在训练过程中可以动态加入和离开,请使用[torch.distributed.elastic](https://pytorch.org/docs/stable/distributed.elastic.html)来启动分布式训练。
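A hedged sketch of the entry point such a launcher expects; `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` are environment variables set by torchrun, and the training loop is omitted:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun / torch.distributed.elastic supply the rendezvous info via env://
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(10, 10).to(local_rank), device_ids=[local_rank])
    # ... training loop; on failure, torchrun can tear down and restart all workers ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # e.g. launched with: torchrun --nproc_per_node=4 train.py
```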
- en: Note
id: totrans-19
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: Data-parallel training also works with [Automatic Mixed Precision (AMP)](https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-multiple-gpus).
id: totrans-20
prefs: []
type: TYPE_NORMAL
zh: 数据并行训练也可以与[Automatic Mixed Precision (AMP)](https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-multiple-gpus)一起使用。
- en: '`torch.nn.DataParallel`'
id: totrans-21
prefs:
- PREF_H3
type: TYPE_NORMAL
zh: '`torch.nn.DataParallel`'
- en: 'The [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html)
package enables single-machine multi-GPU parallelism with the lowest coding hurdle.
It only requires a one-line change to the application code. The tutorial [Optional:
......@@ -163,11 +183,16 @@
id: totrans-22
prefs: []
type: TYPE_NORMAL
zh: '[DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html)
包能够在单机多GPU上实现并行计算,且编码难度最低。只需要在应用代码中进行一行更改。教程 [Optional: Data Parallelism](../beginner/blitz/data_parallel_tutorial.html)
展示了一个例子。虽然 `DataParallel` 很容易使用,但通常性能不是最佳的,因为它在每次前向传播中都会复制模型,并且其单进程多线程并行自然受到 [GIL](https://wiki.python.org/moin/GlobalInterpreterLock)
的影响。为了获得更好的性能,考虑使用 [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)。'
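As a quick illustration of the one-line change mentioned above (a sketch; the toy module and tensor shapes are arbitrary):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 5)                # any existing nn.Module
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)      # the one-line change: replicate across visible GPUs
model = model.to("cuda")

output = model(torch.randn(8, 10, device="cuda"))  # the input batch is scattered across GPUs
```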
- en: '`torch.nn.parallel.DistributedDataParallel`'
id: totrans-23
prefs:
- PREF_H3
type: TYPE_NORMAL
zh: '`torch.nn.parallel.DistributedDataParallel`'
- en: Compared to [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html),
[DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html)
requires one more step to set up, i.e., calling [init_process_group](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group).
......@@ -219,6 +244,8 @@
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[Shard Optimizer States With ZeroRedundancyOptimizer](../recipes/zero_redundancy_optimizer.html)
配方演示了如何使用[ZeroRedundancyOptimizer](https://pytorch.org/docs/stable/distributed.optim.html)来减少优化器的内存占用。'
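A minimal single-machine sketch combining the two pieces discussed above: calling init_process_group before wrapping the model in DDP, and sharding optimizer state with ZeroRedundancyOptimizer. The toy model and spawn scaffolding are illustrative only:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank: int, world_size: int):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)  # the extra setup step
    torch.cuda.set_device(rank)

    model = DDP(torch.nn.Linear(10, 10).to(rank), device_ids=[rank])
    optimizer = ZeroRedundancyOptimizer(
        model.parameters(), optimizer_class=torch.optim.Adam, lr=1e-3)

    out = model(torch.randn(20, 10, device=rank))
    out.sum().backward()   # DDP overlaps gradient all-reduce with the backward pass
    optimizer.step()       # optimizer states are sharded across ranks
    dist.destroy_process_group()

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    mp.spawn(worker, args=(n_gpus,), nprocs=n_gpus)
```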
- en: The [Distributed Training with Uneven Inputs Using the Join Context Manager](../advanced/generic_join.html)
tutorial walks through using the generic join context for distributed training
with uneven inputs.
......@@ -226,11 +253,14 @@
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[Distributed Training with Uneven Inputs Using the Join Context Manager](../advanced/generic_join.html)
教程介绍了如何使用通用的连接上下文管理器进行不均匀输入的分布式训练。'
- en: '`torch.distributed.FullyShardedDataParallel`'
id: totrans-31
prefs:
- PREF_H3
type: TYPE_NORMAL
zh: '`torch.distributed.FullyShardedDataParallel`'
- en: The [FullyShardedDataParallel](https://pytorch.org/docs/stable/fsdp.html) (FSDP)
is a type of data parallelism paradigm which maintains a per-GPU copy of a model’s
parameters, gradients and optimizer states, it shards all of these states across
......@@ -240,11 +270,14 @@
id: totrans-32
prefs: []
type: TYPE_NORMAL
zh: '[FullyShardedDataParallel](https://pytorch.org/docs/stable/fsdp.html)(FSDP)是一种数据并行范例,它在每个GPU上维护模型参数、梯度和优化器状态的副本,将所有这些状态分片到数据并行工作器中。对FSDP的支持从PyTorch
v1.11开始添加。教程[Getting Started with FSDP](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html)提供了关于FSDP如何工作的深入解释和示例。'
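A hedged sketch of wrapping a model with FSDP (available since PyTorch v1.11); it assumes the process group is already initialized and each rank owns one GPU:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# assumes dist.init_process_group(...) and torch.cuda.set_device(rank) have already run
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).cuda()

sharded_model = FSDP(model)  # parameters, gradients and optimizer state get sharded across ranks
optimizer = torch.optim.Adam(sharded_model.parameters(), lr=1e-3)
```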
- en: torch.distributed.elastic
id: totrans-33
prefs:
- PREF_H3
type: TYPE_NORMAL
zh: torch.distributed.elastic
- en: With the growth of the application complexity and scale, failure recovery becomes
a requirement. Sometimes it is inevitable to hit errors like out-of-memory (OOM)
when using DDP, but DDP itself cannot recover from those errors, and it is not
......@@ -258,11 +291,14 @@
id: totrans-34
prefs: []
type: TYPE_NORMAL
zh: 随着应用程序复杂性和规模的增长,故障恢复变得必不可少。在使用DDP时,有时会不可避免地遇到诸如内存溢出(OOM)等错误,但DDP本身无法从这些错误中恢复,也无法使用标准的`try-except`结构来处理它们。这是因为DDP要求所有进程以密切同步的方式运行,并且在不同进程中启动的所有`AllReduce`通信必须匹配。如果组中的一个进程抛出异常,很可能会导致不同步(不匹配的`AllReduce`操作),从而导致崩溃或挂起。[torch.distributed.elastic](https://pytorch.org/docs/stable/distributed.elastic.html)
添加了容错性和利用动态机器池(弹性)的能力。
- en: RPC-Based Distributed Training
id: totrans-35
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 基于RPC的分布式训练
- en: Many training paradigms do not fit into data parallelism, e.g., parameter server
paradigm, distributed pipeline parallelism, reinforcement learning applications
with multiple observers or agents, etc. [torch.distributed.rpc](https://pytorch.org/docs/stable/rpc.html)
......@@ -270,17 +306,21 @@
id: totrans-36
prefs: []
type: TYPE_NORMAL
zh: 许多训练范式并不适合数据并行,例如参数服务器范式、分布式管道并行、具有多个观察者或代理的强化学习应用等。[torch.distributed.rpc](https://pytorch.org/docs/stable/rpc.html) 旨在支持一般的分布式训练场景。
- en: '[torch.distributed.rpc](https://pytorch.org/docs/stable/rpc.html) has four
main pillars:'
id: totrans-37
prefs: []
type: TYPE_NORMAL
zh: '[torch.distributed.rpc](https://pytorch.org/docs/stable/rpc.html) 有四个主要支柱:'
- en: '[RPC](https://pytorch.org/docs/stable/rpc.html#rpc) supports running a given
function on a remote worker.'
id: totrans-38
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[RPC](https://pytorch.org/docs/stable/rpc.html#rpc) 支持在远程工作节点上运行给定函数。'
- en: '[RRef](https://pytorch.org/docs/stable/rpc.html#rref) helps to manage the lifetime
of a remote object. The reference counting protocol is presented in the [RRef
notes](https://pytorch.org/docs/stable/rpc/rref.html#remote-reference-protocol).'
......@@ -288,6 +328,7 @@
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[RRef](https://pytorch.org/docs/stable/rpc.html#rref) 帮助管理远程对象的生命周期。引用计数协议在[RRef笔记](https://pytorch.org/docs/stable/rpc/rref.html#remote-reference-protocol)中介绍。'
- en: '[Distributed Autograd](https://pytorch.org/docs/stable/rpc.html#distributed-autograd-framework)
extends the autograd engine beyond machine boundaries. Please refer to [Distributed
Autograd Design](https://pytorch.org/docs/stable/rpc/distributed_autograd.html#distributed-autograd-design)
......@@ -296,6 +337,8 @@
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[Distributed Autograd](https://pytorch.org/docs/stable/rpc.html#distributed-autograd-framework)
将自动求导引擎扩展到机器边界之外。更多细节请参考[Distributed Autograd Design](https://pytorch.org/docs/stable/rpc/distributed_autograd.html#distributed-autograd-design)。'
- en: '[Distributed Optimizer](https://pytorch.org/docs/stable/rpc.html#module-torch.distributed.optim)
automatically reaches out to all participating workers to update parameters using
gradients computed by the distributed autograd engine.'
......@@ -303,10 +346,13 @@
prefs:
- PREF_UL
type: TYPE_NORMAL
zh: '[Distributed Optimizer](https://pytorch.org/docs/stable/rpc.html#module-torch.distributed.optim)
自动联系所有参与的工作节点,使用分布式自动求导引擎计算的梯度来更新参数。'
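A hedged two-worker sketch touching the first two pillars (`rpc_sync` for remote execution, `remote` for an RRef); `add` and the spawn scaffolding are illustrative, while `init_rpc`, `rpc_sync`, `remote`, and `to_here` are the real torch.distributed.rpc APIs:

```python
import os
import torch
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

def add(x, y):
    return x + y

def run(rank: int, world_size: int):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    if rank == 0:
        # RPC: run a function on a remote worker and wait for the result.
        out = rpc.rpc_sync("worker1", add, args=(torch.ones(2), torch.ones(2)))
        # RRef: keep a reference to the remote result instead of copying it back.
        rref = rpc.remote("worker1", add, args=(torch.ones(2), torch.ones(2)))
        print(out, rref.to_here())
    rpc.shutdown()  # blocks until all workers are done

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
```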
- en: 'RPC Tutorials are listed below:'
id: totrans-42
prefs: []
type: TYPE_NORMAL
zh: RPC教程如下:
- en: The [Getting Started with Distributed RPC Framework](../intermediate/rpc_tutorial.html)
tutorial first uses a simple Reinforcement Learning (RL) example to demonstrate
RPC and RRef. Then, it applies a basic distributed model parallelism to an RNN
......@@ -315,6 +361,7 @@
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[使用分布式RPC框架入门](../intermediate/rpc_tutorial.html) 教程首先使用一个简单的强化学习(RL)示例来演示RPC和RRef。然后,它将基本的分布式模型并行应用到一个RNN示例中,展示如何使用分布式自动求导和分布式优化器。'
- en: The [Implementing a Parameter Server Using Distributed RPC Framework](../intermediate/rpc_param_server_tutorial.html)
tutorial borrows the spirit of [HogWild! training](https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf)
and applies it to an asynchronous parameter server (PS) training application.
......@@ -322,6 +369,7 @@
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[使用分布式RPC框架实现参数服务器](../intermediate/rpc_param_server_tutorial.html) 教程借鉴了[HogWild!训练](https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf)的精神,并将其应用于异步参数服务器(PS)训练应用。'
- en: The [Distributed Pipeline Parallelism Using RPC](../intermediate/dist_pipeline_parallel_tutorial.html)
tutorial extends the single-machine pipeline parallel example (presented in [Single-Machine
Model Parallel Best Practices](../intermediate/model_parallel_tutorial.html))
......@@ -330,6 +378,7 @@
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[使用RPC实现分布式管道并行性](../intermediate/dist_pipeline_parallel_tutorial.html) 教程将单机管道并行示例(在[单机模型并行最佳实践](../intermediate/model_parallel_tutorial.html)中介绍)扩展到分布式环境,并展示如何使用RPC实现它。'
- en: The [Implementing Batch RPC Processing Using Asynchronous Executions](../intermediate/rpc_async_execution.html)
tutorial demonstrates how to implement RPC batch processing using the [@rpc.functions.async_execution](https://pytorch.org/docs/stable/rpc.html#torch.distributed.rpc.functions.async_execution)
decorator, which can help speed up inference and training. It uses RL and PS examples
......@@ -338,6 +387,7 @@
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[使用异步执行实现批量RPC处理的教程](../intermediate/rpc_async_execution.html)演示了如何使用[@rpc.functions.async_execution](https://pytorch.org/docs/stable/rpc.html#torch.distributed.rpc.functions.async_execution)装饰器实现RPC批处理,这可以帮助加速推理和训练。它使用了类似于上述教程1和2中的RL和PS示例。'
- en: The [Combining Distributed DataParallel with Distributed RPC Framework](../advanced/rpc_ddp_tutorial.html)
tutorial demonstrates how to combine DDP with RPC to train a model using distributed
data parallelism combined with distributed model parallelism.
......@@ -345,13 +395,16 @@
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[将分布式数据并行与分布式RPC框架相结合的教程](../advanced/rpc_ddp_tutorial.html)演示了如何将DDP与RPC结合起来,使用分布式数据并行性和分布式模型并行性来训练模型。'
- en: PyTorch Distributed Developers
id: totrans-48
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: PyTorch分布式开发者
- en: If you’d like to contribute to PyTorch Distributed, please refer to our [Developer
Guide](https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md).
id: totrans-49
prefs: []
type: TYPE_NORMAL
zh: 如果您想为PyTorch分布式做出贡献,请参考我们的[开发者指南](https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md)。
- en: Distributed Data Parallel in PyTorch - Video Tutorials
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: PyTorch中的分布式数据并行 - 视频教程
- en: 原文:[https://pytorch.org/tutorials/beginner/ddp_series_intro.html](https://pytorch.org/tutorials/beginner/ddp_series_intro.html)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
zh: 原文:[https://pytorch.org/tutorials/beginner/ddp_series_intro.html](https://pytorch.org/tutorials/beginner/ddp_series_intro.html)
- en: '**Introduction** || [What is DDP](ddp_series_theory.html) || [Single-Node Multi-GPU
Training](ddp_series_multigpu.html) || [Fault Tolerance](ddp_series_fault_tolerance.html)
|| [Multi-Node training](../intermediate/ddp_series_multinode.html) || [minGPT
Training](../intermediate/ddp_series_minGPT.html)'
id: totrans-2
prefs: []
type: TYPE_NORMAL
zh: '**介绍** || [什么是DDP](ddp_series_theory.html) || [单节点多GPU训练](ddp_series_multigpu.html)
|| [容错性](ddp_series_fault_tolerance.html) || [多节点训练](../intermediate/ddp_series_multinode.html)
|| [minGPT训练](../intermediate/ddp_series_minGPT.html)'
- en: 'Authors: [Suraj Subramanian](https://github.com/suraj813)'
id: totrans-3
prefs: []
type: TYPE_NORMAL
zh: 作者:[Suraj Subramanian](https://github.com/suraj813)
- en: Follow along with the video below or on [youtube](https://www.youtube.com/watch/-K3bZYHYHEA).
id: totrans-4
prefs: []
type: TYPE_NORMAL
zh: 跟随下面的视频或在[youtube](https://www.youtube.com/watch/-K3bZYHYHEA)上观看。
- en: '[https://www.youtube.com/embed/-K3bZYHYHEA](https://www.youtube.com/embed/-K3bZYHYHEA)'
id: totrans-5
prefs: []
type: TYPE_NORMAL
zh: '[https://www.youtube.com/embed/-K3bZYHYHEA](https://www.youtube.com/embed/-K3bZYHYHEA)'
- en: This series of video tutorials walks you through distributed training in PyTorch
via DDP.
id: totrans-6
prefs: []
type: TYPE_NORMAL
zh: 这一系列视频教程将带您了解通过DDP在PyTorch中进行分布式训练。
- en: The series starts with a simple non-distributed training job, and ends with
deploying a training job across several machines in a cluster. Along the way,
you will also learn about [torchrun](https://pytorch.org/docs/stable/elastic/run.html)
for fault-tolerant distributed training.
id: totrans-7
prefs: []
type: TYPE_NORMAL
zh: 该系列从一个简单的非分布式训练作业开始,最终扩展为在集群中多台机器上部署训练作业。在此过程中,您还将了解用于容错分布式训练的[torchrun](https://pytorch.org/docs/stable/elastic/run.html)。
- en: The tutorial assumes a basic familiarity with model training in PyTorch.
id: totrans-8
prefs: []
type: TYPE_NORMAL
zh: 本教程假定您对PyTorch中的模型训练有基本的了解。
- en: Running the code
id: totrans-9
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 运行代码
- en: You will need multiple CUDA GPUs to run the tutorial code. Typically, this can
be done on a cloud instance with multiple GPUs (the tutorials use an Amazon EC2
P3 instance with 4 GPUs).
id: totrans-10
prefs: []
type: TYPE_NORMAL
zh: 您需要多个CUDA GPU来运行教程代码。通常可以在具有多个GPU的云实例上完成此操作(教程使用具有4个GPU的Amazon EC2 P3实例)。
- en: The tutorial code is hosted in this [github repo](https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series).
Clone the repository and follow along!
id: totrans-11
prefs: []
type: TYPE_NORMAL
zh: 教程代码托管在这个[github仓库](https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series)。克隆该仓库并跟随教程!
- en: Tutorial sections
id: totrans-12
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 教程部分
- en: Introduction (this page)
id: totrans-13
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: 介绍(本页)
- en: '[What is DDP?](ddp_series_theory.html) Gently introduces what DDP is doing
under the hood'
id: totrans-14
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[DDP是什么?](ddp_series_theory.html) 浅显地介绍DDP的底层工作原理'
- en: '[Single-Node Multi-GPU Training](ddp_series_multigpu.html) Training models
using multiple GPUs on a single machine'
id: totrans-15
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[单节点多GPU训练](ddp_series_multigpu.html) 在单台机器上使用多个GPU训练模型'
- en: '[Fault-tolerant distributed training](ddp_series_fault_tolerance.html) Making
your distributed training job robust with torchrun'
id: totrans-16
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[容错分布式训练](ddp_series_fault_tolerance.html) 使用torchrun使您的分布式训练工作更加稳健'
- en: '[Multi-Node training](../intermediate/ddp_series_multinode.html) Training models
using multiple GPUs on multiple machines'
id: totrans-17
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[多节点训练](../intermediate/ddp_series_multinode.html) 使用多台机器上的多个GPU进行模型训练'
- en: '[Training a GPT model with DDP](../intermediate/ddp_series_minGPT.html) “Real-world”
example of training a [minGPT](https://github.com/karpathy/minGPT) model with
DDP'
id: totrans-18
prefs:
- PREF_OL
type: TYPE_NORMAL
zh: '[使用DDP训练GPT模型](../intermediate/ddp_series_minGPT.html) 使用DDP训练[minGPT](https://github.com/karpathy/minGPT)模型的“真实世界”示例'
- en: Single-Machine Model Parallel Best Practices
id: totrans-0
prefs:
- PREF_H1
type: TYPE_NORMAL
zh: 单机模型并行最佳实践
- en: 原文:[https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html](https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html)
id: totrans-1
prefs:
- PREF_BQ
type: TYPE_NORMAL
zh: 原文:[https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html](https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html)
- en: Note
id: totrans-2
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: Click [here](#sphx-glr-download-intermediate-model-parallel-tutorial-py) to
download the full example code
id: totrans-3
prefs: []
type: TYPE_NORMAL
zh: 点击[这里](#sphx-glr-download-intermediate-model-parallel-tutorial-py)下载完整示例代码
- en: '**Author**: [Shen Li](https://mrshenli.github.io/)'
id: totrans-4
prefs: []
type: TYPE_NORMAL
zh: '**作者**:[Shen Li](https://mrshenli.github.io/)'
- en: 'Model parallel is widely-used in distributed training techniques. Previous
posts have explained how to use [DataParallel](https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html)
to train a neural network on multiple GPUs; this feature replicates the same model
......@@ -27,8 +37,10 @@
the entire model on each GPU (to be concrete, say a model `m` contains 10 layers:
when using `DataParallel`, each GPU will have a replica of each of these 10 layers,
whereas when using model parallel on two GPUs, each GPU could host 5 layers).'
id: totrans-5
prefs: []
type: TYPE_NORMAL
zh: 模型并行在分布式训练技术中被广泛使用。先前的文章已经解释了如何使用[DataParallel](https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html)在多个GPU上训练神经网络;该功能将同一模型复制到所有GPU上,每个GPU处理输入数据的不同分区。虽然它可以显著加速训练过程,但当模型太大而无法放入单个GPU时,它就无法工作了。本文展示了如何使用**模型并行**来解决这个问题:与`DataParallel`不同,它将单个模型切分到不同的GPU上,而不是在每个GPU上复制整个模型(具体来说,假设模型`m`包含10层:使用`DataParallel`时,每个GPU都会持有这10层的副本;而在两个GPU上使用模型并行时,每个GPU可以只承载5层)。
- en: The high-level idea of model parallel is to place different sub-networks of
a model onto different devices, and implement the `forward` method accordingly
to move intermediate outputs across devices. As only part of a model operates
......@@ -36,84 +48,118 @@
In this post, we will not try to construct huge models and squeeze them into a
limited number of GPUs. Instead, this post focuses on showing the idea of model
parallel. It is up to the readers to apply the ideas to real-world applications.
id: totrans-6
prefs: []
type: TYPE_NORMAL
zh: 模型并行的高级思想是将模型的不同子网络放置到不同的设备上,并相应地实现`forward`方法,在设备之间传递中间输出。由于任何单个设备上只运行模型的一部分,一组设备可以共同服务于一个更大的模型。在本文中,我们不会尝试构建庞大的模型并将其塞进有限数量的GPU中;相反,本文侧重于展示模型并行的思想,读者可以自行将这些思想应用到实际应用中。
- en: Note
id: totrans-7
prefs: []
type: TYPE_NORMAL
zh: 注意
- en: For distributed model parallel training where a model spans multiple servers,
please refer to [Getting Started With Distributed RPC Framework](rpc_tutorial.html)
for examples and details.
id: totrans-8
prefs: []
type: TYPE_NORMAL
zh: 对于跨多个服务器的分布式模型并行训练,请参考[使用分布式RPC框架入门](rpc_tutorial.html)以获取示例和详细信息。
- en: Basic Usage
id: totrans-9
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 基本用法
- en: Let us start with a toy model that contains two linear layers. To run this model
on two GPUs, simply put each linear layer on a different GPU, and move inputs
and intermediate outputs to match the layer devices accordingly.
id: totrans-10
prefs: []
type: TYPE_NORMAL
zh: 让我们从一个包含两个线性层的玩具模型开始。要在两个GPU上运行这个模型,只需将每个线性层放在不同的GPU上,并将输入和中间输出移动到匹配层设备的位置。
- en: '[PRE0]'
id: totrans-11
prefs: []
type: TYPE_PRE
zh: '[PRE0]'
- en: Note that, the above `ToyModel` looks very similar to how one would implement
it on a single GPU, except the four `to(device)` calls which place linear layers
and tensors on proper devices. That is the only place in the model that requires
changes. The `backward()` and `torch.optim` will automatically take care of gradients
as if the model is on one GPU. You only need to make sure that the labels are
on the same device as the outputs when calling the loss function.
id: totrans-12
prefs: []
type: TYPE_NORMAL
zh: 请注意,上面的`ToyModel`看起来与在单个GPU上实现它的方式非常相似,除了四个`to(device)`调用,这些调用将线性层和张量放置在适当的设备上。这是模型中唯一需要更改的地方。`backward()`和`torch.optim`将自动处理梯度,就好像模型在一个GPU上一样。您只需要确保在调用损失函数时标签与输出在同一设备上。
- en: '[PRE1]'
id: totrans-13
prefs: []
type: TYPE_PRE
zh: '[PRE1]'
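Since the tutorial's own listings are elided above as [PRE0] and [PRE1], here is a hedged reconstruction of the pattern they describe (two linear layers on cuda:0 and cuda:1, labels moved to the output device); treat it as an illustration rather than the exact tutorial code:

```python
import torch
import torch.nn as nn
import torch.optim as optim

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(10, 10).to("cuda:0")   # first layer lives on GPU 0
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5).to("cuda:1")    # second layer lives on GPU 1

    def forward(self, x):
        x = self.relu(self.net1(x.to("cuda:0")))
        return self.net2(x.to("cuda:1"))             # move intermediate output across GPUs

model = ToyModel()
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

optimizer.zero_grad()
outputs = model(torch.randn(20, 10))
labels = torch.randn(20, 5).to("cuda:1")             # labels on the same device as the outputs
loss_fn(outputs, labels).backward()
optimizer.step()
```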
- en: Apply Model Parallel to Existing Modules
id: totrans-14
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 将模型并行应用于现有模块
- en: It is also possible to run an existing single-GPU module on multiple GPUs with
just a few lines of changes. The code below shows how to decompose `torchvision.models.resnet50()`
to two GPUs. The idea is to inherit from the existing `ResNet` module, and split
the layers to two GPUs during construction. Then, override the `forward` method
to stitch two sub-networks by moving the intermediate outputs accordingly.
id: totrans-15
prefs: []
type: TYPE_NORMAL
zh: 也可以通过只更改几行代码在多个GPU上运行现有的单GPU模块。下面的代码显示了如何将`torchvision.models.resnet50()`分解为两个GPU。思路是继承现有的`ResNet`模块,并在构造过程中将层分割到两个GPU上。然后,重写`forward`方法,通过相应地移动中间输出来拼接两个子网络。
- en: '[PRE2]'
id: totrans-16
prefs: []
type: TYPE_PRE
zh: '[PRE2]'
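With the tutorial's listing elided as [PRE2], a hedged sketch of the splitting idea follows: inherit from torchvision's ResNet, put the first half of the layers on cuda:0 and the rest on cuda:1, and stitch them in `forward` (an illustration, not necessarily the exact tutorial code):

```python
import torch.nn as nn
from torchvision.models.resnet import Bottleneck, ResNet

class ModelParallelResNet50(ResNet):
    def __init__(self, num_classes=1000):
        super().__init__(Bottleneck, [3, 4, 6, 3], num_classes=num_classes)

        self.seq1 = nn.Sequential(
            self.conv1, self.bn1, self.relu, self.maxpool,
            self.layer1, self.layer2,
        ).to("cuda:0")                                # first half on GPU 0

        self.seq2 = nn.Sequential(
            self.layer3, self.layer4, self.avgpool,
        ).to("cuda:1")                                # second half on GPU 1
        self.fc.to("cuda:1")

    def forward(self, x):
        x = self.seq2(self.seq1(x).to("cuda:1"))      # copy intermediate output across GPUs
        return self.fc(x.view(x.size(0), -1))
```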
- en: The above implementation solves the problem for cases where the model is too
large to fit into a single GPU. However, you might have already noticed that it
will be slower than running it on a single GPU if your model fits. It is because,
at any point in time, only one of the two GPUs are working, while the other one
is sitting there doing nothing. The performance further deteriorates as the intermediate
outputs need to be copied from `cuda:0` to `cuda:1` between `layer2` and `layer3`.
id: totrans-17
prefs: []
type: TYPE_NORMAL
zh: 上述实现解决了模型过大无法适应单个GPU的情况。然而,您可能已经注意到,如果您的模型适合单个GPU,则运行速度会比在单个GPU上运行要慢。这是因为,在任何时候,只有两个GPU中的一个在工作,而另一个则闲置。性能进一步恶化,因为需要在`layer2`和`layer3`之间将中间输出从`cuda:0`复制到`cuda:1`。
- en: Let us run an experiment to get a more quantitative view of the execution time.
In this experiment, we train `ModelParallelResNet50` and the existing `torchvision.models.resnet50()`
by running random inputs and labels through them. After the training, the models
will not produce any useful predictions, but we can get a reasonable understanding
of the execution times.
id: totrans-18
prefs: []
type: TYPE_NORMAL
zh: 让我们进行一个实验,以更量化地了解执行时间。在这个实验中,我们通过将随机输入和标签传递给它们来训练`ModelParallelResNet50`和现有的`torchvision.models.resnet50()`。训练之后,模型将不会产生任何有用的预测,但我们可以对执行时间有一个合理的了解。
- en: '[PRE3]'
id: totrans-19
prefs: []
type: TYPE_PRE
zh: '[PRE3]'
- en: The `train(model)` method above uses `nn.MSELoss` as the loss function, and
`optim.SGD` as the optimizer. It mimics training on `128 X 128` images which are
organized into 3 batches where each batch contains 120 images. Then, we use `timeit`
to run the `train(model)` method 10 times and plot the execution times with standard
deviations.
id: totrans-20
prefs: []
type: TYPE_NORMAL
zh: '上面的`train(model)`方法使用`nn.MSELoss`作为损失函数,使用`optim.SGD`作为优化器。它模拟对`128 X 128`图像进行训练,这些图像被组织成3个批次,每个批次包含120张图像。然后,我们使用`timeit`运行`train(model)`方法10次,并绘制带有标准偏差的执行时间。'
- en: '[PRE4]'
id: totrans-21
prefs: []
type: TYPE_PRE
zh: '[PRE4]'
- en: '![](../Images/7f2d776cf49fcf3fd44fd84a238a3cc6.png)'
id: totrans-22
prefs: []
type: TYPE_IMG
zh: '![](../Images/7f2d776cf49fcf3fd44fd84a238a3cc6.png)'
- en: The result shows that the execution time of model parallel implementation is
`4.02/3.75-1=7%` longer than the existing single-GPU implementation. So we can
conclude there is roughly 7% overhead in copying tensors back and forth across
......@@ -122,20 +168,28 @@
into a pipeline of splits, such that when one split reaches the second sub-network,
the following split can be fed into the first sub-network. In this way, two consecutive
splits can run concurrently on two GPUs.
id: totrans-23
prefs: []
type: TYPE_NORMAL
zh: 结果显示,模型并行实现的执行时间比现有的单GPU实现长了`4.02/3.75-1=7%`。因此,我们可以得出结论,在跨GPU传输张量时大约有7%的开销。还有改进的空间,因为我们知道两个GPU中的一个在整个执行过程中处于空闲状态。一种选择是将每个批次进一步分成一系列分割的管道,这样当一个分割到达第二个子网络时,接下来的分割可以被送入第一个子网络。这样,两个连续的分割可以在两个GPU上同时运行。
- en: Speed Up by Pipelining Inputs
id: totrans-24
prefs:
- PREF_H2
type: TYPE_NORMAL
zh: 通过流水线输入加速
- en: In the following experiments, we further divide each 120-image batch into 20-image
splits. As PyTorch launches CUDA operations asynchronously, the implementation
does not need to spawn multiple threads to achieve concurrency.
id: totrans-25
prefs: []
type: TYPE_NORMAL
zh: 在以下实验中,我们将每个120张图像批次进一步分成20张图像的拆分。由于PyTorch异步启动CUDA操作,实现不需要生成多个线程来实现并发。
- en: '[PRE5]'
id: totrans-26
prefs: []
type: TYPE_PRE
zh: '[PRE5]'
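Since the pipelined listing is elided as [PRE5], here is a hedged sketch of the idea: feed micro-batches so that cuda:1 works on split i while cuda:0 already processes split i+1 (asynchronous CUDA launches provide the overlap); `pipelined_forward`, `stage0`, and `stage1` are illustrative names:

```python
import torch

def pipelined_forward(stage0, stage1, x, split_size):
    """stage0 lives on cuda:0, stage1 on cuda:1; x is split along the batch dimension."""
    splits = iter(x.split(split_size, dim=0))
    s_next = next(splits)
    s_prev = stage0(s_next.to("cuda:0")).to("cuda:1")      # prime the pipeline
    outputs = []
    for s_next in splits:
        outputs.append(stage1(s_prev))                     # runs on cuda:1 ...
        s_prev = stage0(s_next.to("cuda:0")).to("cuda:1")  # ... while cuda:0 handles the next split
    outputs.append(stage1(s_prev))
    return torch.cat(outputs)
```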
- en: Please note, device-to-device tensor copy operations are synchronized on current
streams on the source and the destination devices. If you create multiple streams,
you have to make sure that copy operations are properly synchronized. Writing
......@@ -143,11 +197,15 @@
copy operation can lead to undefined behavior. The above implementation only uses
default streams on both source and destination devices, hence it is not necessary
to enforce additional synchronizations.
id: totrans-27
prefs: []
type: TYPE_NORMAL
zh: 请注意,设备之间的张量复制操作在源设备和目标设备上的当前流上是同步的。如果您创建多个流,您必须确保复制操作得到适当的同步。在完成复制操作之前写入源张量或读取/写入目标张量可能导致未定义的行为。上述实现仅在源设备和目标设备上使用默认流,因此不需要强制执行额外的同步。
- en: '![](../Images/48d2e67f025b05eeb9259e249566add3.png)'
id: totrans-28
prefs: []
type: TYPE_IMG
zh: '![](../Images/48d2e67f025b05eeb9259e249566add3.png)'
- en: The experiment result shows that, pipelining inputs to model parallel ResNet50
speeds up the training process by roughly `3.75/2.51-1=49%`. It is still quite
far away from the ideal 100% speedup. As we have introduced a new parameter `split_sizes`
......@@ -157,12 +215,16 @@
long idle times during the first and last splits. Neither are optimal. There might
be an optimal `split_size` configuration for this specific experiment. Let us
try to find it by running experiments using several different `split_size` values.
id: totrans-29
prefs: []
type: TYPE_NORMAL
- en: '[PRE6]'
id: totrans-30
prefs: []
type: TYPE_PRE
zh: '[PRE6]'
- en: '![](../Images/9d53a7aba4b9016ea39aa794905ee059.png)'
id: totrans-31
prefs: []
type: TYPE_IMG
zh: '![](../Images/9d53a7aba4b9016ea39aa794905ee059.png)'
- en: The result shows that setting `split_size` to 12 achieves the fastest training
......@@ -175,27 +237,41 @@
both GPUs, and different sub-network structures require different stream management
strategies. As no general multi-stream solution works for all model parallel use
cases, we will not discuss it in this tutorial.
id: totrans-32
prefs: []
type: TYPE_NORMAL
zh: 结果显示,将`split_size`设置为12可以实现最快的训练速度,从而导致`3.75/2.43-1=54%`的加速。仍然有机会进一步加快训练过程。例如,所有在`cuda:0`上的操作都放在其默认流中。这意味着下一个分割的计算不能与`prev`分割的复制操作重叠。然而,由于`prev`和下一个分割是不同的张量,因此可以将一个的计算与另一个的复制重叠。实现需要在两个GPU上使用多个流,不同的子网络结构需要不同的流管理策略。由于没有通用的多流解决方案适用于所有模型并行使用情况,我们在本教程中不会讨论这个问题。
- en: '**Note:**'
id: totrans-33
prefs: []
type: TYPE_NORMAL
zh: '**注意:**'
- en: This post shows several performance measurements. You might see different numbers
when running the same code on your own machine, because the result depends on
the underlying hardware and software. To get the best performance for your environment,
a proper approach is to first generate the curve to figure out the best split
size, and then use that split size to pipeline inputs.
id: totrans-34
prefs: []
type: TYPE_NORMAL
zh: 本文展示了几个性能测量。当在您自己的机器上运行相同的代码时,您可能会看到不同的数字,因为结果取决于底层硬件和软件。为了在您的环境中获得最佳性能,一个正确的方法是首先生成曲线以找出最佳的拆分大小,然后使用该拆分大小来流水线输入。
- en: '**Total running time of the script:** ( 5 minutes 48.653 seconds)'
id: totrans-35
prefs: []
type: TYPE_NORMAL
zh: '**脚本的总运行时间:**(5分钟48.653秒)'
- en: '[`Download Python source code: model_parallel_tutorial.py`](../_downloads/84ab670fda2216116ac8e3ecd5805f0b/model_parallel_tutorial.py)'
id: totrans-36
prefs: []
type: TYPE_NORMAL
zh: '[`下载Python源代码:model_parallel_tutorial.py`](../_downloads/84ab670fda2216116ac8e3ecd5805f0b/model_parallel_tutorial.py)'
- en: '[`Download Jupyter notebook: model_parallel_tutorial.ipynb`](../_downloads/03a48646520c277662581e858e680809/model_parallel_tutorial.ipynb)'
id: totrans-37
prefs: []
type: TYPE_NORMAL
zh: '[`下载Jupyter笔记本:model_parallel_tutorial.ipynb`](../_downloads/03a48646520c277662581e858e680809/model_parallel_tutorial.ipynb)'
- en: '[Gallery generated by Sphinx-Gallery](https://sphinx-gallery.github.io)'
id: totrans-38
prefs: []
type: TYPE_NORMAL
zh: '[由Sphinx-Gallery生成的图库](https://sphinx-gallery.github.io)'