# 张量并行 - torch.distributed.tensor.parallel

> 原文：[`pytorch.org/docs/stable/distributed.tensor.parallel.html`](https://pytorch.org/docs/stable/distributed.tensor.parallel.html)

张量并行（TP）建立在 PyTorch 分布式张量（[DTensor](https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/README.md)）之上，并提供不同的并行化样式：列并行和行并行。

警告

张量并行 API 是实验性的，可能会发生变化。

使用张量并行并行化您的`nn.Module`的入口点是：

```py
torch.distributed.tensor.parallel.parallelize_module(module, device_mesh, parallelize_plan, tp_mesh_dim=0)¶
```

通过根据用户指定的计划并行化模块或子模块来应用 PyTorch 中的张量并行。

我们根据并行化计划并行化模块或子模块。并行化计划包含`ParallelStyle`，指示用户希望如何并行化模块或子模块。

用户还可以根据模块的完全限定名称（FQN）指定不同的并行样式。

请注意，`parallelize_module`仅接受 1-D 的`DeviceMesh`，如果您有 2-D 或 N-D 的`DeviceMesh`，请先将 DeviceMesh 切片为 1-D 子 DeviceMesh，然后将其传递给此 API（即`device_mesh["tp"]`）

参数

+   **module**（`nn.Module`）- 要并行化的模块。

+   **device_mesh**（`DeviceMesh`）- 描述 DTensor 设备网格拓扑的对象。

+   **parallelize_plan**（Union[`ParallelStyle`, Dict[str, `ParallelStyle`]]) – 用于并行化模块的计划。可以是一个包含我们如何为张量并行准备输入/输出的`ParallelStyle`对象，也可以是模块 FQN 及其对应的`ParallelStyle`对象的字典。

+   **tp_mesh_dim**（[*int*](https://docs.python.org/3/library/functions.html#int "(in Python v3.12)")*,* *已弃用*）- 在其中执行张量并行的`device_mesh`的维度，此字段已弃用，并将在将来删除。如果您有一个 2-D 或 N-D 的`DeviceMesh`，请考虑传递 device_mesh[“tp”]

返回

一个`nn.Module`对象并行化。

返回类型

*Module*

示例::

```py
>>> from torch.distributed.tensor.parallel import parallelize_module, ColwiseParallel
>>> from torch.distributed.device_mesh import init_device_mesh
>>>
>>> # Define the module.
>>> m = Model(...)
>>> tp_mesh = init_device_mesh("cuda", (8,))
>>> m = parallelize_module(m, tp_mesh, {"w1": ColwiseParallel(), "w2": RowwiseParallel()})
>>> 
```

注意

对于像 Attention、MLP 层这样的复杂模块架构，我们建议将不同的 ParallelStyles 组合在一起（即`ColwiseParallel`和`RowwiseParallel`），并将其作为并行化计划传递，以实现所需的分片计算。

张量并行支持以下并行样式：

```py
class torch.distributed.tensor.parallel.ColwiseParallel(*, input_layouts=None, output_layouts=None, use_local_output=True)¶
```

以行方式对兼容的 nn.Module 进行分区。目前支持 nn.Linear 和 nn.Embedding。用户可以将其与 RowwiseParallel 组合在一起，以实现更复杂模块的分片（即 MLP、Attention）

关键字参数

+   **input_layouts**（*Placement**,* *可选*）- 用于 nn.Module 的输入张量的 DTensor 布局，用于注释输入张量以成为 DTensor。如果未指定，则我们假定输入张量是复制的。

+   **output_layouts**（*Placement**,* *可选*）- 用于 nn.Module 输出的 DTensor 布局，用于确保 nn.Module 的输出具有用户期望的布局。如果未指定，则输出张量在最后一个维度上进行分片。

+   **use_local_output**（[*bool*](https://docs.python.org/3/library/functions.html#bool "(in Python v3.12)")*,* *可选*）- 是否使用本地`torch.Tensor`而不是`DTensor`作为模块输出，默认值为 True。

返回

表示 nn.Module 的 Colwise 分片的`ParallelStyle`对象。

示例::

```py
>>> from torch.distributed.tensor.parallel import parallelize_module, ColwiseParallel
>>> ...
>>> # By default, the input of the "w1" Linear will be annotated to Replicated DTensor
>>> # and the output of "w1" will return :class:`torch.Tensor` that shards on the last dim.
>>>>
>>> parallelize_module(
>>>     module=block, # this can be a submodule or module
>>>     ...,
>>>     parallelize_plan={"w1": ColwiseParallel()},
>>> )
>>> ... 
```

注意

默认情况下，如果未指定`output_layouts`，则`ColwiseParallel`输出在最后一个维度上进行分片，如果有需要特定张量形状的运算符（即在配对的`RowwiseParallel`之前），请记住，如果输出被分片，运算符可能需要调整为分片大小。

```py
class torch.distributed.tensor.parallel.RowwiseParallel(*, input_layouts=None, output_layouts=None, use_local_output=True)¶
```

将兼容的 nn.Module 按行划分。目前仅支持 nn.Linear。用户可以将其与 ColwiseParallel 组合，以实现更复杂模块的分片（即 MLP，Attention）

关键参数

+   **input_layouts** (*Placement**,* *optional*) – nn.Module 的输入张量的 DTensor 布局，用于注释输入张量以成为 DTensor。如果未指定，我们假定输入张量在最后一个维度上被分片。

+   **output_layouts** (*Placement**,* *optional*) – nn.Module 输出的 DTensor 布局，用于确保 nn.Module 的输出具有用户期望的布局。如果未指定，则输出张量将被复制。

+   **use_local_output** ([*bool*](https://docs.python.org/3/library/functions.html#bool "(in Python v3.12)")*,* *optional*) – 是否使用本地`torch.Tensor`而不是`DTensor`作为模块输出，默认值为 True。

返回

代表 nn.Module 的 Rowwise 分片的`ParallelStyle`对象。

示例::

```py
>>> from torch.distributed.tensor.parallel import parallelize_module, RowwiseParallel
>>> ...
>>> # By default, the input of the "w2" Linear will be annotated to DTensor that shards on the last dim
>>> # and the output of "w2" will return a replicated :class:`torch.Tensor`.
>>>
>>> parallelize_module(
>>>     module=block, # this can be a submodule or module
>>>     ...,
>>>     parallelize_plan={"w2": RowwiseParallel()},
>>> )
>>> ... 
```

要简单配置 nn.Module 的输入和输出以及执行必要的布局重分配，而不将模块参数分发到 DTensors，可以在`parallelize_module`的`parallelize_plan`中使用以下类：

```py
class torch.distributed.tensor.parallel.PrepareModuleInput(*, input_layouts, desired_input_layouts, use_local_output=False)¶
```

根据`input_layouts`配置 nn.Module 的输入，根据`desired_input_layouts`执行布局重分配，将 nn.Module 的输入张量转换为 DTensors。

关键参数

+   **input_layouts** (*Union**[**Placement**,* *Tuple**[**Placement**]**]*) – nn.Module 的输入张量的 DTensor 布局，用于将输入张量转换为 DTensors。如果某些输入不是 torch.Tensor 或不需要转换为 DTensors，则需要指定`None`作为占位符。

+   **desired_input_layouts** (*Union**[**Placement**,* *Tuple**[**Placement**]**]*) – nn.Module 输入张量的期望 DTensor 布局，用于确保 nn.Module 的输入具有期望的 DTensor 布局。此参数需要与`input_layouts`具有相同的长度。

+   **use_local_output** ([*bool*](https://docs.python.org/3/library/functions.html#bool "(in Python v3.12)")*,* *optional*) – 是否使用本地`torch.Tensor`而不是`DTensor`作为模块输入，默认值为 False。

返回

准备 nn.Module 输入的分片布局的`ParallelStyle`对象。

示例::

```py
>>> from torch.distributed.tensor.parallel import parallelize_module, PrepareModuleInput
>>> ...
>>> # According to the style specified below, the first input of attn will be annotated to Sharded DTensor
>>> # and then redistributed to Replicated DTensor.
>>> parallelize_module(
>>>     module=block, # this can be a submodule or module
>>>     ...,
>>>     parallelize_plan={
>>>         "attn": PrepareModuleInput(
>>>             input_layouts=(Shard(0), None, None, ...),
>>>             desired_input_layouts=(Replicate(), None, None, ...)
>>>         ),
>>>     }
>>> ) 
```

```py
class torch.distributed.tensor.parallel.PrepareModuleOutput(*, output_layouts, desired_output_layouts, use_local_output=True)¶
```

根据`output_layouts`配置 nn.Module 的输出，根据`desired_output_layouts`执行布局重分配，将 nn.Module 的输出张量转换为 DTensors。

关键参数

+   **output_layouts** (*Union**[**Placement**,* *Tuple**[**Placement**]**]*) – nn.Module 输出张量的 DTensor 布局，用于将输出张量转换为 DTensors（如果它们是`torch.Tensor`）。如果某些输出不是 torch.Tensor 或不需要转换为 DTensors，则需要指定`None`作为占位符。

+   **desired_output_layouts** (*Union**[**Placement**,* *Tuple**[**Placement**]**]*) – nn.Module 输出张量的期望 DTensor 布局，用于确保 nn.Module 的输出具有期望的 DTensor 布局。

+   **use_local_output** ([*bool*](https://docs.python.org/3/library/functions.html#bool "(in Python v3.12)")*,* *optional*) – 是否使用本地`torch.Tensor`而不是`DTensor`作为模块输出，默认值为 False。

返回

准备 nn.Module 输出的分片布局的`ParallelStyle`对象。

示例::

```py
>>> from torch.distributed.tensor.parallel import parallelize_module, PrepareModuleOutput
>>> ...
>>> # According to the style specified below, the first input of attn will be annotated to Sharded DTensor
>>> # and then redistributed to Replicated DTensor.
>>> parallelize_module(
>>>     module=block, # this can be a submodule or module
>>>     ...,
>>>     parallelize_plan={
>>>         "submodule": PrepareModuleOutput(
>>>             output_layouts=Replicate(),
>>>             desired_output_layouts=Shard(0)
>>>         ),
>>>     }
>>> ) 
```

对于 Transformer 等模型，我们建议用户在`parallelize_plan`中同时使用`ColwiseParallel`和`RowwiseParallel`来实现整个模型的期望分片（即 Attention 和 MLP）。