Tensor Parallelism - torch.distributed.tensor.parallel

Tensor Parallelism (TP) is built on top of the PyTorch DistributedTensor (DTensor) and provides different parallelism styles: Colwise and Rowwise Parallelism.

Warning

Tensor Parallelism APIs are experimental and subject to change.

The entrypoint to parallelize your nn.Module using Tensor Parallelism is:

torch.distributed.tensor.parallel.parallelize_module(module, device_mesh, parallelize_plan, tp_mesh_dim=0)

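Apply Tensor Parallelism in PyTorch by parallelizing a module or its sub-modules according to a user-specified plan.

The module or sub-modules are parallelized according to parallelize_plan. The plan contains ParallelStyle objects, which indicate how the user wants each module or sub-module to be parallelized.

Users can also specify a different parallel style for each module fully qualified name (FQN).

Note that parallelize_module only accepts a 1-D DeviceMesh. If you have a 2-D or N-D DeviceMesh, slice it into a 1-D sub-mesh first and pass that to this API (e.g. device_mesh["tp"]).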

Parameters

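module (nn.Module) – Module to be parallelized.
device_mesh (DeviceMesh) – Object which describes the mesh topology of devices for the DTensor.
parallelize_plan (Union[ParallelStyle, Dict[str, ParallelStyle]]) – The plan used to parallelize the module. It can be either a ParallelStyle object, which describes how to prepare input/output for Tensor Parallelism, or a dict that maps module FQNs to their corresponding ParallelStyle objects.
tp_mesh_dim (int, deprecated) – The dimension of device_mesh on which Tensor Parallelism is performed. This field is deprecated and will be removed in the future. If you have a 2-D or N-D DeviceMesh, consider passing in device_mesh["tp"] instead.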

Returns

The parallelized nn.Module object.

Return type

Module

Example::

>>> from torch.distributed.tensor.parallel import parallelize_module, ColwiseParallel, RowwiseParallel
>>> from torch.distributed.device_mesh import init_device_mesh
>>>
>>> # Define the module.
>>> m = Model(...)
>>> tp_mesh = init_device_mesh("cuda", (8,))
>>> m = parallelize_module(m, tp_mesh, {"w1": ColwiseParallel(), "w2": RowwiseParallel()})
>>>
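
The 1-D mesh requirement described above can be pictured with a small pure-Python sketch. It makes no PyTorch calls; it only models a 2-D mesh as a nested list of ranks and shows which ranks would belong to the 1-D "tp" sub-mesh that contains a given rank. The dimension names "dp" and "tp", the 2 x 4 shape, and the helper submesh_for_rank are assumptions made for illustration and are not part of the API.

```python
# Pure-Python model of slicing a 2-D device mesh; no PyTorch calls are made.
# Assumed layout: 2 data-parallel groups ("dp") x 4 tensor-parallel ranks ("tp").
MESH = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
]
DIM_NAMES = ("dp", "tp")


def submesh_for_rank(mesh, dim_names, dim_name, rank):
    """Return the ranks that share the 1-D sub-mesh along `dim_name` with `rank`."""
    # Locate the coordinates of `rank` in the 2-D mesh.
    for row_idx, row in enumerate(mesh):
        if rank in row:
            col_idx = row.index(rank)
            break
    else:
        raise ValueError(f"rank {rank} is not in the mesh")

    if dim_names.index(dim_name) == 0:
        # Vary the first dimension, keep the column fixed.
        return [r[col_idx] for r in mesh]
    # Vary the second dimension, keep the row fixed.
    return list(mesh[row_idx])


tp_ranks = submesh_for_rank(MESH, DIM_NAMES, "tp", rank=5)
dp_ranks = submesh_for_rank(MESH, DIM_NAMES, "dp", rank=5)
print(tp_ranks)  # [4, 5, 6, 7]
print(dp_ranks)  # [1, 5]
```

In real code the corresponding 1-D mesh is obtained by indexing the DeviceMesh by dimension name, as in device_mesh["tp"], and that sub-mesh is what parallelize_module expects.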

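Note

For complex module architectures such as Attention and MLP layers, we recommend composing different ParallelStyles (e.g. ColwiseParallel and RowwiseParallel) and passing them as the parallelize_plan to achieve the desired sharding computation.

Tensor Parallelism supports the following parallel styles: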

class torch.distributed.tensor.parallel.ColwiseParallel(*, input_layouts=None, output_layouts=None, use_local_output=True)

Partition a compatible nn.Module in a column-wise fashion. Currently supports nn.Linear and nn.Embedding. Users can compose it with RowwiseParallel to shard more complicated modules (e.g. MLP, Attention).

Keyword Arguments

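input_layouts (Placement, optional) – The DTensor layout of the input tensor for the nn.Module, used to annotate the input tensor so that it becomes a DTensor. If not specified, the input tensor is assumed to be replicated.
output_layouts (Placement, optional) – The DTensor layout of the output of the nn.Module, used to ensure the output has the user-desired layout. If not specified, the output tensor is sharded on the last dimension.
use_local_output (bool, optional) – Whether to use a local torch.Tensor instead of a DTensor for the module output, default: True.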

Returns

A ParallelStyle object that represents Colwise sharding of the nn.Module.

Example::

>>> from torch.distributed.tensor.parallel import parallelize_module, ColwiseParallel
>>> ...
>>> # By default, the input of the "w1" Linear will be annotated to Replicated DTensor
>>> # and the output of "w1" will return a torch.Tensor that shards on the last dim.
>>>
>>> parallelize_module(
>>>     module=block, # this can be a submodule or module
>>>     ...,
>>>     parallelize_plan={"w1": ColwiseParallel()},
>>> )
>>> ...

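Note

By default, if output_layouts is not specified, the ColwiseParallel output is sharded on the last dimension. If there are operators that require a specific tensor shape (e.g. before the paired RowwiseParallel), keep in mind that those operators may need to be adjusted to the sharded size.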
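
The shape consequence mentioned in the note can be seen in a small simulation. The sketch below uses plain Python lists rather than DTensor, treats the layer as y = x @ W with W stored as (in_features, out_features) for readability, and models "ranks" as list entries; the helper names are hypothetical. It is meant only to illustrate the arithmetic of column-wise sharding, not how ColwiseParallel is implemented.

```python
# Pure-Python simulation of column-wise sharding of a linear layer y = x @ W.
# No PyTorch or communication is involved; "ranks" are just list entries.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w, world_size):
    """Split W (in_features x out_features) into equal column blocks."""
    step = len(w[0]) // world_size
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(world_size)]


x = [[1, 2], [3, 4]]                # replicated input, shape (2, 2)
w = [[1, 0, 2, 1], [0, 1, 1, 3]]    # full weight, shape (2, 4)
world_size = 2

# Each rank multiplies the full input by its own column block of W.
local_outputs = [matmul(x, shard) for shard in split_columns(w, world_size)]

# Every local output holds out_features / world_size columns (sharded last dim).
local_shape = (len(local_outputs[0]), len(local_outputs[0][0]))

# Concatenating the local outputs along the last dim recovers the full result.
gathered = [sum((out[i] for out in local_outputs), []) for i in range(len(x))]

print(local_shape)               # (2, 2)
print(gathered == matmul(x, w))  # True
```

Each simulated rank ends up with a (2, 2) local output instead of the full (2, 4) result, which is why an operator that hard-codes the full last dimension would need to use the sharded size instead.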

class torch.distributed.tensor.parallel.RowwiseParallel(*, input_layouts=None, output_layouts=None, use_local_output=True)

Partition a compatible nn.Module in a row-wise fashion. Currently supports nn.Linear only. Users can compose it with ColwiseParallel to shard more complicated modules (e.g. MLP, Attention).

Keyword Arguments

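input_layouts (Placement, optional) – The DTensor layout of the input tensor for the nn.Module, used to annotate the input tensor so that it becomes a DTensor. If not specified, the input tensor is assumed to be sharded on the last dimension.
output_layouts (Placement, optional) – The DTensor layout of the output of the nn.Module, used to ensure the output has the user-desired layout. If not specified, the output tensor is replicated.
use_local_output (bool, optional) – Whether to use a local torch.Tensor instead of a DTensor for the module output, default: True.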

Returns

A ParallelStyle object that represents Rowwise sharding of the nn.Module.

Example::

>>> from torch.distributed.tensor.parallel import parallelize_module, RowwiseParallel
>>> ...
>>> # By default, the input of the "w2" Linear will be annotated to DTensor that shards on the last dim
>>> # and the output of "w2" will return a replicated torch.Tensor.
>>>
>>> parallelize_module(
>>>     module=block, # this can be a submodule or module
>>>     ...,
>>>     parallelize_plan={"w2": RowwiseParallel()},
>>> )
>>> ...
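
The default layouts for RowwiseParallel (input sharded on the last dimension, output replicated) correspond to the following arithmetic, sketched here with plain Python lists. The layer is again written as y = x @ W with W stored as (in_features, out_features); the summation at the end stands in for the reduction across ranks. All helper names are hypothetical and nothing here calls PyTorch.

```python
# Pure-Python simulation of row-wise sharding of a linear layer y = x @ W.
# No PyTorch or communication is involved; "ranks" are just list entries.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


x = [[1, 2, 3, 4]]                      # full input, shape (1, 4)
w = [[1, 0], [0, 1], [2, 1], [1, 3]]    # full weight, shape (4, 2)
world_size = 2
step = len(w) // world_size

# The input is sharded on its last dim; the weight is sharded on its rows.
x_shards = [[row[r * step:(r + 1) * step] for row in x] for r in range(world_size)]
w_shards = [w[r * step:(r + 1) * step] for r in range(world_size)]

# Each rank produces a partial result with the full output shape.
partials = [matmul(xs, ws) for xs, ws in zip(x_shards, w_shards)]

# Summing the partial results (the role of the cross-rank reduction)
# yields the same output on every rank, i.e. a replicated result.
reduced = partials[0]
for partial in partials[1:]:
    reduced = add(reduced, partial)

print(partials)  # [[[1, 2]], [[10, 15]]]
print(reduced)   # [[11, 17]]
```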

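To simply configure the nn.Module's inputs and outputs with DTensor layouts and perform the necessary layout redistributions, without distributing the module parameters to DTensors, the following classes can be used in the parallelize_plan of parallelize_module: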

class torch.distributed.tensor.parallel.PrepareModuleInput(*, input_layouts, desired_input_layouts, use_local_output=False)

Configure the nn.Module's inputs so that, at runtime, the input tensors are converted to DTensors according to input_layouts and then redistributed according to desired_input_layouts.

Keyword Arguments

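input_layouts (Union[Placement, Tuple[Placement]]) – The DTensor layouts of the input tensors for the nn.Module, used to convert the input tensors to DTensors. If some inputs are not torch.Tensor or do not need to be converted to DTensors, None must be specified as a placeholder.
desired_input_layouts (Union[Placement, Tuple[Placement]]) – The desired DTensor layouts of the input tensors for the nn.Module, used to ensure the inputs have the desired DTensor layouts. This argument must have the same length as input_layouts.
use_local_output (bool, optional) – Whether to use a local torch.Tensor instead of a DTensor for the module inputs, default: False.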

Returns

A ParallelStyle object that prepares the sharding layouts of the nn.Module's inputs.

Example::

>>> from torch.distributed.tensor.parallel import parallelize_module, PrepareModuleInput
>>> from torch.distributed._tensor import Shard, Replicate
>>> ...
>>> # According to the style specified below, the first input of attn will be annotated to Sharded DTensor
>>> # and then redistributed to Replicated DTensor.
>>> parallelize_module(
>>>     module=block, # this can be a submodule or module
>>>     ...,
>>>     parallelize_plan={
>>>         "attn": PrepareModuleInput(
>>>             input_layouts=(Shard(0), None, None, ...),
>>>             desired_input_layouts=(Replicate(), None, None, ...)
>>>         ),
>>>     }
>>> )
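
What the Shard(0) to Replicate() redistribution in this example means for the data can be pictured with plain Python lists. The sketch below is only a model: the helpers shard_dim0 and all_gather_dim0 are hypothetical names, the split is assumed to be even, and no PyTorch collective is invoked.

```python
# Pure-Python picture of redistributing an input from Shard(0) to Replicate().
# "Ranks" are list entries; real redistribution uses collective communication.

def shard_dim0(tensor, world_size):
    """Split a list of rows into equal contiguous chunks, one per rank."""
    step = len(tensor) // world_size
    return [tensor[r * step:(r + 1) * step] for r in range(world_size)]


def all_gather_dim0(local_shards):
    """Give every rank the concatenation of all shards along dim 0."""
    full = [row for shard in local_shards for row in shard]
    return [list(full) for _ in local_shards]


full_input = [[0, 1], [2, 3], [4, 5], [6, 7]]   # logical tensor, shape (4, 2)
world_size = 2

# Before redistribution: each rank holds only its slice of dim 0.
local_inputs = shard_dim0(full_input, world_size)

# After redistribution: each rank holds the whole tensor.
replicated = all_gather_dim0(local_inputs)

print(local_inputs[1])               # [[4, 5], [6, 7]]
print(replicated[0] == full_input)   # True
print(replicated[1] == full_input)   # True
```

The None entries in input_layouts and desired_input_layouts mark inputs that are passed through untouched, so only the first input would go through a step like this.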

class torch.distributed.tensor.parallel.PrepareModuleOutput(*, output_layouts, desired_output_layouts, use_local_output=True)

Configure the nn.Module's outputs so that, at runtime, the output tensors are converted to DTensors according to output_layouts and then redistributed according to desired_output_layouts.

Keyword Arguments

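output_layouts (Union[Placement, Tuple[Placement]]) – The DTensor layouts of the output tensors for the nn.Module, used to convert the output tensors to DTensors if they are torch.Tensor. If some outputs are not torch.Tensor or do not need to be converted to DTensors, None must be specified as a placeholder.
desired_output_layouts (Union[Placement, Tuple[Placement]]) – The desired DTensor layouts of the output tensors for the nn.Module, used to ensure the outputs have the desired DTensor layouts.
use_local_output (bool, optional) – Whether to use a local torch.Tensor instead of a DTensor for the module outputs, default: True.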

Returns

A ParallelStyle object that prepares the sharding layouts of the nn.Module's outputs.

Example::

>>> from torch.distributed.tensor.parallel import parallelize_module, PrepareModuleOutput
>>> from torch.distributed._tensor import Shard, Replicate
>>> ...
>>> # According to the style specified below, the output of "submodule" will be converted to Replicated DTensor
>>> # and then redistributed to Sharded DTensor.
>>> parallelize_module(
>>>     module=block, # this can be a submodule or module
>>>     ...,
>>>     parallelize_plan={
>>>         "submodule": PrepareModuleOutput(
>>>             output_layouts=Replicate(),
>>>             desired_output_layouts=Shard(0)
>>>         ),
>>>     }
>>> )

For models like Transformer, we recommend using ColwiseParallel and RowwiseParallel together in the parallelize_plan to achieve the desired sharding for the entire model (i.e. Attention and MLP).
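
One way to see why the two styles pair well is to simulate a two-layer MLP in plain Python: the first layer is split by columns, the activation is applied locally, and the second layer is split by the matching rows, so a single summation at the end reproduces the unsharded result. This is a sketch of the arithmetic only; the weights w1 and w2, the ReLU activation, and the helper names are assumptions for illustration, with weights stored as (in_features, out_features).

```python
# Pure-Python simulation of an MLP sharded as Colwise (w1) then Rowwise (w2).
# No PyTorch or communication is involved; "ranks" are loop iterations.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Element-wise ReLU."""
    return [[max(v, 0) for v in row] for row in m]


x = [[1, -2]]                            # replicated input, shape (1, 2)
w1 = [[1, -1, 2, 0], [0, 1, 1, -1]]      # first layer, shape (2, 4)
w2 = [[1, 0], [2, 1], [0, 1], [1, 1]]    # second layer, shape (4, 2)
world_size = 2
step = len(w2) // world_size             # hidden features per rank

# Unsharded reference computation.
reference = matmul(relu(matmul(x, w1)), w2)

partials = []
for rank in range(world_size):
    cols = slice(rank * step, (rank + 1) * step)
    w1_local = [row[cols] for row in w1]       # column block of w1
    w2_local = w2[cols]                        # matching row block of w2
    hidden_local = relu(matmul(x, w1_local))   # stays sharded on the last dim
    partials.append(matmul(hidden_local, w2_local))

# A single sum across ranks (the role of the reduction) gives the full output.
output = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

print(output)               # [[3, 2]]
print(output == reference)  # True
```

Because the column-sharded hidden activation is already in the layout the row-sharded layer expects, no communication is needed between the two layers in this simulation.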
