Parallelism | batch dim | sequence dim | hidden dim | weights | optimizer |
---|---|---|---|---|---|
DP | √ | | | | |
TP | | | √ | √(intra-layer) | |
Ring Attention | | √ | | | |
PP | | | | √(inter-layer) | √ |
ZeRO | √ | | | | √ |
FSDP | √ | | | √(intra-layer) | √ |
Data parallelism (DP) simply deploys multiple replicas of the service and splits the data across them.
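A minimal single-process sketch of the idea (the `nn.Linear` model, shapes, and two "replicas" here are hypothetical): every replica holds the same weights, each processes a different slice of the batch, and the concatenated outputs match the full-batch forward pass.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(32, 8)   # shared weights: every replica holds the same copy

batch = torch.rand(16, 32)
full_out = model(batch)    # single full-batch forward pass for reference

# "Replica 1" and "replica 2" each see a different half of the batch.
shard_1, shard_2 = batch.chunk(2, dim=0)
out_1 = model(shard_1)
out_2 = model(shard_2)

# Concatenating the per-replica outputs reproduces the full-batch result.
torch.testing.assert_close(full_out, torch.cat([out_1, out_2], dim=0))
```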
Tensor parallelism (TP) splits the matrix computations across multiple GPUs; once the partial computations finish, the results are combined with an AllGather or an AllReduce. AllGather goes with column-wise sharding, AllReduce with row-wise sharding.
For a matrix multiplication \(A \times B\):

- Column-wise sharding splits \(B\) by columns into \(B_{0}, B_{1}\). The results are combined with an AllGather: every shard computes a full set of columns of the target matrix.
- Row-wise sharding splits \(B\) by rows into \(B_{0}, B_{1}\) (and \(A\) by columns into \(A_{0}, A_{1}\) accordingly). The results are combined with an AllReduce: every shard computes a partial result of the whole target matrix.

The snippet below verifies the row-wise case on a single process, with the AllReduce replaced by an element-wise sum.
```python
import torch

# Non-split matrices used for verification.
A = torch.rand(16, 32)
B = torch.rand(32, 8)
C = A @ B

# Split A along columns, B along rows, then:
#
# GPU 1: A1, B1
# GPU 2: A2, B2
A1, A2 = A.chunk(2, dim=1)
B1, B2 = B.chunk(2, dim=0)
C1 = A1 @ B1
C2 = A2 @ B2

# Verify that the result is the same after an element-wise sum.
torch.testing.assert_close(C, C1 + C2)
```
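For completeness, a similar single-process sketch of the column-wise case (same hypothetical shapes): each shard computes a full set of columns of the target matrix, and concatenating the shards stands in for the AllGather.

```python
import torch

A = torch.rand(16, 32)
B = torch.rand(32, 8)
C = A @ B

# Split B along its columns; each GPU would hold one column shard.
B1, B2 = B.chunk(2, dim=1)   # B1, B2: (32, 4)
C1 = A @ B1                  # columns 0..3 of C
C2 = A @ B2                  # columns 4..7 of C

# Concatenation along the column dimension plays the role of the AllGather.
torch.testing.assert_close(C, torch.cat([C1, C2], dim=1))
```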
Pipeline parallelism (PP) splits the model by layer into several chunks, and each chunk is assigned to a device (a pipeline stage).
During the forward pass, each device passes its intermediate activations to the next stage; during the backward pass, it passes the gradient of its input tensor back to the previous stage.
With PP, each device operates on micro-batches, so that the stages can overlap their work and keep the pipeline busy.
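A minimal single-process sketch of the idea (the two-stage split, layer sizes, and micro-batch count are hypothetical): the model is cut by layer into two stages, the batch is cut into micro-batches, and each stage hands its activation to the next one. In a real deployment each stage would sit on its own device and the hand-off would be a point-to-point send/recv.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# The model is split by layer into two stages; in practice stage_1 and
# stage_2 would live on different devices.
stage_1 = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
stage_2 = nn.Linear(64, 8)

batch = torch.rand(16, 32)
micro_batches = batch.chunk(4, dim=0)

outputs = []
for mb in micro_batches:
    act = stage_1(mb)              # stage 1 forward
    outputs.append(stage_2(act))   # activation handed to stage 2

# The pipelined result matches running the whole model on the whole batch.
reference = stage_2(stage_1(batch))
torch.testing.assert_close(torch.cat(outputs, dim=0), reference)
```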
Expert Parallelism