Intra-Operator: partitions computationally intensive operators, such as matrix
multiplications, across multiple GPUs, accelerating computation but causing
substantial communication (see the sketch after these notes).
Inter-Operator: organizes LLM
layers into stages, each running on a GPU to form
pipelines.
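A minimal single-process sketch of the intra-operator idea, using NumPy arrays to stand in for per-GPU shards; the two-way column split and the array shapes are illustrative assumptions, not any particular framework's API.

```python
import numpy as np

# Simulate splitting one matrix multiplication (X @ W) across two "devices"
# by partitioning the weight W column-wise; each device computes only its own
# slice of the output, so computation shrinks, but the slices must later be
# exchanged over the network (see the Operations notes below).
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))      # (batch, hidden_in), replicated on both devices
W = rng.standard_normal((8, 6))      # (hidden_in, hidden_out), to be partitioned

W0, W1 = np.split(W, 2, axis=1)      # column-wise shards, one per device
Y0 = X @ W0                          # partial output on "device 0"
Y1 = X @ W1                          # partial output on "device 1"

# Only after communication do the shards form the full result.
assert np.allclose(np.concatenate([Y0, Y1], axis=1), X @ W)
```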
DP
Data Parallelism is most common due to its
simplicity:
the dataset is split into several shards, and each shard is allocated to a device.
Collective communications, which involve network-intensive operations, are
required after the backward pass to synchronize gradients across devices (as sketched below).
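A toy, single-process sketch of a data-parallel step, assuming a tiny least-squares model and two simulated workers; the AllReduce is stood in for by a plain mean over local gradients, since no real communication backend is involved.

```python
import numpy as np

# Each "device" holds a full replica of the parameters w, sees only its own data
# shard, and computes a local gradient; the AllReduce-style average afterwards
# keeps every replica's update identical.
rng = np.random.default_rng(0)
w = rng.standard_normal(8)                                # replicated parameters
X, y = rng.standard_normal((16, 8)), rng.standard_normal(16)

local_grads = []
for X_s, y_s in zip(np.split(X, 2), np.split(y, 2)):      # one shard per device
    err = X_s @ w - y_s                                    # local forward pass
    local_grads.append(2 * X_s.T @ err / len(y_s))         # local MSE gradient

grad = np.mean(local_grads, axis=0)                        # the network-intensive AllReduce step
w -= 0.01 * grad                                           # same update on every replica
```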
Operations
AG (AllGather): concatenates the outputs of column-wise partitioned weights
AR (AllReduce): sums the partial outputs of row-wise partitioned weights
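An in-process illustration of which collective matches which partitioning scheme, assuming arbitrary small shapes: column-wise shards produce disjoint output columns that an AllGather concatenates, while row-wise shards produce partial sums that an AllReduce adds up.

```python
import numpy as np

rng = np.random.default_rng(0)
X, W = rng.standard_normal((4, 8)), rng.standard_normal((8, 6))

# Column-wise partition of W: each device owns some output columns,
# so the full output is recovered by gathering (concatenating) them.
col_shards = np.split(W, 2, axis=1)
ag = np.concatenate([X @ s for s in col_shards], axis=1)      # AllGather

# Row-wise partition of W (with the matching split of X's columns): each device
# produces a partial sum of the full output, recovered by an element-wise sum.
row_W = np.split(W, 2, axis=0)
row_X = np.split(X, 2, axis=1)
ar = sum(x @ w for x, w in zip(row_X, row_W))                 # AllReduce

assert np.allclose(ag, X @ W) and np.allclose(ar, X @ W)
```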
PP
the model is split by layer into several chunks, and each chunk is assigned to a device.
During the forward pass, each device passes the
intermediate activation to the next stage. During the
backward pass, each device passes the gradient of the
input tensor back to the previous pipeline stage.
With PP, each device operates on micro-batches: the input batch is split into
micro-batches that are streamed through the pipeline stages.
GPipe
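A toy single-process walk-through of a GPipe-style forward schedule, assuming 3 stages and 4 micro-batches; the stage functions are string-tagging placeholders, and a real system would run the stages on different devices in parallel rather than in this serial loop.

```python
# Each "device" runs one stage; at clock step t, stage s works on micro-batch
# t - s, so activations flow stage by stage while later micro-batches enter
# behind them (the classic pipeline fill/drain pattern).
num_stages, num_microbatches = 3, 4
stages = [lambda act, s=s: f"{act}->stage{s}" for s in range(num_stages)]
microbatches = [f"mb{i}" for i in range(num_microbatches)]

for t in range(num_stages + num_microbatches - 1):
    for s in range(num_stages):
        m = t - s
        if 0 <= m < num_microbatches:
            # "pass the intermediate activation to the next stage"
            microbatches[m] = stages[s](microbatches[m])
            print(f"step {t}: stage {s} runs micro-batch {m}")

print(microbatches[0])   # mb0->stage0->stage1->stage2
```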
EP
Expert Parallelism: the experts of a Mixture-of-Experts (MoE) layer are
distributed across devices, and tokens are routed to the devices holding their
assigned experts.
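A toy sketch of the routing idea behind expert parallelism, assuming a top-1 router and one expert weight matrix per simulated device; the router, shapes, and expert definitions are illustrative assumptions.

```python
import numpy as np

# Each expert of an MoE layer lives on its own "device"; tokens are routed
# (an all-to-all exchange in a real system) to the device holding their expert,
# processed there, and written back to their original positions.
rng = np.random.default_rng(0)
num_experts, d_model, num_tokens = 4, 8, 10
tokens = rng.standard_normal((num_tokens, d_model))
experts = [rng.standard_normal((d_model, d_model)) for _ in range(num_experts)]

router = rng.standard_normal((d_model, num_experts))
assignment = (tokens @ router).argmax(axis=1)        # top-1 expert per token

out = np.empty_like(tokens)
for e, W_e in enumerate(experts):
    idx = np.where(assignment == e)[0]               # tokens sent to device e
    out[idx] = tokens[idx] @ W_e                     # expert e's computation on its device
```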
Batching
Continuous Batching: requests join and leave the batch at iteration granularity,
and GPU memory is allocated and freed dynamically per request
Static Batching: a fixed batch is formed up front, and static GPU memory is
reserved for each request until the whole batch finishes (both policies are
contrasted in the sketch after this list)
Batch size: how many user inputs are processed concurrently by the LLM.
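A toy comparison of the two batching policies, assuming each request needs a fixed number of decode steps and at most two requests fit on the GPU at once; the request names and step counts are illustrative assumptions.

```python
from collections import deque

workload = [("req0", 1), ("req1", 5), ("req2", 4), ("req3", 2)]   # (request, decode steps)
max_batch = 2

# Static batching: fixed batches are formed up front, and each batch holds its
# GPU memory until its slowest request finishes.
batches = [workload[:2], workload[2:]]
static_steps = sum(max(steps for _, steps in batch) for batch in batches)

# Continuous batching: a finished request frees its slot immediately, and a
# waiting request joins at the very next iteration, so memory is reused as soon
# as possible instead of being held for the whole batch.
queue, active, cont_steps = deque(workload), {}, 0
while queue or active:
    while queue and len(active) < max_batch:
        rid, need = queue.popleft()
        active[rid] = need                        # admit a waiting request mid-flight
    for rid in list(active):
        active[rid] -= 1                          # one decode step per active request
        if active[rid] == 0:
            del active[rid]                       # slot released this iteration
    cont_steps += 1

print(f"static batching: {static_steps} steps, continuous batching: {cont_steps} steps")  # 9 vs. 7
```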
Metrics
Key metrics for evaluating the performance of LLMs