Parallelism in Distributed Training
LLM Training Steps
- Forward Pass
- Backward Pass (Gradient Calculation)
- Optimizer Step: update the parameters using the calculated gradients
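
A minimal PyTorch sketch of these three steps on a single device (the linear model, loss, and random data below are just placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(32, 512)                         # dummy batch
target = torch.randn(32, 512)

# 1. Forward pass: compute activations and the loss
loss = loss_fn(model(x), target)

# 2. Backward pass: compute gradients of the loss w.r.t. the parameters
loss.backward()

# 3. Optimizer step: update the parameters using the gradients
optimizer.step()
optimizer.zero_grad()
```
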
Distributed Training Parallelism:
- Data Parallelism (DP): Split the data into batches and place a full copy of the model on each participating GPU
    - Synchronization: required after the backward pass and before the optimizer step (gradient aggregation using AllReduce; see the sketch below)
    - Limitation: model size is limited by the VRAM of a single GPU
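
A minimal sketch of this pattern with an explicit gradient AllReduce. It assumes the script is launched with torchrun on a single node (one process per GPU); in practice torch.nn.parallel.DistributedDataParallel performs this aggregation automatically and overlaps it with the backward pass:

```python
import torch
import torch.distributed as dist
import torch.nn as nn

dist.init_process_group(backend="nccl")          # one process per GPU
rank = dist.get_rank()
world_size = dist.get_world_size()
device = torch.device(f"cuda:{rank}")            # single-node assumption: rank == local GPU index

model = nn.Linear(512, 512).to(device)           # every GPU holds a full copy of the model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Each rank would load a different shard of the data; random tensors stand in here.
x = torch.randn(32, 512, device=device)
target = torch.randn(32, 512, device=device)

loss = nn.functional.mse_loss(model(x), target)
loss.backward()

# Synchronization point: average the gradients across all GPUs (AllReduce)
# after the backward pass and before the optimizer step.
for p in model.parameters():
    dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
    p.grad /= world_size

optimizer.step()
optimizer.zero_grad()
```
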
    - Others:
        - Sharded DDP (= ZeRO DP): Goal is to minimize local memory usage (see the FSDP sketch after this list)
            - Each GPU gets a horizontal slice of the weight parameters
            - Each GPU receives different input data for training
            - Each layer needs the other horizontal slices of the weight parameters to process the input tensor fully -> AllGather operation for each layer
            - The gathered parameters are freed from local memory, and the output activation is passed on to the subsequent layers
            - Synchronization: required after the backward pass and before the optimizer step (gradient aggregation, typically via ReduceScatter so that each GPU keeps only the gradient shard it owns)
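
One concrete way to get this behavior is PyTorch's FSDP wrapper, which follows the ZeRO-style shard / AllGather / free pattern described above. A minimal sketch, again assuming a torchrun launch on a single node:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()

# Each rank keeps only a shard of the parameters; the full weights are AllGathered
# layer by layer during forward/backward and freed immediately after use.
sharded_model = FSDP(model)
optimizer = torch.optim.SGD(sharded_model.parameters(), lr=1e-3)

x = torch.randn(8, 1024, device="cuda")          # different data on every rank
loss = sharded_model(x).pow(2).mean()
loss.backward()                                   # gradients are reduced back to their owning shard
optimizer.step()
optimizer.zero_grad()
```
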
- Model Parallelism (MP): Split the model horizontally or vertically
    - Tensor Parallelism (TP): Split the model horizontally (mainly to compute matrix multiplications faster); each GPU holds a portion of the split parameter matrix
        - Synchronization: required after consecutive matrix multiplications to combine the partial results (see the sketch below)
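
A toy, single-process illustration of the tensor-parallel matmul split. The two "GPUs" are simulated as plain tensors; a real implementation (e.g. Megatron-LM style) would use torch.distributed collectives instead of cat and +:

```python
import torch

x = torch.randn(4, 8)                 # input activation
W = torch.randn(8, 6)                 # full weight matrix of the first layer

# Column-parallel split: each GPU owns half of the output columns.
W0, W1 = W.chunk(2, dim=1)
y0 = x @ W0                           # computed on GPU 0
y1 = x @ W1                           # computed on GPU 1

# Synchronization: gather the partial outputs to reconstruct the full result (AllGather).
y = torch.cat([y0, y1], dim=1)
assert torch.allclose(y, x @ W, atol=1e-4)

# Row-parallel split of the next matrix: partial results must be summed (AllReduce).
V = torch.randn(6, 8)
V0, V1 = V.chunk(2, dim=0)
z = y0 @ V0 + y1 @ V1                 # sum of partials equals the full y @ V
assert torch.allclose(z, y @ V, atol=1e-4)
```
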
    - Pipeline Parallelism (PP): Split the model vertically (groups of consecutive layers are distributed across the GPUs)
        - Synchronization: required after each group of layers
        - Limitation: relying purely on pipeline parallelism can lead to idle resources (pipeline bubbles), since later stages wait for earlier ones
        - Others:
            - GPipe: https://research.google/blog/introducing-gpipe-an-open-source-library-for-efficiently-training-large-scale-neural-network-models/
                - Split the data into smaller chunks (micro-batches) and feed them to the first stage of the model (first GPU) one after another to prevent idle resources (see the sketch below)
            - Interleaved Pipeline
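
A toy, single-process sketch of the GPipe-style micro-batching idea. The two pipeline stages are simulated as plain modules on one device; in a real setup each stage lives on its own GPU, and stage 1 processes micro-batch i while stage 0 already works on micro-batch i+1, which shrinks the pipeline bubbles:

```python
import torch
import torch.nn as nn

stage0 = nn.Linear(128, 128)          # would live on GPU 0
stage1 = nn.Linear(128, 128)          # would live on GPU 1

batch = torch.randn(32, 128)
micro_batches = batch.chunk(4, dim=0) # 4 micro-batches of 8 samples each

outputs = []
for mb in micro_batches:
    h = stage0(mb)                    # stage 0 finishes micro-batch i ...
    outputs.append(stage1(h))         # ... so stage 1 can start on it without waiting for the full batch

out = torch.cat(outputs, dim=0)       # reassemble the full batch output
```
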
