Parallelism in Distributed Training
LLM Training Steps
- Forward Pass
- Backward Pass (Gradient Calculation)
- Optimizer Step: update the parameters using the calculated gradients
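
A minimal PyTorch sketch of these three steps on a single device (the linear model, loss, and random data below are just placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(32, 512)                         # dummy batch
target = torch.randn(32, 512)

# 1. Forward pass: compute activations and the loss
loss = loss_fn(model(x), target)

# 2. Backward pass: compute gradients of the loss w.r.t. the parameters
loss.backward()

# 3. Optimizer step: update the parameters using the gradients
optimizer.step()
optimizer.zero_grad()
```
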
Distributed Training Parallelism:
- Data Parallelism (DP): Split the data into batches and place a full copy of the model on each participating GPU
    - Synchronization: required after the backward pass and before the optimizer step (gradient aggregation using AllReduce; see the sketch below)
    - Limitation: model size is limited by the VRAM of a single GPU
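
A minimal sketch of this pattern with an explicit gradient AllReduce. It assumes the script is launched with torchrun on a single node (one process per GPU); in practice torch.nn.parallel.DistributedDataParallel performs this aggregation automatically and overlaps it with the backward pass:

```python
import torch
import torch.distributed as dist
import torch.nn as nn

dist.init_process_group(backend="nccl")          # one process per GPU
rank = dist.get_rank()
world_size = dist.get_world_size()
device = torch.device(f"cuda:{rank}")            # single-node assumption: rank == local GPU index

model = nn.Linear(512, 512).to(device)           # every GPU holds a full copy of the model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Each rank would load a different shard of the data; random tensors stand in here.
x = torch.randn(32, 512, device=device)
target = torch.randn(32, 512, device=device)

loss = nn.functional.mse_loss(model(x), target)
loss.backward()

# Synchronization point: average the gradients across all GPUs (AllReduce)
# after the backward pass and before the optimizer step.
for p in model.parameters():
    dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
    p.grad /= world_size

optimizer.step()
optimizer.zero_grad()
```
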
    - Others:
        - Sharded DDP (= ZeRO DP): Goal is to minimize local memory usage (see the FSDP sketch after this list)
            - Each GPU gets a horizontal slice of the weight parameters
            - Each GPU receives different input data for training
            - Each layer needs the other horizontal slices of the weight parameters to process the input tensor fully -> AllGather operation for each layer
            - The gathered parameters are freed from local memory, and the output activation is passed on to the subsequent layers
            - Synchronization: required after the backward pass and before the optimizer step (gradient aggregation, typically via ReduceScatter so that each GPU keeps only the gradient shard it owns)
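
One concrete way to get this behavior is PyTorch's FSDP wrapper, which follows the ZeRO-style shard / AllGather / free pattern described above. A minimal sketch, again assuming a torchrun launch on a single node:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()

# Each rank keeps only a shard of the parameters; the full weights are AllGathered
# layer by layer during forward/backward and freed immediately after use.
sharded_model = FSDP(model)
optimizer = torch.optim.SGD(sharded_model.parameters(), lr=1e-3)

x = torch.randn(8, 1024, device="cuda")          # different data on every rank
loss = sharded_model(x).pow(2).mean()
loss.backward()                                   # gradients are reduced back to their owning shard
optimizer.step()
optimizer.zero_grad()
```
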
- Model Parallelism (MP): Split the model horizontally or vertically
    - Tensor Parallelism (TP): Split the model horizontally (mainly to compute matrix multiplications faster); each GPU holds a portion of the split parameter matrix
        - Synchronization: required after consecutive matrix multiplications to combine the partial results (see the sketch below)
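
A toy, single-process illustration of the tensor-parallel matmul split. The two "GPUs" are simulated as plain tensors; a real implementation (e.g. Megatron-LM style) would use torch.distributed collectives instead of cat and +:

```python
import torch

x = torch.randn(4, 8)                 # input activation
W = torch.randn(8, 6)                 # full weight matrix of the first layer

# Column-parallel split: each GPU owns half of the output columns.
W0, W1 = W.chunk(2, dim=1)
y0 = x @ W0                           # computed on GPU 0
y1 = x @ W1                           # computed on GPU 1

# Synchronization: gather the partial outputs to reconstruct the full result (AllGather).
y = torch.cat([y0, y1], dim=1)
assert torch.allclose(y, x @ W, atol=1e-4)

# Row-parallel split of the next matrix: partial results must be summed (AllReduce).
V = torch.randn(6, 8)
V0, V1 = V.chunk(2, dim=0)
z = y0 @ V0 + y1 @ V1                 # sum of partials equals the full y @ V
assert torch.allclose(z, y @ V, atol=1e-4)
```
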
    - Pipeline Parallelism (PP): Split the model vertically (groups of consecutive layers are distributed across the GPUs)
        - Synchronization: required after each group of layers
        - Limitation: relying purely on pipeline parallelism can lead to idle resources (pipeline bubbles), since later stages wait for earlier ones
        - Others:
            - GPipe: https://research.google/blog/introducing-gpipe-an-open-source-library-for-efficiently-training-large-scale-neural-network-models/
                - Split the data into smaller chunks (micro-batches) and feed them to the first stage of the model (first GPU) one after another to prevent idle resources (see the sketch below)
            - Interleaved Pipeline
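
A toy, single-process sketch of the GPipe-style micro-batching idea. The two pipeline stages are simulated as plain modules on one device; in a real setup each stage lives on its own GPU, and stage 1 processes micro-batch i while stage 0 already works on micro-batch i+1, which shrinks the pipeline bubbles:

```python
import torch
import torch.nn as nn

stage0 = nn.Linear(128, 128)          # would live on GPU 0
stage1 = nn.Linear(128, 128)          # would live on GPU 1

batch = torch.randn(32, 128)
micro_batches = batch.chunk(4, dim=0) # 4 micro-batches of 8 samples each

outputs = []
for mb in micro_batches:
    h = stage0(mb)                    # stage 0 finishes micro-batch i ...
    outputs.append(stage1(h))         # ... so stage 1 can start on it without waiting for the full batch

out = torch.cat(outputs, dim=0)       # reassemble the full batch output
```
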
