Parallelism in Distributed Training

LLM Training Steps

  1. Forward Pass
  2. Backward Pass (Gradient Calculation)
  3. Optimizer Step: update the parameters using the calculated gradients
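
A minimal sketch of these three steps in PyTorch, using a toy linear model, random data, and an AdamW optimizer purely for illustration:

```python
import torch
import torch.nn as nn

# Toy model, optimizer, and data (illustrative; any model works the same way)
model = nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

x = torch.randn(8, 512)        # a mini-batch of inputs
target = torch.randn(8, 512)   # matching targets

# 1. Forward pass: compute activations and the loss
loss = loss_fn(model(x), target)

# 2. Backward pass: compute gradients of the loss w.r.t. every parameter
loss.backward()

# 3. Optimizer step: update the parameters using the calculated gradients
optimizer.step()
optimizer.zero_grad()
```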

Distributed Training Parallelism:

  1. Data Parallelism (DP): Split the data into batches and replicate the full model on each participating GPU (see the DDP sketch after this list)
    • Synchronization: required after the backward pass and before the optimizer step (gradient aggregation using AllReduce)
    • Limitation: the whole model must fit in a single GPU's VRAM
    • Others:
      • Sharded DDP (= ZeRO-DP): goal is to minimize per-GPU memory usage (see the sharded-layer sketch after this list)

        • Each GPU stores only a horizontal slice (shard) of the weight parameters
        • Each GPU receives different input data for training
        • To process the input tensor fully, each layer needs the slices held by the other GPUs, so an AllGather is performed for each layer
        • Once the layer has run, the gathered parameters are freed from local memory and the output activation is passed on to the subsequent layers
  2. Model Parallelism (MP): Split the model horizontally or vertically
    1. Tensor Parallelism (TP): Split the model horizontally (mainly to compute large matrix multiplications faster); each GPU holds a portion of the split parameter matrix (see the tensor-parallel MLP sketch after this list)

      • Synchronization: required after consecutive matrix multiplications (e.g., an AllReduce that combines the partial results)
    2. Pipeline Parallelism (PP): Split the model vertically; a contiguous group of layers lives on each GPU and activations are passed from stage to stage (see the pipeline sketch below)
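
Data parallelism (DDP) sketch: a minimal PyTorch DistributedDataParallel example, assuming it is launched with `torchrun` (which sets `RANK`, `WORLD_SIZE`, and `LOCAL_RANK`) on NVIDIA GPUs; the model, data, and hyperparameters are placeholders.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes launch via torchrun, which sets RANK/WORLD_SIZE/LOCAL_RANK.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Every GPU holds a full replica of the model.
model = nn.Linear(512, 512).cuda(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

# Each rank sees a different slice of the batch (random data as a stand-in here).
x = torch.randn(8, 512, device=f"cuda:{local_rank}")
target = torch.randn(8, 512, device=f"cuda:{local_rank}")

loss = nn.functional.mse_loss(ddp_model(x), target)
loss.backward()        # DDP AllReduces (averages) gradients across ranks here
optimizer.step()       # every replica applies the same averaged update
optimizer.zero_grad()
```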
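
Sharded DDP / ZeRO-DP sketch: a hand-rolled illustration of the per-layer AllGather for a single sharded linear layer. The function name and shard layout are assumptions for illustration; real systems (DeepSpeed ZeRO, PyTorch FSDP) also shard gradients and optimizer state and overlap communication with compute.

```python
import torch
import torch.distributed as dist

def sharded_linear_forward(x, weight_shard, bias):
    """Forward pass of one linear layer whose weight rows are sharded across ranks.

    Each rank stores only its shard; the full weight is AllGathered just in time,
    used for the matmul, and then freed so only the shard stays in GPU memory.
    (Illustrative sketch; the bias is kept replicated for simplicity.)
    """
    world_size = dist.get_world_size()
    gathered = [torch.empty_like(weight_shard) for _ in range(world_size)]
    dist.all_gather(gathered, weight_shard)     # collect every rank's slice
    full_weight = torch.cat(gathered, dim=0)    # (out_features, in_features)
    out = x @ full_weight.t() + bias
    del gathered, full_weight                   # free the gathered parameters
    return out
```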
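
Tensor parallelism sketch: a Megatron-style MLP where the first weight matrix is split column-wise and the second row-wise, so the two consecutive matmuls run on local shards and a single AllReduce afterwards reconstructs the full output. Names and shapes are illustrative.

```python
import torch
import torch.distributed as dist

def tensor_parallel_mlp(x, w1_col_shard, w2_row_shard):
    """Tensor-parallel two-layer MLP sketch (function name is illustrative).

    w1 is split column-wise and w2 row-wise across ranks, so both matmuls run
    on local shards with no communication in between; one AllReduce after the
    pair sums the partial results into the full output on every rank.
    """
    h = torch.relu(x @ w1_col_shard)   # (batch, hidden / world_size): local column slice
    partial = h @ w2_row_shard         # (batch, out): partial sum on this rank
    dist.all_reduce(partial)           # sum partial results -> full output everywhere
    return partial
```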
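
Pipeline parallelism sketch: a naive forward pass for one micro-batch where each rank owns one stage (a contiguous group of layers) and activations are handed to the next rank with point-to-point send/recv. Real implementations add micro-batching and 1F1B-style scheduling to keep all GPUs busy; the shapes and the CPU/gloo setup here are assumptions.

```python
import torch
import torch.distributed as dist

def pipeline_forward(x_or_none, stage, rank, world_size):
    """Naive pipeline-parallel forward for one micro-batch (no 1F1B scheduling).

    Each rank owns one contiguous group of layers (`stage`); activations are
    passed along with point-to-point sends and receives (gloo backend, CPU
    tensors, and a fixed activation shape are assumed for simplicity).
    """
    if rank == 0:
        activation = x_or_none                  # first stage consumes the raw input
    else:
        activation = torch.empty(8, 512)        # activation shape assumed known in advance
        dist.recv(activation, src=rank - 1)     # receive from the previous stage

    activation = stage(activation)              # run this rank's group of layers

    if rank < world_size - 1:
        dist.send(activation, dst=rank + 1)     # hand the activation to the next stage
    return activation
```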