Study on LLM Inference

LLM Inference

  • Prefill Phase: processes the entire user input prompt and initializes/updates the KV Cache → token logits become available
  • Token Sampling (a sampling process selects the next new token from the logits)
  • Decode Phase: each step runs the Transformer blocks (Attention + MLP layers) to produce one new token, repeating until an EOS (end-of-sequence) token is generated (a minimal sketch of this loop follows the list below)
    • Attention: the K and V vectors of the token currently being processed are computed and appended to the KV Cache
      • Requires access to the key and value activations of previously processed tokens to perform attention
      • These activations are stored in the KV Cache
    • MLP: transforms the attention output; after the last block, the resulting hidden state is used to compute the next token's logits
  • Additional notes:
    • There can be multiple Transformer blocks (layers)
      • each is responsible for extracting different context for the same token
      • each layer has its own KV Cache
      • each layer produces its own representation of the token (token_x) to serve as input to the next Transformer block
    • There can be multiple attention heads (run in parallel) within a single Transformer block
      • each head is responsible for focusing on different parts of the sequence for a given token
      • each head outputs a vector representing the token
      • these per-head vectors are concatenated and then projected to the dimensions expected by the MLP layer
    • For a Transformer block with input X, the (Q, K, V) matrices are generated by multiplying X with learned weight matrices (W_q, W_k, W_v). Each is then split equally across the attention heads in that layer (sketched in the second example after this list)
      • Q is used for the attention computation and then discarded (it is not cached)
      • K and V are stored in the KV Cache
  • Optimization Techniques:
    • Tensor Parallelism (TP) / Pipeline Parallelism (PP)
    • Prefill / Decode priority scheduling
    • Note: the right optimization technique depends on the model and the workload
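
A minimal sketch of the prefill → sample → decode flow described above, under assumed toy settings: forward_step is a hypothetical stand-in for one pass through the Transformer blocks (here it just appends to a list and returns random logits), and the vocabulary size and EOS id are made up. Only the control flow and the growing KV Cache are meant to be meaningful.

```python
import numpy as np

VOCAB_SIZE, EOS_ID, MAX_NEW_TOKENS = 16, 0, 8   # assumed toy values
rng = np.random.default_rng(0)

def forward_step(token_id, kv_cache):
    """Hypothetical single forward pass: append to the cache, return logits."""
    kv_cache.append(token_id)                  # stand-in for appending K/V
    return rng.standard_normal(VOCAB_SIZE)     # token logits

def sample(logits, temperature=1.0):
    """Temperature sampling over the logits."""
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return int(rng.choice(VOCAB_SIZE, p=probs))

prompt = [5, 3, 7]
kv_cache = []

# Prefill: process the entire prompt, populating the KV Cache;
# only the logits of the last prompt token are needed for sampling.
for tok in prompt:
    logits = forward_step(tok, kv_cache)
next_tok = sample(logits)

# Decode: generate one token at a time, growing the cache, until EOS.
generated = []
while next_tok != EOS_ID and len(generated) < MAX_NEW_TOKENS:
    generated.append(next_tok)
    logits = forward_step(next_tok, kv_cache)
    next_tok = sample(logits)

print("generated:", generated, "| cache length:", len(kv_cache))
```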

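The Q/K/V projection and per-head split can be sketched as below. The dimensions, weight matrices, and cache layout are illustrative assumptions (random NumPy arrays, one token, one layer, per-head Python lists), not any particular framework's implementation; the point is that Q is only used transiently while K and V persist in the per-layer cache.

```python
import numpy as np

d_model, n_heads = 8, 2            # toy dimensions (assumed)
d_head = d_model // n_heads
rng = np.random.default_rng(0)

# Learned projection matrices W_q, W_k, W_v (random stand-ins here).
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

# Per-layer KV Cache: one list of K vectors and one of V vectors per head.
kv_cache = {"K": [[] for _ in range(n_heads)], "V": [[] for _ in range(n_heads)]}

def project_and_cache(x):
    """Project one token's hidden state x into Q, K, V, split them equally
    across heads, and append K/V to the cache. Q is returned but never cached."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v        # each has shape (d_model,)
    q_heads = q.reshape(n_heads, d_head)        # equal split per head
    k_heads = k.reshape(n_heads, d_head)
    v_heads = v.reshape(n_heads, d_head)
    for h in range(n_heads):
        kv_cache["K"][h].append(k_heads[h])     # K and V persist in the cache
        kv_cache["V"][h].append(v_heads[h])
    return q_heads                              # Q is used, then discarded

x = rng.standard_normal(d_model)                # hidden state of one token
q_heads = project_and_cache(x)
print(q_heads.shape, len(kv_cache["K"][0]))     # (2, 4) 1
```
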
Decode Step: Attention + MLP

  • Attention Kernel - dependent on request history (KV Cache)
    • The token currently being processed (the query) is compared against all previous tokens (their key vectors) to calculate attention scores
      • Attention score: how relevant each previous token is to the current one
      • The attention scores are used to form a weighted average of the value vectors of all previous tokens
      • The output is a fixed-size vector that represents the token's meaning in the given context
    • The work required is proportional to the length of the KV Cache (the request history); see the sketch at the end of this section
  • MLP Kernel/Layer - independent of the KV Cache
    • The MLP takes its input from the output of the attention layer
    • It doesn't need the KV Cache, since the input vector already encodes the request-history context
    • The amount of computation in the MLP is constant per token (regardless of the token's position in the sequence)
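
To make the cost difference concrete, here is a toy single-head decode step in NumPy. The dimensions and the random contents of the K/V cache are assumptions; what matters is that the attention kernel touches every cached key and value (work grows with the history length), while the MLP only sees the fixed-size attention output (constant work per token).

```python
import numpy as np

d_model, d_ff, cache_len = 8, 32, 5                   # toy sizes (assumed)
rng = np.random.default_rng(0)

K_cache = rng.standard_normal((cache_len, d_model))   # keys of previous tokens
V_cache = rng.standard_normal((cache_len, d_model))   # values of previous tokens
q = rng.standard_normal(d_model)                      # query for the current token

# Attention kernel: one score per cached key, softmax, weighted sum of values.
scores = K_cache @ q / np.sqrt(d_model)               # cost scales with cache_len
weights = np.exp(scores - scores.max())
weights /= weights.sum()
attn_out = weights @ V_cache                          # fixed-size context vector

# MLP kernel: same amount of work regardless of cache_len or token position.
W1 = rng.standard_normal((d_model, d_ff))
W2 = rng.standard_normal((d_ff, d_model))
mlp_out = np.maximum(attn_out @ W1, 0.0) @ W2         # ReLU MLP (illustrative)

print(attn_out.shape, mlp_out.shape)                  # (8,) (8,)
```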