Study on vLLM
References
I keep a separate record for paper summaries, but the vLLM (PagedAttention) paper helped me understand LLMs greatly.
vLLM with Different Decoding Scenarios
vLLM was a fruitful paper to read, as it applies traditional OS paging techniques to the KV cache. Normally, each request's KV cache is stored in GPU memory as one region whose size is fixed up front by the serving system. The KV cache size per token is given by the following equation (the factor of 2 is because we need to allocate both a key and a value vector):
\[\text{KV\_Cache}_{\text{per\_token}} = 2 \times \text{hidden\_state\_size} \times \text{number\_of\_layers} \times \text{byte\_granularity}\]
Note that every attention layer (and every head within it) stores its own key and value vectors, so the cache grows linearly with each layer and head we add. This is huge!
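As a quick sanity check of this formula, here is a minimal sketch in Python. The model dimensions (5120 hidden size, 40 layers, FP16) are my own assumption, roughly matching OPT-13B, not values taken from the text above.

```python
def kv_cache_bytes_per_token(hidden_state_size: int,
                             number_of_layers: int,
                             byte_granularity: int) -> int:
    # Factor of 2: one key vector and one value vector per layer.
    return 2 * hidden_state_size * number_of_layers * byte_granularity

# Assumed OPT-13B-like dimensions served in FP16 (2 bytes per element).
per_token = kv_cache_bytes_per_token(hidden_state_size=5120,
                                     number_of_layers=40,
                                     byte_granularity=2)
print(per_token / 1024)  # -> 800.0, i.e. roughly 800 KB of KV cache per generated token
```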
In a traditional serving system, the KV cache must be allocated in consecutive memory blocks for efficiency, and this creates a lot of internal and external fragmentation. Internal fragmentation occurs because a contiguous chunk is statically allocated for the request's maximum possible length, and most of it goes unused. External fragmentation comes from the memory allocator itself (e.g., a buddy allocator that only hands out power-of-two sizes). Just as in an OS, paging avoids both problems, which is why PagedAttention came to be.
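To make the paging idea concrete, here is a minimal sketch (my own illustration, not vLLM's code) of a per-sequence block table: the KV cache is split into fixed-size blocks, and logical block indices are mapped to whatever physical blocks happen to be free, so no contiguous allocation is needed.

```python
BLOCK_SIZE = 16  # tokens per KV block (hypothetical value)

class BlockAllocator:
    """Hands out physical KV blocks from a free list, in any order."""
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()      # no contiguity requirement

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)

class Sequence:
    """Tracks one request's logical-to-physical block mapping."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []   # logical index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is grabbed only when the last one fills up,
        # so internal fragmentation is bounded by BLOCK_SIZE - 1 slots.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1
```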
This is all great, but how does it apply to existing, more complex decoding algorithms?
1. Parallel Sampling
Parallel sampling in LLM inference is a decoding technique that generates multiple candidate continuations of the same prompt, letting the user pick their favorite among the outputs. A reference counter on each physical block tracks how many logical blocks (from samples of the same request) map to it. Each sample of the request gets its own logical block table, and those tables can initially point to the same physical blocks holding the prompt. When the samples start generating different tokens, copy-on-write allocates a new physical block for the writing sample and the reference count of the original physical block is decremented by 1, as sketched below.
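A minimal sketch of this reference-counting and copy-on-write bookkeeping, assuming a simple free list of physical blocks (my own illustration; names like `PagedKV` are not from vLLM):

```python
class PagedKV:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.ref_count: dict[int, int] = {}   # physical block -> number of owners

    def allocate(self) -> int:
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> int:
        self.ref_count[block] += 1             # another logical block maps here
        return block

    def write(self, block: int) -> int:
        if self.ref_count[block] == 1:
            return block                        # sole owner: write in place
        # Copy-on-write: decrement the original and move to a fresh block.
        self.ref_count[block] -= 1
        new_block = self.allocate()
        # ... the KV data of `block` would be copied into `new_block` here ...
        return new_block

kv = PagedKV(num_blocks=8)
prompt_block = kv.allocate()
sample_a = [prompt_block]                # two samples share the prompt's block
sample_b = [kv.share(prompt_block)]
sample_a[-1] = kv.write(sample_a[-1])    # outputs diverge -> new physical block
```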
2. Beam Search
Beam search is another LLM inference decoding technique that searches for the most likely output sequence for a given input. It differs from parallel sampling in that beam search expands each candidate sequence in the beam by considering all possible next tokens and retains the top-k most probable sequences. From a memory perspective, this means many more blocks are shared, not just the first few as in parallel sampling. Traditionally, every expansion of every beam requires its KV cache to be copied and allocated anew, but now most blocks can be shared using reference counters.
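A sketch of the memory effect of beam pruning under this scheme (again my own illustration, not vLLM internals): forking a beam shares every block of its parent, and dropping a beam only returns blocks whose reference count falls to zero.

```python
import itertools

free_list: list[int] = []
ref_count: dict[int, int] = {}
_new_id = itertools.count()

def allocate_block() -> int:
    block = free_list.pop() if free_list else next(_new_id)
    ref_count[block] = 1
    return block

def fork_beam(parent_table: list[int]) -> list[int]:
    # A new beam candidate initially shares *all* of its parent's blocks.
    for block in parent_table:
        ref_count[block] += 1
    return list(parent_table)

def drop_beam(table: list[int]) -> None:
    # Pruned beams give up their references; a block is physically freed
    # only once no surviving beam maps to it.
    for block in table:
        ref_count[block] -= 1
        if ref_count[block] == 0:
            free_list.append(block)
```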
3. Shared Prefix
System prompts are concatenated with user prompts, and the same system prompt is shared across many different user requests. Instead of re-allocating the KV cache of this system prompt per request, we can reserve physical blocks that hold the KV cache of the shared prefix and map each request's corresponding logical blocks to those same physical blocks.
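A minimal sketch of prefix sharing, under the assumption that the prefix's KV blocks are cached and looked up by the prefix tokens (`compute_kv_blocks` is a hypothetical stand-in, not a vLLM function):

```python
import itertools

_block_ids = itertools.count()

def compute_kv_blocks(tokens: tuple[int, ...]) -> list[int]:
    # Hypothetical stand-in for running prefill and allocating physical KV
    # blocks; here we simply pretend each token gets one block id.
    return [next(_block_ids) for _ in tokens]

prefix_cache: dict[tuple[int, ...], list[int]] = {}   # prefix tokens -> blocks

def blocks_for_prompt(system_tokens: tuple[int, ...],
                      user_tokens: tuple[int, ...]) -> list[int]:
    if system_tokens not in prefix_cache:
        # Computed and cached once; every later request reuses these blocks.
        prefix_cache[system_tokens] = compute_kv_blocks(system_tokens)
    # The shared prefix blocks come first; only the user part needs new blocks.
    return prefix_cache[system_tokens] + compute_kv_blocks(user_tokens)
```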
vLLM Swapping and Recomputation
1. Swapping
- Swap Out: When GPU memory runs out, vLLM selects a sequence to preempt and copies that sequence's KV cache blocks from the GPU to CPU RAM.
- Swap In: After other requests finish and free up GPU memory, the preempted sequence's KV cache blocks are copied back from the CPU to the GPU to resume processing (see the sketch below).
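The two moves above could look roughly like this (a sketch; `gpu.copy_block_to_host`, `gpu.copy_block_to_device`, and `gpu.free_block` are hypothetical methods standing in for real device transfers):

```python
cpu_swap_space: dict[str, list[bytes]] = {}   # sequence id -> swapped-out blocks

def swap_out(seq_id: str, gpu_blocks: dict[str, list[int]], gpu) -> None:
    # Copy every KV block of the preempted sequence to CPU RAM, then free
    # the GPU blocks so other requests can use them.
    blocks = gpu_blocks.pop(seq_id)
    cpu_swap_space[seq_id] = [gpu.copy_block_to_host(b) for b in blocks]
    for b in blocks:
        gpu.free_block(b)

def swap_in(seq_id: str, gpu_blocks: dict[str, list[int]], gpu) -> None:
    # Once GPU memory is available again, move the blocks back and resume decoding.
    payload = cpu_swap_space.pop(seq_id)
    gpu_blocks[seq_id] = [gpu.copy_block_to_device(data) for data in payload]
```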
2. Recomputation
- Preemption: When a sequence is preempted, its KV cache is deleted.
- Rescheduling: To resume the sequence, the system treats the original prompt and all the tokens generated so far as a single new prompt, then recomputes the entire KV cache for this combined sequence in one step (see the sketch below).
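Sketched in code (with a hypothetical `model.prefill` standing in for the real engine call), resuming via recomputation is just one prefill over the concatenated tokens:

```python
def resume_by_recompute(prompt_tokens: list[int],
                        generated_tokens: list[int],
                        model) -> None:
    # The prompt plus everything generated so far is treated as a fresh prompt;
    # a single prefill pass rebuilds the KV cache deleted on preemption.
    combined = prompt_tokens + generated_tokens
    model.prefill(combined)   # hypothetical engine call, not a vLLM API
```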
3. Eviction Policy
vLLM uses an all-or-nothing eviction policy: either all of a sequence's KV cache blocks are preempted or none are. This is because all of a sequence's blocks are needed to generate its next token.
The choice between swapping and recomputation depends on the system’s configuration:
- Swapping is more efficient with large block sizes because it can take better advantage of the PCIe bandwidth for large data transfers.
- Recomputation is more efficient with small block sizes, as swapping many small blocks leads to inefficient data transfers. A rough model of this trade-off is sketched below.
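A back-of-the-envelope model of that trade-off (the fixed per-transfer overhead and all numbers you would plug in are my own assumptions, not measurements from the paper):

```python
def swap_time_seconds(num_blocks: int, bytes_per_block: float,
                      pcie_bytes_per_s: float,
                      per_transfer_overhead_s: float) -> float:
    # Many small blocks pay the fixed per-transfer overhead many times over,
    # which is why recomputation tends to win at small block sizes.
    return num_blocks * (per_transfer_overhead_s
                         + bytes_per_block / pcie_bytes_per_s)

def prefer_swapping(num_blocks: int, bytes_per_block: float,
                    pcie_bytes_per_s: float, per_transfer_overhead_s: float,
                    recompute_seconds: float) -> bool:
    # Swap only if moving the blocks over PCIe is cheaper than one prefill.
    return swap_time_seconds(num_blocks, bytes_per_block, pcie_bytes_per_s,
                             per_transfer_overhead_s) < recompute_seconds
```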