Tools

This is a running list of tools I use, kept so I can track the details I keep forgetting. I will add to it every time I use something new.


perf_analyzer

I used this tool to measure TTFT (Time to First Token) and ITL (Inter-Token Latency) when deploying a simple inference model (Qwen or Llama) on an AWS cluster. For Ubuntu 22.04 or below, you have to follow the instructions here, which tell you to containerize it. Make sure you connect the container to the host network so that it can reach an endpoint served locally on the host (hence --net host below).

docker build -f Dockerfile.perf-analyzer -t perf-analyzer:local .
docker run --rm --net host perf-analyzer:local \
  genai-perf profile -m Qwen/Qwen3-0.6B -u http://localhost:8000 \
  --endpoint-type chat --concurrency 10 \
  --synthetic-input-tokens-mean 2000 --output-tokens-mean 1000 \
  --streaming --request-count 15
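
The profile above assumes an OpenAI-compatible chat endpoint is already listening on http://localhost:8000 on the host. A minimal sketch of serving the same Qwen model, assuming vLLM is installed (any OpenAI-compatible server on port 8000 would work just as well):

# hypothetical server launch for the benchmark target above
vllm serve Qwen/Qwen3-0.6B --port 8000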

AWS

EC2 (Elastic Compute Cloud)

Provides GPU instances in a placement group of your choice. These can be used like a normal cluster or server. I used it for testing Dynamo and for performance testing of disaggregated serving.

Placement Groups (a CLI sketch follows this list):
  1. Cluster - physically close instances for low-latency communication between them (placed in a rack)
  2. Partition - instances in one partition do not share underlying hardware with instances in other partitions
    • gives more flexibility in terms of placement
  3. Spread - places a small group of instances across distinct underlying hardware to reduce correlated failures
    • this is the default if not specified
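
For reference, a hedged AWS CLI sketch of creating a cluster placement group and launching GPU instances into it (the group name, AMI ID, and instance type are placeholders, not what I actually used):

# create a cluster placement group (instances packed physically close together)
aws ec2 create-placement-group --group-name my-cluster-pg --strategy cluster

# launch two GPU instances into that placement group
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 --instance-type p4d.24xlarge --count 2 \
  --placement GroupName=my-cluster-pg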

EKS (Elastic Kubernetes Service)

Another option for when I need to deploy an inference model to production. Not necessary for benchmarking or performance testing.
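
If I do end up needing it, a rough sketch of creating a small GPU-backed cluster with eksctl (the name, region, and instance type are assumptions, not a recommendation):

# hypothetical example: one managed node group of GPU instances
eksctl create cluster --name inference-demo --region us-west-2 \
  --nodegroup-name gpu-nodes --node-type g5.xlarge --nodes 2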