Choosing the right GPU for your model
A practical map from model size and precision to the cheapest GPU that still hits your latency target — from a T4 to a B200.
By The Seattle Compute Team · Infrastructure
Picking a GPU is mostly a memory problem, then a latency problem, then a money problem — in that order. Get the memory wrong and nothing runs; get the latency wrong and your users feel it; get the money wrong and your finance team feels it. Here is the order we walk through every time someone deploys a model.
Step 1 — Will the weights fit?
Start with the dumbest, most important question: does the model fit in VRAM with room for the KV cache? A rough rule for transformer weights is bytes ≈ params × bytes-per-param. At FP16 that is two bytes per parameter, so an 8B model needs roughly 16 GB just for weights before any context.
- FP16 / BF16: 2 bytes/param. The default. Best quality, biggest footprint.
- FP8: 1 byte/param. Roughly half the memory of FP16, with a small quality hit on most models.
- INT4 / AWQ / GPTQ: ~0.5 bytes/param. Shrinks a 70B to roughly 35 GB of weights, so it fits on a single 80 GB card, with a quality trade you should measure, not assume.
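Here is a minimal sketch of that rule, using the bytes-per-parameter values from the list above. The function name and the decimal-GB convention (1 GB = 10^9 bytes) are our assumptions for illustration, and real quantized checkpoints carry a little extra for scales and any layers left unquantized.

```python
# Approximate bytes per parameter for each precision listed above.
BYTES_PER_PARAM = {
    "fp16": 2.0,
    "bf16": 2.0,
    "fp8": 1.0,
    "int4": 0.5,  # AWQ / GPTQ style 4-bit weights, ignoring scale overhead
}


def weights_gb(params_billion: float, precision: str) -> float:
    """Estimate weight memory in decimal GB: params x bytes-per-param."""
    return params_billion * BYTES_PER_PARAM[precision.lower()]


if __name__ == "__main__":
    for precision in ("fp16", "fp8", "int4"):
        print(f"70B @ {precision}: ~{weights_gb(70, precision):.0f} GB")
```

Treat the result as a floor: it covers weights only, before any KV cache or activations.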
Then leave headroom for the KV cache. Long context and high concurrency both eat VRAM fast: a 70B model serving 32k-token requests at a large batch size can spend more memory on cache than on weights.
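To put a rough number on that, the sketch below estimates KV cache size for standard attention: one key and one value vector per token, per layer, per KV head. The model shape used in the example (80 layers, 8 KV heads, head dimension 128) is an assumption typical of 70B-class models with grouped-query attention, not a property of any specific deployment, and the estimate is a worst case where every request holds its full context.

```python
def kv_cache_gb(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    concurrent_requests: int,
    bytes_per_value: float = 2.0,  # FP16 / BF16 cache entries
) -> float:
    """Estimate KV cache size in decimal GB for standard attention."""
    # Each token stores one key and one value vector per layer per KV head.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    total_bytes = bytes_per_token * context_tokens * concurrent_requests
    return total_bytes / 1e9


if __name__ == "__main__":
    # Assumed shape for a 70B-class model with grouped-query attention.
    per_request = kv_cache_gb(80, 8, 128, 32_768, 1)
    at_batch_16 = kv_cache_gb(80, 8, 128, 32_768, 16)
    print(f"one 32k request:  ~{per_request:.1f} GB")
    print(f"16 concurrent:    ~{at_batch_16:.0f} GB")
```

Under these assumptions one 32k-token request holds about 10.7 GB of cache, and 16 of them together need about 172 GB, which is more than the roughly 140 GB the same model's FP16 weights occupy.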
Step 2 — What latency do you owe your users?
Two numbers matter: time-to-first-token (TTFT) and tokens-per-second. A chat UI lives and dies on TTFT — anything under ~300 ms feels instant. A batch summarization job does not care about TTFT at all and only wants throughput per dollar.
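Both numbers are easy to compute once you have timestamps for a streamed response. The helper below is a sketch, not a benchmark harness: `latency_metrics` is a hypothetical name, and it assumes you record the send time and each token's arrival time yourself, from the same clock (for example `time.perf_counter()` on the client).

```python
from typing import Sequence, Tuple


def latency_metrics(
    request_sent: float, token_times: Sequence[float]
) -> Tuple[float, float]:
    """Return (ttft_seconds, decode_tokens_per_second) for one streamed response."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_sent
    # The decode rate excludes the first token, so prefill time is not counted.
    decode_window = token_times[-1] - token_times[0]
    if len(token_times) < 2 or decode_window <= 0:
        return ttft, 0.0
    return ttft, (len(token_times) - 1) / decode_window


if __name__ == "__main__":
    # Illustrative timestamps in seconds, not measured data.
    ttft, tps = latency_metrics(10.0, [10.5, 11.0, 11.5])
    print(f"TTFT {ttft * 1000:.0f} ms, {tps:.1f} tokens/s")
```

For a chat workload, compare the first value against your TTFT budget at realistic concurrency. For a batch job, the second value divided by the hourly rate is the number to watch.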
Match the GPU to the job, not to the logo. A latency-tolerant batch pipeline on an H100 is money on fire.
Step 3 — Now optimize for cost
Once you know what fits and what you can tolerate, pick the cheapest GPU that clears both bars. For most 7B–13B production chat that is an L40S; for 70B with real latency SLAs it is one or two H100s; for frontier and long-context MoE it is an H200 or B200.
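The selection step itself can be sketched as a filter followed by a minimum. Everything here is illustrative: the `Gpu` class and `cheapest_fit` are hypothetical names, the hourly rates are placeholders rather than real prices, and `meets_latency` stands in for the set of GPUs that passed your own latency benchmark. Splitting a model across cards is treated as simple VRAM addition, which ignores tensor-parallel overhead.

```python
import math
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class Gpu:
    name: str
    vram_gb: int
    hourly_rate: float  # PLACEHOLDER value, not a real price


# VRAM figures are the commonly listed capacities. The rates are made up
# for illustration only; substitute the actual rates from your provider.
CATALOG = [
    Gpu("T4", 16, 1.0),
    Gpu("L40S", 48, 3.0),
    Gpu("H100", 80, 6.0),
    Gpu("H200", 141, 8.0),
]


def cheapest_fit(
    needed_gb: float, meets_latency: Set[str], max_gpus: int = 2
) -> Optional[Tuple[Gpu, int]]:
    """Return (gpu, count) with the lowest hourly cost, or None if nothing fits."""
    best = None
    for gpu in CATALOG:
        if gpu.name not in meets_latency:
            continue  # failed the latency bar in your own benchmarks
        count = math.ceil(needed_gb / gpu.vram_gb)
        if count > max_gpus:
            continue  # would need more cards than allowed
        cost = count * gpu.hourly_rate
        if best is None or cost < best[0]:
            best = (cost, gpu, count)
    return None if best is None else (best[1], best[2])


if __name__ == "__main__":
    choice = cheapest_fit(98, {"H100", "H200"})
    if choice:
        gpu, count = choice
        print(f"{count}x {gpu.name}")
```

With these placeholder rates, a 98 GB requirement resolves to a single H200 rather than two H100s. With different rates the answer can flip, which is why the comparison is worth running rather than guessing.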
# Quick VRAM sanity check before you deploy
params_billion = 70
bytes_per_param = 1.0 # FP8
weights_gb = params_billion * bytes_per_param
kv_overhead = 1.4 # ~40% headroom for KV cache + activations
needed_gb = weights_gb * kv_overhead
print(f"~{needed_gb:.0f} GB VRAM needed") # ~98 GB -> H200 (141GB) or 2x H100
If that feels like a lot of bookkeeping: it is, and it is exactly the part we automate. You give us a model, a precision, and a priority, and the optimizer returns the GPU plus a flat monthly quote before you confirm anything.
Want the full table of GPUs, VRAM, and per-hour rates? It lives on the pricing page, and every deployment shows you the exact quote before you commit.