All products
Autoscaling & Replicas
Throughput when you need it.
A single GPU has a ceiling. When you need more throughput, add replicas behind the same endpoint — requests spread across them automatically.
Throughput is priced in tiers, so you scale capacity without trading away the flat, predictable monthly bill.
What you get
No. 01
Replica fan-out
Add replicas to raise concurrent throughput. The endpoint stays the same; load spreads across the fleet.
No. 02
Tiered throughput
Pick a throughput tier that matches your traffic. Move up for a launch, back down when it settles.
No. 03
Predictable bills
More capacity is more replicas at a known rate — never a surprise per-token spike.