Skip to main content
All products

Autoscaling & Replicas

Throughput when you need it.

A single GPU has a ceiling. When you need more throughput, add replicas behind the same endpoint — requests spread across them automatically.

Throughput is priced in tiers, so you scale capacity without trading away the flat, predictable monthly bill.

What you get

No. 01

Replica fan-out

Add replicas to raise concurrent throughput. The endpoint stays the same; load spreads across the fleet.

No. 02

Tiered throughput

Pick a throughput tier that matches your traffic. Move up for a launch, back down when it settles.

No. 03

Predictable bills

More capacity is more replicas at a known rate — never a surprise per-token spike.