
Pricing · Sheet 04

Flat monthly pricing. No surprise bills.

Pay for the GPU and hours you use, plus a small honest margin. Lock in the price at deploy time — every deployment gets a precise monthly quote before you confirm.

Range: $50/mo to $6,250+/mo

GPU tiers · Schedule A

GPU tiers

The optimizer selects from these GPUs based on your model size, precision, and priority. Prices shown are per GPU, per hour.

| GPU | Tier | VRAM | Best for | $/hour |
| --- | --- | --- | --- | --- |
| T4 | Budget | 16 GB | Small models, batch jobs | $0.59 |
| L4 | Budget | 24 GB | 7B–8B chat, moderate throughput | $0.80 |
| A10 | Budget | 24 GB | Latency-tolerant serving | $1.10 |
| L40S | Mid | 48 GB | 8B–13B production inference | $1.95 |
| A100 40 GB | Mid | 40 GB | Mid-size models, stable workloads | $2.10 |
| RTX Pro 6000 | Mid | 96 GB | Memory-heavy single-GPU serving | $3.05 |
| H100 | Performance | 80 GB | 30B–70B, demanding latency SLAs | $3.95 |
| H200 | Performance | 141 GB | Long-context 70B, MoE models | $4.54 |
| B200 | Top | 192 GB | Frontier models, lowest latency (8 TB/s memory bandwidth) | $6.25 |
| A100 80 GB | Performance | 80 GB | Legacy workloads requiring A100 | $8.99 |

Example configurations

Representative monthly estimates. Your actual quote is calculated from the live optimizer — this is just a starting point.

8B chatbot · Llama 3 8B Instruct
1× L40S · 24 hours/day · estimated $1,400/month
Always-on single-GPU deployment for a conversational assistant.

70B assistant · Llama 3 70B Instruct
2× H100 · 24 hours/day · estimated $3,100/month
Two-GPU tensor-parallel serving for demanding production loads.

MoE frontier · Mixtral 8×22B
4× H100 · 24 hours/day · estimated $5,400/month
Four-GPU deployment for large mixture-of-experts models.

Estimate your monthly price

A rough approximation using our standard margin and a typical 70% active-time assumption. The real quote is calculated live in the dashboard from your model characteristics.

Estimate: $1,572 / month
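For a back-of-the-envelope version of the same calculation, the sketch below applies the stated formula (underlying GPU-hour cost × projected active hours × margin) together with the 70% active-time assumption. The 15% margin is a placeholder, not our published rate; the live dashboard quote is the authoritative price.

```python
# Back-of-the-envelope monthly estimate. The margin value is a
# placeholder; the dashboard quote is what you actually pay.

HOURS_PER_MONTH = 730    # average hours in a month
ACTIVE_FRACTION = 0.70   # typical active-time assumption from above
MARGIN = 1.15            # placeholder margin, NOT the published rate

def monthly_estimate(usd_per_gpu_hour: float, gpu_count: int = 1,
                     hours_per_day: float = 24.0) -> float:
    """Flat monthly price: GPU-hour cost x projected active hours x margin."""
    projected = HOURS_PER_MONTH * (hours_per_day / 24.0) * ACTIVE_FRACTION
    return usd_per_gpu_hour * gpu_count * projected * MARGIN

# Two H100s at $3.95/hour each, always on:
print(f"${monthly_estimate(3.95, gpu_count=2):,.0f}/month")  # ~$4,642/month
```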

Frequently asked questions

How is pricing calculated?
Every deployment gets a flat monthly price: underlying GPU-hour cost × projected active hours × our margin. When you create a deployment you see the exact quote before confirming, so there are no surprises on your bill.
What happens if I use more than my projected hours?
You get a soft-cap alert inside the dashboard. We do not auto-upgrade you to a more expensive plan. Instead, we surface the overage and let you decide whether to increase your projected hours, rightsize the GPU, or leave things as they are.
Which GPUs do you support?
T4, L4, A10, L40S, A100 40 GB, RTX Pro 6000, H100, H200, B200, and A100 80 GB. The smart optimizer picks the right one automatically based on your model size, precision, and priority (latency, throughput, or balanced).
Can I use my own domain for inference endpoints?
Yes. Register your domain inside the dashboard, point the provided CNAME record at us, and TLS is provisioned automatically. Your endpoint is private and dedicated — no shared queues, no noisy neighbors.
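If you want to confirm the record before sending traffic, a quick lookup with dnspython works. Both hostnames below are placeholders; the dashboard shows the actual CNAME target to use.

```python
# Sanity-check that your custom domain's CNAME points at the right target.
# Requires dnspython (pip install dnspython). Hostnames are placeholders.

import dns.resolver

DOMAIN = "inference.example.com"                 # your custom domain
EXPECTED_TARGET = "endpoints.provider.example."  # value from the dashboard

answers = dns.resolver.resolve(DOMAIN, "CNAME")
for rdata in answers:
    target = rdata.target.to_text()
    status = "OK" if target == EXPECTED_TARGET else "unexpected target"
    print(f"{DOMAIN} -> {target} ({status})")
```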
What inference engines do you use?
Either vLLM (batched, high-throughput) or SGLang (low-latency with speculative decoding). The optimizer picks whichever fits your priority setting, and it also enables EAGLE speculative decoding on compatible models for roughly 25% lower latency.
Which models can I deploy?
Anything on HuggingFace or Ollama — Llama, Mistral, Mixtral, Qwen, Gemma, DeepSeek, and more. Search 100k+ open-source models directly from the dashboard. If a model fits on a supported GPU at your chosen precision, you can run it.
Where are deployments hosted?
Today we deploy to the us-east region on Modal. Additional regions are on the roadmap — reach out if you have specific region or residency requirements.
Can I cancel anytime?
Yes. Stop a deployment from the dashboard and billing pro-rates to the moment you stop. There are no annual commitments or cancellation fees.