
Pricing · Sheet 04

Flat monthly pricing. No surprise bills.

Pay for the GPU and hours you use, plus a small honest margin. Lock in the price at deploy time — every deployment gets a precise monthly quote before you confirm.

Range: $50/mo to $6,250+/mo

GPU tiers · Schedule A

GPU tiers

The optimizer selects from these GPUs based on your model size, precision, and priority. Prices shown are per GPU, per hour.

| GPU | Tier | VRAM | Best for | $/hour |
| --- | --- | --- | --- | --- |
| T4 | Budget | 16 GB | Small models, batch jobs | $0.59 |
| L4 | Budget | 24 GB | 7B–8B chat, moderate throughput | $0.80 |
| A10 | Budget | 24 GB | Latency-tolerant serving | $1.10 |
| L40S | Mid | 48 GB | 8B–13B production inference | $1.95 |
| A100 40 GB | Mid | 40 GB | Mid-size models, stable workloads | $2.10 |
| RTX Pro 6000 | Mid | 96 GB | Memory-heavy single-GPU serving | $3.05 |
| H100 | Performance | 80 GB | 30B–70B, demanding latency SLAs | $3.95 |
| H200 | Performance | 141 GB | Long-context 70B, MoE models | $4.54 |
| B200 | Top | 192 GB | Frontier models, lowest latency (8 TB/s memory bandwidth) | $6.25 |
| A100 80 GB | Performance | 80 GB | Legacy workloads requiring A100 | $8.99 |

Example configurations

Representative monthly estimates. Your actual quote is calculated from the live optimizer — this is just a starting point.

8B chatbot · Llama 3 8B Instruct
1× L40S · 24 hours/day · estimated $1,400/month
Always-on single-GPU deployment for a conversational assistant.

70B assistant · Llama 3 70B Instruct
2× H100 · 24 hours/day · estimated $3,100/month
Two-GPU tensor-parallel serving for demanding production loads.

MoE frontier · Mixtral 8×22B
4× H100 · 24 hours/day · estimated $5,400/month
Four-GPU deployment for large mixture-of-experts models.

Estimate your monthly price

A rough approximation using our standard margin and a typical 70% active-time assumption. The real quote is calculated live in the dashboard from your model characteristics.

Estimate: $1,572 / month
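For a back-of-the-envelope version of the same calculation, the sketch below applies the stated formula (underlying GPU-hour cost × projected active hours × margin) together with the 70% active-time assumption. The 15% margin is a placeholder, not our published rate; the live dashboard quote is the authoritative price.

```python
# Back-of-the-envelope monthly estimate. The margin value is a
# placeholder; the dashboard quote is what you actually pay.

HOURS_PER_MONTH = 730    # average hours in a month
ACTIVE_FRACTION = 0.70   # typical active-time assumption from above
MARGIN = 1.15            # placeholder margin, NOT the published rate

def monthly_estimate(usd_per_gpu_hour: float, gpu_count: int = 1,
                     hours_per_day: float = 24.0) -> float:
    """Flat monthly price: GPU-hour cost x projected active hours x margin."""
    projected = HOURS_PER_MONTH * (hours_per_day / 24.0) * ACTIVE_FRACTION
    return usd_per_gpu_hour * gpu_count * projected * MARGIN

# Two H100s at $3.95/hour each, always on:
print(f"${monthly_estimate(3.95, gpu_count=2):,.0f}/month")  # ~$4,642/month
```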

Frequently asked questions

How is pricing calculated?
Every deployment gets a flat monthly price: underlying GPU-hour cost × projected active hours × our margin. When you create a deployment you see the exact quote before confirming, so there are no surprises on your bill.
What happens if I use more than my projected hours?
You get a soft-cap alert inside the dashboard. We do not auto-upgrade you to a more expensive plan. Instead, we surface the overage and let you decide whether to increase your projected hours, rightsize the GPU, or leave things as they are.
Which GPUs do you support?
T4, L4, A10, L40S, A100 40 GB, RTX Pro 6000, H100, H200, B200, and A100 80 GB. The smart optimizer picks the right one automatically based on your model size, precision, and priority (latency, throughput, or balanced).
Can I use my own domain for inference endpoints?
Yes. Register your domain inside the dashboard, point the provided CNAME record at us, and TLS is provisioned automatically. Your endpoint is private and dedicated — no shared queues, no noisy neighbors.
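If you want to confirm the record before sending traffic, a quick lookup with dnspython works. Both hostnames below are placeholders; the dashboard shows the actual CNAME target to use.

```python
# Sanity-check that your custom domain's CNAME points at the right target.
# Requires dnspython (pip install dnspython). Hostnames are placeholders.

import dns.resolver

DOMAIN = "inference.example.com"                 # your custom domain
EXPECTED_TARGET = "endpoints.provider.example."  # value from the dashboard

answers = dns.resolver.resolve(DOMAIN, "CNAME")
for rdata in answers:
    target = rdata.target.to_text()
    status = "OK" if target == EXPECTED_TARGET else "unexpected target"
    print(f"{DOMAIN} -> {target} ({status})")
```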
What inference engines do you use?
Either vLLM (batched, high-throughput) or SGLang (low-latency with speculative decoding). The optimizer picks whichever fits your priority setting, and it also enables EAGLE speculative decoding on compatible models for roughly 25% lower latency.
Which models can I deploy?
Anything on HuggingFace or Ollama — Llama, Mistral, Mixtral, Qwen, Gemma, DeepSeek, and more. Search 100k+ open-source models directly from the dashboard. If a model fits on a supported GPU at your chosen precision, you can run it.
Where are deployments hosted?
Today we deploy to the us-east region on Modal. Additional regions are on the roadmap — reach out if you have specific region or residency requirements.
Can I cancel anytime?
Yes. Stop a deployment from the dashboard and billing pro-rates to the moment you stop. There are no annual commitments or cancellation fees.