Most people who buy a VPS for AI workloads run into the same problems: not enough memory and the model crashes immediately with OOM, or a low-frequency multi-core CPU that makes inference painfully slow, or a plan marketed as SSD that performs like a spinning disk. These issues aren't visible on spec sheets—but once you understand the core indicators, selection becomes much more straightforward.
How AI VPS selection differs from standard web hosting
Website hosting requirements are relatively forgiving—1 core and 1GB RAM can run WordPress as long as the network is stable. AI workloads are fundamentally different:
- Local inference is memory-bound: insufficient RAM and the model won't load at all—no negotiation
- Inference speed depends on single-core CPU performance, not core count
- Model files are loaded frequently, so disk I/O directly affects response latency
- For API-calling scenarios, network latency matters more than bandwidth
These four points are the essential differences between choosing a VPS for AI versus choosing one for general hosting.
Five metrics that determine AI performance on a VPS
RAM: the hard floor for AI workloads
This is the metric you cannot compromise on. Loading a model requires fitting all parameters into memory. If there isn't enough RAM, the process doesn't run slowly—it gets killed immediately with an OOM error.
Practical reference:
- API gateway only, no local model: 2GB sufficient
- 7B quantized model (Q4): minimum 8GB, 16GB recommended
- 13B quantized model: 16GB minimum, 32GB recommended
- Multiple agents running concurrently: add requirements together
Many people buy a 4GB VPS expecting to run a 7B model and find it simply won't work. Of the 4GB total, the OS and Docker consume 1–2GB, leaving 2–3GB, which is less than the roughly 4GB that even the smallest quantized 7B variant needs. This isn't a configuration problem; it's basic arithmetic.
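You can sanity-check this arithmetic on the box itself. The sketch below assumes a roughly 5GB model file (a typical 7B Q4 GGUF) and the 1–2GB OS/Docker baseline described above; adjust the values for your own setup:

```shell
# Assumed values: a ~5GB model file (typical 7B Q4 GGUF) plus a 2GB
# OS/Docker baseline. Adjust MODEL_SIZE_GB for your actual file.
MODEL_SIZE_GB=5
OVERHEAD_GB=2
AVAIL_GB=$(free -g | awk '/^Mem:/ {print $2}')
NEEDED_GB=$((MODEL_SIZE_GB + OVERHEAD_GB))
if [ "$AVAIL_GB" -lt "$NEEDED_GB" ]; then
  echo "Insufficient: ${AVAIL_GB}GB total, need ~${NEEDED_GB}GB"
else
  echo "OK: ${AVAIL_GB}GB total for a ${MODEL_SIZE_GB}GB model"
fi
```

Run this before pulling any model image: if the check fails, no amount of tuning will make the model load.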
RAM requirements by model size
| Model scale | Minimum RAM | Recommended RAM |
|---|---|---|
| API gateway (no local model) | 1–2GB | 2–4GB |
| 3B quantized model | 4GB | 8GB |
| 7B quantized model | 8GB | 16GB |
| 13B quantized model | 16GB | 32GB |
| 34B+ models | 32GB+ | 64GB+ |
CPU: single-core performance beats core count
This is counterintuitive: CPU inference is limited mainly by per-core speed and memory bandwidth, and throughput scales poorly with core count. A 2-core high-frequency CPU can outperform an 8-core low-frequency CPU on inference tasks.
When evaluating a VPS, check or test the CPU clock speed; a Geekbench single-core score is the most direct indicator of inference capability. At equivalent price points, high-frequency CPU instances (such as Vultr's High Frequency series) commonly deliver 30–50% faster inference than standard instances.
Also confirm the virtualization type: KVM only, not OpenVZ. KVM allocates independent resources per instance with stable performance; OpenVZ shares a kernel with serious overselling, meaning advertised specs and actual available resources diverge significantly.
How to test CPU inference capability
# Download and run a Geekbench benchmark helper script
curl -L -o gk5.sh https://rebrand.ly/gk5 && bash gk5.sh
A single-core score above 800 is the baseline; 1200+ delivers a reasonable inference experience; 1500+ is high-frequency instance territory.
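If you'd rather not run the full Geekbench script, sysbench (available in most distro repositories) gives a quick single-thread proxy, and lscpu confirms the advertised clock speed. This is a rough substitute, not a Geekbench equivalent:

```shell
# Single-threaded CPU benchmark; higher "events per second" tracks
# better token-generation speed (install with: apt install -y sysbench)
sysbench cpu --cpu-max-prime=20000 --threads=1 run | grep 'events per second'
# Cross-check the advertised CPU model and clock speed
lscpu | grep -E 'Model name|MHz'
```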
Storage: NVMe is not optional
Model files range from 4GB to 20GB or more. Every cold start requires loading all of that from disk into memory. Standard SSD reads at 300–500MB/s; NVMe reaches 2000MB/s or more—a 2–5x difference in load time.
For inference services, this directly affects recovery time after restarts and vector database query performance. If you're using RAG (retrieval-augmented generation), the I/O pressure from vector search makes NVMe's advantage even more pronounced.
Plans claiming NVMe but delivering poor I/O do exist. Test before committing:
fio --name=test --size=1G --filename=testfile --bs=4k --rw=randrw --iodepth=64 --direct=1 --ioengine=libaio --runtime=30 --time_based
Random 4K read/write speeds below 100MB/s indicate either non-NVMe storage or heavy overselling.
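To estimate cold-start load time directly, you can also measure sequential read speed, the access pattern used when a model file is read into memory. This sketch uses `dd` with `iflag=direct` to bypass the page cache (note that O_DIRECT is unsupported on some filesystems, such as tmpfs):

```shell
# Create a 1GB test file, then time an uncached sequential read of it
dd if=/dev/zero of=testfile bs=1M count=1024 conv=fsync 2>/dev/null
dd if=testfile of=/dev/null bs=1M iflag=direct 2>&1 | tail -1
rm -f testfile
# At 500MB/s a 5GB model loads in ~10s; at 2000MB/s, in ~2.5s
```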
Network: more important for API-calling scenarios than you'd expect
If your setup calls external APIs like OpenAI, Claude, or OpenRouter, latency from the VPS to the API servers directly affects response speed. Calling OpenAI from a US node typically adds 20–50ms; from an Asian node it can be 100–200ms.
Node selection matters for latency-sensitive use cases: a US West Coast node both serves US-facing workloads and keeps latency to US-hosted APIs such as Anthropic's low, while Singapore or Japan nodes serve Asian users better with external API latency still manageable.
For pure local inference with no external API dependency, network latency primarily affects user access speed and is less critical.
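For the API-calling case, you can measure latency from a candidate node yourself with curl's timing variables. api.openai.com is used here as an example endpoint; substitute whichever API you actually call:

```shell
# Time the TCP connect and time-to-first-byte from this node to the API;
# time_starttransfer is the number that matters for perceived responsiveness
curl -o /dev/null -s -w 'connect: %{time_connect}s  ttfb: %{time_starttransfer}s\n' \
  https://api.openai.com/v1/models
```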
Virtualization type: KVM is the baseline requirement
For AI workloads: KVM only.
KVM instances have independently allocated CPU and memory—unaffected by other tenants. OpenVZ shares a kernel with elastic memory allocation: the nominal 2GB you're paying for may be less in practice, and CPU performance gets squeezed during peak hours. Confirm virtualization type before purchasing—most providers state it on the product page. RackNerd, Vultr, DigitalOcean, and Hetzner are all KVM; some extremely cheap promotional VPS use OpenVZ.
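On a trial instance you can verify the virtualization type from inside the VM; `systemd-detect-virt` is present on any systemd-based distro:

```shell
# Prints "kvm" on a KVM instance, "openvz" on an OpenVZ container
systemd-detect-virt
# Fallback: KVM guests expose the hypervisor flag in /proc/cpuinfo
grep -m1 -o hypervisor /proc/cpuinfo
```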
Budget-based recommendations
$3–8/month: suitable for using AI, not running AI
At this price point (1–2 cores, 1–4GB RAM), the only practical AI use case is running an API gateway or lightweight automation tools—VPS handles request forwarding and task scheduling while actual inference happens at an external API. Attempting to run any quantized model locally at this spec level produces an unusable experience. Don't waste time trying.
$8–20/month: the sweet spot for AI deployment
4 cores with 8–16GB RAM enables experimenting with 7B quantized models, deploying lightweight AI agent systems, and running AI automation tools like OpenClaw and n8n without resource pressure. This is the practical starting point for most personal AI projects. Hetzner's CX32 (4 cores, 8GB, €8.99/month) and Vultr's high-frequency 4GB instances both fall in this range with solid value.
$20+/month: production-grade AI deployment
16GB+ RAM enables stable 13B model inference, multi-agent concurrency, or serving as reliable infrastructure for small to mid-scale AI services. For larger models, consider GPU instances—Vultr and Lambda Labs offer hourly billing on GPU machines without requiring long-term commitment.
Test before committing
Regardless of budget, test the provider's official IP before purchasing:
# Test latency
ping -c 20 provider_test_IP
# Check routing quality
mtr -r -c 50 provider_test_IP
After purchase, run complete benchmarks during the 30-day refund window—confirm CPU score, disk I/O, and network performance match expectations before deciding to keep or return.
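One way to run that full sweep in a single pass is the community YABS script, which bundles a Geekbench CPU score, fio disk tests, and iperf3 network tests. It pipes a remote script into bash, so review it first if that concerns you:

```shell
# Yet-Another-Bench-Script: CPU (Geekbench), disk (fio), network (iperf3)
curl -sL yabs.sh | bash
```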
The priority ordering
For AI VPS selection: RAM > single-core CPU performance > storage type (NVMe) > network > price.
Price last doesn't mean it's unimportant—it means that without the preceding hard requirements being met, a lower price provides no value. A machine with insufficient memory won't run what you need regardless of cost.