Let’s start with a quick cost comparison — numbers don’t lie:
| Option | Monthly Cost | Annual Cost |
|---|---|---|
| ChatGPT Plus | $20 | $240 |
| OpenAI API (moderate usage) | $50–200 | $600–2400 |
| Self-hosted VPS (CPU inference) | $10–30 | $120–360 |
| Self-hosted VPS (GPU) | $30–150 | $360–1800 |
The more you use AI, the more cost-effective self-hosting becomes. If you only use it a few times a day, a subscription is usually cheaper. But for heavy daily use, team sharing, or when dealing with sensitive data you don’t want to send to third parties, self-hosting starts to make a lot of sense.
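The break-even point is easy to sanity-check. A quick sketch in shell, using assumed midpoint figures from the table above (the setup-effort valuation is illustrative, not a quote):

```shell
#!/usr/bin/env bash
# Rough break-even: months after which self-hosting beats API usage.
# All figures are illustrative midpoints from the cost table, not quotes.
api_monthly=125      # OpenAI API, moderate-usage midpoint ($50-200)
vps_monthly=90       # mid-range GPU VPS ($30-150)
setup_hours=8        # one-off setup effort, valued at...
hourly_rate=50       # ...an assumed $50/hour

setup_cost=$(( setup_hours * hourly_rate ))
monthly_saving=$(( api_monthly - vps_monthly ))
# Ceiling division: round partial months up
months_to_break_even=$(( (setup_cost + monthly_saving - 1) / monthly_saving ))
echo "Monthly saving: \$${monthly_saving}; break-even after ~${months_to_break_even} months"
```

With these assumptions the saving is $35/month and setup pays for itself in about a year; plug in your own numbers before deciding.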
## What Hardware Do You Actually Need to Run Large Models?
One of the most common mistakes people make is buying a VPS only to discover it doesn’t have enough memory to run the model.
### Hardware Requirements by Model Size

**7B Models (Entry Level)**
- CPU inference: 8 GB RAM (no GPU needed)
- Inference speed: 1–5 tokens/second (usable but noticeably slow)
- Best for: Personal AI assistants, low-frequency chats, simple Agent tasks
**14B Models (Mainstream Choice)**
- Minimum: 16 GB RAM (GPU recommended)
- GPU inference: 6 GB+ VRAM
- Best for: AI customer service, content generation, scenarios needing better language quality
**32B+ Models (Advanced)**
- Minimum: 32 GB RAM, 16 GB+ VRAM for GPU inference
- Best for: Enterprise AI, complex reasoning tasks
**Key rule:** VRAM determines whether GPU inference is possible; RAM determines whether CPU inference works. If you’re short on both, nothing will run properly.
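A quick way to check whether a model fits before renting anything: weights take roughly parameter count times bytes per weight, plus a couple of GB for the KV cache and runtime. A rough sketch (the ~0.56 bytes/weight figure for 4-bit GGUF quantization and the flat 2 GB overhead are approximations):

```shell
# Estimate memory needed for a quantized model.
# Q4-style GGUF quantization stores ~0.56 bytes per weight (approximate).
estimate_gb() {
  local params_b=$1 bytes_per_weight=$2 overhead_gb=2
  awk -v p="$params_b" -v b="$bytes_per_weight" -v o="$overhead_gb" \
    'BEGIN { printf "%.1f", p * b + o }'
}
echo "7B  Q4: $(estimate_gb 7 0.56) GB"   # ~5.9 GB -> fits the 8 GB RAM tier
echo "14B Q4: $(estimate_gb 14 0.56) GB"  # ~9.8 GB -> needs the 16 GB tier
echo "32B Q4: $(estimate_gb 32 0.56) GB"  # ~19.9 GB -> needs the 32 GB tier
```

The estimates line up with the tiers above; long context windows inflate the KV-cache overhead beyond the flat 2 GB used here.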
## How Big Is the Gap Between CPU and GPU Inference?
It’s not just 2× or 3× — it’s orders of magnitude. The same 7B model might run at 1–5 tokens/second on CPU, but 40–80 tokens/second on an A100 GPU — a 10–50× difference.
For a single user doing casual chats, 1–5 tokens/second on CPU is barely acceptable. For multi-user services or real-time conversations, CPU speed is usually too slow and GPU becomes necessary.
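To make that concrete, here is the pure generation time for a typical 500-token reply at different decode speeds (prompt-processing time ignored):

```shell
# Seconds to generate a 500-token reply at various decode speeds
reply_tokens=500
for tps in 2 10 60; do
  secs=$(( reply_tokens / tps ))
  echo "${tps} tokens/s -> ${secs}s per reply"
done
```

At 2 tokens/s a reply takes over four minutes; at 60 tokens/s it takes about eight seconds. That difference is what makes GPU inference feel interactive.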
## Four Platforms Worth Considering

### Vultr — Easiest GPU VPS to Get Started With
Vultr offers A100, L40, and other GPU instances with hourly billing and no long-term commitment. With over 30 global data centers and good OS images, it’s very friendly for AI workloads. CUDA and common frameworks work out of the box.
Real-world Ollama deployment is extremely simple:
```shell
# After connecting to the GPU instance
curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3.1:8b
```
The whole process takes under 5 minutes with almost no extra setup.
Since it’s hourly billing, you can spin up a GPU instance for testing and shut it down immediately when done — no wasted money on idle time. Great choice for individual developers and small teams who want to pay only when using it.
Best for: Quick GPU testing, early-stage projects with uncertain usage, or needing global node coverage.
### DigitalOcean — Most Beginner-Friendly GPU Option
DigitalOcean’s GPU Droplets support H100, L40S, RTX 6000 Ada, and similar cards. They come pre-loaded with PyTorch, CUDA, and popular AI frameworks, so you barely need to configure anything.
Their control panel and documentation are the best among these providers, making troubleshooting much easier. For users with zero GPU server experience, DO has the lowest learning curve.
Note: GPU resources sometimes require approval and aren’t instantly available. Pricing isn’t the cheapest for the same GPU, but reliability is excellent.
Best for: Complete beginners, stable long-running AI services, or building SaaS products that need reliable inference.
### Hetzner — Best Value for High-RAM CPU (No GPU)
Hetzner doesn’t offer GPU instances, but their high-RAM CPU servers deliver the best price-to-performance ratio. A 64 GB RAM machine typically costs €60–80 per month — 40–50% cheaper than comparable U.S. providers.
On their high-spec machines, quantized 7B models run decently on CPU. With optimized frameworks like llama.cpp (multi-threaded), you can reach 8–12 tokens/second, which is acceptable for low-concurrency personal use.
Best for: Tight budgets, users okay with CPU-only inference, mainly serving European users, or projects that don’t need GPU training.
### How to Run CPU Inference on Hetzner
I recommend using llama.cpp — it’s significantly more efficient than Ollama for CPU:
```shell
# Install build dependencies (git is needed for the clone below)
apt install -y build-essential cmake git

# Build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_NATIVE=ON
cmake --build build --config Release -j"$(nproc)"

# Run the model as an HTTP server
./build/bin/llama-server -m ./models/llama-3.1-8b-q4.gguf \
  --host 0.0.0.0 --port 8080 \
  --n-gpu-layers 0 \
  --threads "$(nproc)"
```
`--n-gpu-layers 0` forces pure CPU inference, and `--threads` should match your CPU core count.
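Before exposing the server, it is worth measuring what your machine actually delivers rather than trusting published numbers. llama.cpp ships a benchmarking tool, `llama-bench`, built alongside `llama-server` (the model path below is a placeholder; adjust it for your setup):

```shell
# Benchmark prompt-processing (pp) and token-generation (tg) throughput
# for the model on this machine's CPU.
./build/bin/llama-bench \
  -m ./models/llama-3.1-8b-q4.gguf \
  -t "$(nproc)"
```

The `tg` rows are the tokens/second you will actually see in chat-style use.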
### RunPod — Most Flexible Pay-as-You-Go GPU
RunPod is a GPU compute marketplace that aggregates many providers. You can rent RTX 3090, A100, H100, etc. by the hour, usually 30–50% cheaper than big cloud providers. It supports custom Docker images and has plenty of ready-made AI templates.
Best for: Short-term testing, occasional heavy inference tasks, or using high-end GPUs without long-term commitment.
Not ideal for: Production services that need strong SLAs — RunPod’s reliability isn’t as consistent as Vultr or DigitalOcean.
## Quick Deployment: Get Running in 3 Minutes
No matter which platform you choose, Ollama is currently the easiest way to run local models:
```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama run llama3.1:8b   # 8B model
ollama run qwen2.5:14b   # 14B model
```
Ollama automatically detects and uses any available GPU — no manual CUDA setup required.
It serves an HTTP API on port 11434 by default: the native endpoints live under `/api/`, and an OpenAI-compatible endpoint is available under `/v1/`. You can connect any compatible client:
```shell
# Test the API
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Hello", "stream": false}'
```
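Ollama also serves an OpenAI-compatible endpoint under `/v1`, which is what most existing clients and SDKs speak; for example:

```shell
# Same model via the OpenAI-compatible chat endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

Point any OpenAI SDK at `http://localhost:11434/v1` with a dummy API key and it works unchanged.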
For external access, put Nginx in front with authentication — never expose port 11434 directly to the internet:
```nginx
# Example Nginx config with Basic Auth (TLS certificate paths are placeholders)
server {
    listen 443 ssl;
    server_name your-domain.com;

    ssl_certificate     /etc/letsencrypt/live/your-domain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/your-domain.com/privkey.pem;

    location / {
        auth_basic "Restricted";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://localhost:11434;
    }
}
```
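The `.htpasswd` file referenced above can be created with the `htpasswd` tool from the `apache2-utils` package (the username and password here are placeholders):

```shell
# Create the Basic Auth credentials file used by the Nginx config
apt install -y apache2-utils
htpasswd -cbB /etc/nginx/.htpasswd admin 'choose-a-strong-password'

# Validate and reload Nginx
nginx -t && systemctl reload nginx
```

`-c` creates the file, `-b` takes the password on the command line, and `-B` uses bcrypt hashing.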
## Real-World Inference Speed Reference
| Platform / Config | Model | Inference Speed (tokens/s) |
|---|---|---|
| Vultr A100 | 7B Q4 | 40–80 |
| DigitalOcean L40S | 7B Q4 | 35–70 |
| Hetzner 32-core CPU | 7B Q4 | 8–15 |
| Basic 4-core VPS | 7B Q4 | 1–3 |
The gap between GPU and CPU inference is massive — it’s not just a few times faster, it’s orders of magnitude.
## When Should You Self-Host vs. Keep Subscribing?

**Worth self-hosting if:**
- You use AI more than 2 hours per day
- Multiple team members need access
- You’re handling sensitive data you don’t want sent to third parties
- You want to run fine-tuned or custom models
**Better to keep subscribing if:**
- You only use AI a few times per week
- You want zero maintenance hassle
- You need top-tier performance (GPT-4o or Claude 3.5 Sonnet level) that local models still can’t match
Open-source models have become very capable — Llama 3.1 70B is close to GPT-4 on many tasks — but running it properly on GPU requires at least 40 GB VRAM, which isn’t cheap. For most individuals, quantized 7B and 14B models remain the practical sweet spot.
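That 40 GB figure is straightforward arithmetic: at roughly 0.56 bytes per weight for 4-bit quantization, a 70B model's weights alone occupy about 39 GB before any KV cache (both constants here are approximations):

```shell
# Rough total memory for a 70B model at 4-bit quantization:
# 70B weights at ~0.56 bytes each, plus ~3 GB KV cache/overhead
total_gb=$(awk 'BEGIN { printf "%.0f", 70 * 0.56 + 3 }')
echo "70B at q4: roughly ${total_gb} GB"
```

That lands just above a single 40 GB A100, which is why 70B deployments usually need an 80 GB card or a multi-GPU split.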
## Practical Advice
If you’re thinking about self-hosting but aren’t sure whether it’s worth it, start by renting a GPU instance on RunPod or Vultr for a day or two. Measure your actual usage and calculate the real cost before committing to a long-term setup. Don’t buy a big-memory machine without testing first — make sure your needs actually match the hardware before making a serious investment.