The Default Is Expensive
The standard enterprise AI stack starts with cloud API subscriptions: OpenAI for inference, Pinecone for vectors, Datadog for monitoring, a managed Kubernetes cluster for orchestration. Each service is reasonable in isolation. Together, they create a cost structure that scales linearly with usage — and a dependency chain that no single team controls.
A typical mid-scale deployment (10,000 monthly inference requests, vector search, monitoring, and orchestration) runs $500–2,200/month in SaaS costs before any custom engineering.
The alternative — self-hosted open-source infrastructure — is technically mature, operationally manageable, and dramatically cheaper. But it requires a different engineering mindset.
The Cost Arithmetic
The math is straightforward once you separate fixed costs from variable costs.
Cloud API model — costs scale with usage:
- 10,000 inference requests/month via GPT-4: $300–1,000
- Managed vector database: $50–200
- Monitoring + observability: $50–500
- Orchestration platform: $100–500
- Total: $500–2,200/month (and rising with usage)
Self-hosted model — costs are fixed:
- Inference server (16GB RAM, CPU-optimized): ~$22/month
- Operations server (4GB RAM, general purpose): ~$9/month
- Open-source models, vector DB, monitoring: $0
- Total: $31/month (regardless of usage)
The marginal cost of the 10,000th request is the same as the first: zero. The infrastructure is already provisioned.
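The arithmetic above can be sketched as a quick break-even calculation. The per-request rate and base costs below are illustrative midpoints of the ranges quoted above, not vendor pricing:

```python
# Break-even sketch: fixed self-hosted cost vs. usage-priced cloud APIs.
# Rates are illustrative midpoints of the ranges above, not vendor pricing.

SELF_HOSTED_FIXED = 31.0    # $/month: inference server + operations server
CLOUD_BASE = 200.0          # $/month: vector DB + monitoring + orchestration (low end)
CLOUD_PER_REQUEST = 0.03    # $/request: ~$300 per 10,000 GPT-4-class requests

def monthly_cost_cloud(requests: int) -> float:
    """Cloud cost scales linearly with request volume."""
    return CLOUD_BASE + CLOUD_PER_REQUEST * requests

def monthly_cost_self_hosted(requests: int) -> float:
    """Self-hosted cost is flat regardless of volume."""
    return SELF_HOSTED_FIXED

def break_even_requests() -> int:
    """Volume at which self-hosting becomes cheaper than cloud."""
    # Solve fixed = base + rate * n for n. With these numbers the cloud
    # base fees alone exceed the self-hosted total, so break-even is 0.
    n = (SELF_HOSTED_FIXED - CLOUD_BASE) / CLOUD_PER_REQUEST
    return max(0, int(n) + 1)

for volume in (1_000, 10_000, 100_000):
    print(volume, monthly_cost_cloud(volume), monthly_cost_self_hosted(volume))
```

With these assumptions the fixed-cost side wins at any volume; the interesting output is how fast the gap widens as requests grow.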
When Self-Hosting Makes Sense
Self-hosted AI isn't the right choice for every workload. It excels in specific scenarios:
High-volume, predictable workloads. Document processing, classification, summarization, operational queries — tasks that run continuously at known volumes. The fixed-cost model means unit economics improve with scale.
Operations and internal tools. Status checks, task management, workflow automation, knowledge search — internal systems where 3B–7B parameter models provide adequate quality. These workloads don't need GPT-4-class reasoning.
Data-sensitive environments. When compliance requirements restrict data from leaving your infrastructure, self-hosted inference eliminates the "data in transit to third-party API" risk entirely. The data never leaves your network.
Where it doesn't make sense: Research tasks requiring frontier-model reasoning, one-off analyses, or workloads with unpredictable spikes where cloud elasticity matters. The best architectures use both — self-hosted for the baseline, cloud APIs for the peaks.
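The hybrid pattern described here reduces to a routing rule: send routine, high-volume task types to the local backend and escalate the rest. A minimal sketch, where the task categories and escalation criteria are illustrative assumptions rather than a prescribed taxonomy:

```python
# Sketch of a baseline/peak routing rule for a hybrid deployment.
# Task categories and escalation criteria are illustrative assumptions.

LOCAL_TASKS = {"classification", "summarization", "status_query", "knowledge_search"}

def choose_backend(task: str, needs_frontier_reasoning: bool = False) -> str:
    """Route predictable internal workloads locally; escalate the rest."""
    if needs_frontier_reasoning:
        return "cloud"      # frontier-model reasoning required
    if task in LOCAL_TASKS:
        return "local"      # 3B-7B open-source models suffice
    return "cloud"          # unknown or spiky work rides cloud elasticity

print(choose_backend("summarization"))
print(choose_backend("research", needs_frontier_reasoning=True))
```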
The Open-Source Stack Is Production-Ready
The tooling has matured significantly. A production-grade self-hosted AI stack can be assembled from battle-tested open-source components:
Model serving — Ollama provides a simple runtime for open-source models. Pull a model, serve it via HTTP. Models from 1B to 70B+ parameters, depending on hardware.
Model routing — LiteLLM provides an OpenAI-compatible API proxy that routes requests to any backend. Applications code against the OpenAI API format. The backend can be swapped from local models to cloud APIs without changing application code.
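Because the proxy speaks the OpenAI chat-completions format, application code only ever builds one request shape, and swapping backends is a base-URL change. A minimal sketch of that single request shape; the internal hostname and model name are placeholders for your own deployment:

```python
# Sketch: applications target one OpenAI-compatible request shape; the proxy
# decides which backend serves it. The internal URL and model name below are
# placeholders, not real endpoints.

def chat_request(model: str, user_message: str) -> dict:
    """Build an OpenAI-format chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

# Same payload, different deployments: only the base URL changes.
LOCAL_PROXY = "http://litellm.internal:4000/v1/chat/completions"  # placeholder
CLOUD_API = "https://api.openai.com/v1/chat/completions"

payload = chat_request("llama3:8b", "Summarize this ticket backlog.")
print(payload["model"], len(payload["messages"]))
```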
Vector search — Qdrant runs as a single binary with an HTTP API. For collections under 10,000 vectors, a 512MB memory allocation is more than sufficient. No managed service required.
Monitoring — Prometheus + Grafana provide the same metrics dashboards as managed monitoring services. Self-hosted, no per-metric billing, no data retention limits.
Orchestration — Coolify or similar PaaS tools handle Docker deployment, rollbacks, and environment management. One-click deploys without Kubernetes complexity.
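One way these components might be wired together is a single Docker Compose file. This is an illustrative sketch only; image tags, ports, and volumes should be verified against each project's documentation before deploying:

```yaml
# Illustrative wiring of the stack; verify images and ports against each
# project's own documentation.
services:
  ollama:
    image: ollama/ollama
    volumes: ["ollama:/root/.ollama"]
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports: ["4000:4000"]
    depends_on: [ollama]
  qdrant:
    image: qdrant/qdrant
    ports: ["6333:6333"]
  prometheus:
    image: prom/prometheus
  grafana:
    image: grafana/grafana
    ports: ["3000:3000"]
volumes:
  ollama: {}
```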
The Engineering Trade-Off
Self-hosting trades money for attention. The cost savings are real, but so is the operational overhead:
What you gain:
- Fixed, predictable costs regardless of usage volume
- Complete data ownership — nothing leaves your network
- No vendor lock-in — every component is swappable
- No rate limits, usage caps, or surprise billing
- Full control over model versions, configurations, and update timing
What you accept:
- Server maintenance (updates, security patches, disk management)
- Incident response (the 3 AM crash is your problem)
- Capacity planning (you size the hardware, not the cloud provider)
- Model management (you choose and update models yourself)
The overhead is manageable for teams with basic Linux operations skills. It's prohibitive for teams that have never SSH'd into a server. Be honest about your team's capabilities before committing.
Architecture Principles
Three principles guide effective self-hosted AI infrastructure:
Separate compute concerns. Operations services (project management, automation, monitoring) need stable, predictable resources. AI inference is CPU-hungry and bursty. Run them on separate hardware. The cost of a second server is less than the cost of one resource contention incident.
Private by default. Every admin interface, AI endpoint, and internal service should be private-network-only. Use a mesh VPN (like Tailscale) for encrypted, zero-config connectivity between all nodes. Only expose what must be public, behind a CDN with DDoS protection.
Proxy everything. Put an OpenAI-compatible proxy between your applications and your models. This gives you one integration point for all AI capabilities, the ability to swap backends without code changes, and a single place to add logging, rate limiting, and cost tracking.
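The value of that single choke point can be sketched as a thin wrapper: one place where logging and rate limiting live, with the backend passed in as a swappable callable. The class name, limits, and backend stub below are illustrative, not a real library API:

```python
# Sketch: one proxy choke point where logging and rate limiting live, so
# applications and models stay decoupled. Names and limits are illustrative.
import time

class AIProxy:
    """Single integration point in front of interchangeable backends."""

    def __init__(self, backend, rate_limit_per_min: int = 60):
        self.backend = backend              # swappable: local model or cloud API
        self.rate_limit = rate_limit_per_min
        self.calls: list[float] = []        # timestamps in the sliding window
        self.request_log: list[str] = []    # one place to audit every request

    def complete(self, prompt: str) -> str:
        now = time.monotonic()
        self.calls = [t for t in self.calls if now - t < 60]
        if len(self.calls) >= self.rate_limit:
            raise RuntimeError("rate limit exceeded")
        self.calls.append(now)
        self.request_log.append(prompt)
        return self.backend(prompt)

# Swap backends without touching callers:
local_model = lambda p: f"[local] {p[:20]}"
proxy = AIProxy(local_model, rate_limit_per_min=2)
print(proxy.complete("Summarize the incident report."))
```

Cost tracking slots into the same place: count tokens or requests inside `complete` and no application code ever changes.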
The Decision Framework
Choose self-hosted when:
- Monthly inference volume exceeds 5,000 requests
- Workloads are predictable (not spiky)
- Data sensitivity requires on-premises processing
- Your team can maintain Linux servers
- You want to eliminate variable AI costs entirely
Choose cloud APIs when:
- You need frontier-model reasoning (GPT-4, Claude, Gemini class)
- Workloads are unpredictable or bursty
- Your team lacks operations experience
- Time-to-market matters more than unit economics
- You're still validating whether AI adds value for this use case
The most effective architectures use both: self-hosted infrastructure for the 80% of workloads where open-source models suffice, cloud APIs for the 20% where frontier capabilities are required.
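The framework above can be condensed into a checklist function. The thresholds mirror the bullets and are rules of thumb, not hard cutoffs:

```python
# Rule-of-thumb encoding of the decision framework above; the thresholds
# mirror the bullets and are heuristics, not hard cutoffs.

def recommend_deployment(
    monthly_requests: int,
    workload_predictable: bool,
    data_sensitive: bool,
    team_can_run_linux: bool,
    needs_frontier_models: bool,
    still_validating: bool,
) -> str:
    """Return 'self-hosted' or 'cloud' per the checklist above."""
    # Hard reasons to stay on cloud APIs come first.
    if needs_frontier_models or still_validating or not team_can_run_linux:
        return "cloud"
    # Compliance forces on-premises processing.
    if data_sensitive:
        return "self-hosted"
    # Otherwise: volume plus predictability decide.
    if monthly_requests > 5_000 and workload_predictable:
        return "self-hosted"
    return "cloud"

print(recommend_deployment(10_000, True, False, True, False, False))  # self-hosted
```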