The Real Cost of Running AI in Production
The pricing page says $3 per million input tokens. You estimate 10,000 queries per day, do the math, and budget $500 per month. Three months later, you are spending $4,200 per month and cannot figure out where the money went.
API token costs are roughly 40% of total AI expenses. The other 60% is engineering time, infrastructure, monitoring, evaluation, and the debugging sessions that happen at 2 AM when the agent starts hallucinating pricing to your customers.
Here is the real cost breakdown, from projects we have shipped at Empirium.
Beyond Token Pricing
The Full Cost Stack
Most teams track API costs because they appear on the monthly invoice. Everything else gets buried in engineering time or general infrastructure budgets.
| Cost Category | % of Total | Typical Monthly Cost |
|---|---|---|
| LLM API tokens | 35-45% | $500–$15,000 |
| Engineering time (maintenance) | 20-25% | $2,000–$8,000 |
| Infrastructure (servers, DBs, caching) | 10-15% | $200–$2,000 |
| Monitoring and observability | 5-10% | $100–$500 |
| Evaluation and testing | 5-10% | $200–$1,000 |
| Embedding generation | 3-5% | $50–$300 |
| Vector database hosting | 2-5% | $50–$500 |
The engineering time is the hidden killer. Every model update, every prompt change, every edge case that surfaces in production requires someone to investigate, fix, test, and deploy. A "simple" AI chatbot needs 8-15 hours of engineering time per month to keep running well.
Token Math That Matters
Understanding token economics requires knowing your actual usage patterns, not theoretical calculations:
- System prompt: 500-2,000 tokens, sent with every request. At 10,000 queries/day with a 1,500 token system prompt, that is 15M input tokens/day — $45/day on Claude Sonnet just for the system prompt.
- Conversation context: Each follow-up message resends the full conversation history, so a 5-turn conversation costs well over 5x a single query once the accumulated context is counted.
- RAG context injection: Retrieving and injecting 3-5 document chunks adds 1,000-3,000 tokens per query. At 10K queries/day that is 10M-30M extra input tokens — $30-90/day at Sonnet input pricing.
- Output tokens: Typically 2-5x more expensive than input tokens. A verbose agent that generates 500-token responses costs significantly more than one generating 150-token responses.
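The arithmetic above can be wrapped in a quick estimator. The prices are illustrative Sonnet-class rates ($3/1M input, $15/1M output) — plug in your own provider's numbers:

```python
# Back-of-envelope daily cost model for the token math above.
def daily_cost(queries_per_day: int,
               system_prompt_tokens: int,
               context_tokens: int,
               output_tokens: int,
               input_price_per_m: float = 3.00,
               output_price_per_m: float = 15.00) -> float:
    input_tokens = queries_per_day * (system_prompt_tokens + context_tokens)
    out_tokens = queries_per_day * output_tokens
    return (input_tokens / 1e6) * input_price_per_m + (out_tokens / 1e6) * output_price_per_m

# 10K queries/day, 1,500-token system prompt, 2,000 tokens of RAG context,
# 300-token responses:
print(round(daily_cost(10_000, 1_500, 2_000, 300), 2))  # → 150.0, i.e. $150/day
```

Note that output tokens dominate the marginal cost of verbosity: the 300-token responses here contribute $45/day on their own.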
Cost Per Use Case
Real cost data from production deployments:
Customer Support Agent
A support agent handling first-line queries, with RAG on your knowledge base and escalation to humans for complex issues.
| Component | Cost (10K queries/day) |
|---|---|
| LLM API (Sonnet, avg 2K tokens/query) | $1,800/month |
| Embedding generation (queries + updates) | $45/month |
| Vector database (Qdrant cloud) | $99/month |
| Caching layer (Redis) | $50/month |
| Monitoring (Datadog AI traces) | $150/month |
| Engineering maintenance | $3,000/month |
| Total | $5,144/month |
| Cost per query | $0.017 |
Compare this to a human support agent at $4,000-$6,000/month handling 40-60 tickets per day. The AI handles 10,000 queries for the cost of one human agent.
Content Generation Pipeline
Generating product descriptions, email drafts, or marketing copy at scale.
| Component | Cost (1K documents/day) |
|---|---|
| LLM API (GPT-4o, avg 4K tokens/doc) | $3,600/month |
| Quality review sampling (10%) | $800/month (human reviewers) |
| Template management infrastructure | $200/month |
| Engineering maintenance | $2,000/month |
| Total | $6,600/month |
| Cost per document | $0.22 |
Voice AI Agent
A phone-based agent handling inbound calls for appointment booking or lead qualification. See our voice AI platform comparison for details.
| Component | Cost (500 calls/day) |
|---|---|
| Voice AI platform (Vapi/Retell) | $3,000–$6,000/month |
| LLM API (real-time, low latency model) | $1,500/month |
| Telephony (SIP trunking) | $500/month |
| Call recording and transcription | $300/month |
| CRM integration infrastructure | $200/month |
| Engineering maintenance | $4,000/month |
| Total | $9,500–$12,500/month |
| Cost per call | $0.63–$0.83 |
Optimization Strategies
Most AI deployments can reduce costs by 60-80% without sacrificing quality. Here are the techniques in order of impact. For a deep dive, see our cost optimization guide.
1. Prompt Optimization (20-40% savings)
Shorter prompts = fewer tokens = lower costs. Most system prompts contain redundant instructions, overly verbose examples, and context that could be cached.
- Before: 2,100-token system prompt
- After optimization: 890-token system prompt
- Savings: 58% reduction in input tokens per query
Techniques:
- Remove "be helpful and professional" type instructions — the model does this by default
- Replace verbose examples with structured format specifications
- Use abbreviations and shorthand in system prompts (the model understands them)
- Move static context to cached prompt prefixes (Anthropic prompt caching cuts cost by 90% on cached tokens)
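The last technique looks like this in practice. The sketch below follows Anthropic's Messages API request shape, with the static system prompt marked for caching; the model name and prompt text are placeholders, and an actual call needs the `anthropic` SDK and an API key:

```python
# Sketch: mark the static system prompt with cache_control so repeat requests
# read it at the discounted cached-token rate instead of full input price.
STATIC_SYSTEM_PROMPT = "You are a support agent for Acme. Policies: ..."  # placeholder

def build_request(user_message: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",  # example model name
        "max_tokens": 512,
        "system": [
            {
                "type": "text",
                "text": STATIC_SYSTEM_PROMPT,
                # Everything up to this marker is cached; subsequent requests
                # within the cache lifetime pay the cached-read rate.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }
```

Only the static prefix should sit above the cache marker — anything that changes per request (retrieved chunks, user message) goes after it, or the cache never hits.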
2. Semantic Caching (15-30% savings)
Many queries are variations of the same question. "What are your hours?" and "When are you open?" and "What time do you close?" should all return the same cached response.
Implement semantic caching with an embedding similarity threshold of 0.93-0.95. Below 0.93, you risk serving wrong cached responses. Above 0.95, the cache hit rate drops too low to matter.
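A minimal sketch of the idea, with plain vectors standing in for real embeddings — a production version would call an embedding model and back the lookup with a vector store rather than a linear scan:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    """Serve a cached answer when a query embedding is near a stored one."""

    def __init__(self, threshold: float = 0.94):  # the 0.93-0.95 band above
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def get(self, embedding):
        best, best_sim = None, 0.0
        for emb, answer in self.entries:
            sim = cosine(embedding, emb)
            if sim > best_sim:
                best, best_sim = answer, sim
        return best if best_sim >= self.threshold else None

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))

cache = SemanticCache()
cache.put([1.0, 0.0], "We open at 9 AM.")
print(cache.get([0.99, 0.14]))  # near-duplicate query: cache hit
print(cache.get([0.0, 1.0]))   # unrelated query: None, falls through to the LLM
```

On a miss, you call the model and `put` the result; the threshold is the dial between wrong-answer risk and hit rate, exactly as described above.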
3. Model Routing (25-40% savings)
Not every query needs Claude Opus or GPT-4o. A lightweight classifier routes queries to the appropriate model:
| Query Complexity | Model | Cost (per 1M input tokens) |
|---|---|---|
| Simple (FAQ, greetings, status) | Haiku / GPT-4o-mini | $0.25 |
| Standard (support, search, summaries) | Sonnet / GPT-4o | $3.00 |
| Complex (analysis, multi-step reasoning) | Opus / o1 | $15.00 |
In practice, 60-70% of queries are simple or standard. Routing them to cheaper models saves 25-40% of total API cost.
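The routing layer itself can be tiny. The sketch below uses a keyword heuristic as a stand-in for a real lightweight classifier model; the tier table mirrors the pricing above, and the keyword lists are illustrative:

```python
# Route each query to a cost tier; a production router would use a small
# classifier model rather than keywords.
TIERS = {
    "simple":   {"model": "haiku",  "price_per_m": 0.25},
    "standard": {"model": "sonnet", "price_per_m": 3.00},
    "complex":  {"model": "opus",   "price_per_m": 15.00},
}

def route(query: str) -> str:
    words = set(query.lower().replace("?", "").replace(",", "").split())
    if words & {"analyze", "compare", "diagnose", "why"}:
        return "complex"
    if words & {"hello", "hi", "hours", "status", "thanks"}:
        return "simple"
    return "standard"

print(route("Hi, what are your hours?"))          # simple  -> $0.25/1M
print(route("Analyze churn across segments"))     # complex -> $15.00/1M
print(route("Summarize this support ticket"))     # standard -> $3.00/1M
```

The classifier's own cost is negligible (a few tokens on the cheapest model, or a local model), so nearly all of the routed savings flow through.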
4. Batching (10-20% savings)
For non-real-time workloads (report generation, content creation, data processing), batch API calls offer 50% discounts from most providers. Anthropic's batch API processes requests within 24 hours at half the standard price.
Build vs Buy for Inference
At what scale does self-hosting make sense?
The Crossover Point
Self-hosting a model like Llama 3.1 70B (quantized to fit a single 80 GB card) on one A100 GPU costs approximately $2,000/month (cloud GPU rental). That GPU handles roughly 100-200 requests per minute depending on context length.
At API pricing of $3/1M tokens and average query size of 2,000 tokens, 200 requests/minute costs approximately $25,000/month via API. The crossover point is around 30-50 requests per minute — below that, API is cheaper when you factor in engineering time for managing GPU infrastructure.
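The break-even rate is easy to compute for your own numbers. The sketch below uses the figures from this section; the engineering-overhead figure is an assumption standing in for the GPU-ops time mentioned above:

```python
# Break-even request rate for self-hosting vs API, using the section's figures.
def breakeven_rpm(gpu_monthly: float,
                  eng_monthly: float,
                  tokens_per_request: int = 2_000,
                  api_price_per_m: float = 3.00) -> float:
    minutes_per_month = 60 * 24 * 30
    cost_per_request = tokens_per_request / 1e6 * api_price_per_m  # $0.006
    return (gpu_monthly + eng_monthly) / (minutes_per_month * cost_per_request)

# $2,000/month A100 rental plus an assumed ~$8,000/month of GPU-ops
# engineering time:
print(round(breakeven_rpm(2_000, 8_000), 1))  # ≈ 38.6 requests/minute sustained
```

Note the result assumes sustained 24/7 traffic; bursty daytime-only load shifts the crossover upward, which is why the hybrid tier in the table below exists.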
| Scale | Recommendation | Monthly Cost |
|---|---|---|
| < 1,000 queries/day | API provider | $100–$500 |
| 1,000–50,000 queries/day | API with optimization | $500–$5,000 |
| 50,000–500,000 queries/day | Hybrid (API + self-hosted for high-volume routes) | $3,000–$15,000 |
| > 500,000 queries/day | Self-hosted primary, API fallback | $5,000–$20,000 |
For more on self-hosting decisions, see our self-hosted LLM guide.
FAQ
How do I predict costs for a new AI feature? Build a prototype, run it against 1,000 representative queries, and measure actual token usage. Multiply by your expected daily volume. Add 30% buffer for conversation context growth, retry attempts, and edge cases. Then add engineering time: 10-20 hours/month for a simple feature, 40+ hours/month for a complex agent.
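That recipe reduces to a few lines. The hourly engineering rate here is an assumption — substitute your own loaded cost:

```python
# The estimation recipe above: measured tokens x volume x price, plus a 30%
# buffer for context growth and retries, plus engineering time.
def forecast_monthly(avg_tokens_per_query: int,
                     queries_per_day: int,
                     price_per_m: float = 3.00,
                     buffer: float = 0.30,
                     eng_hours: int = 15,
                     eng_rate: float = 150.0) -> float:
    api = queries_per_day * 30 * avg_tokens_per_query / 1e6 * price_per_m
    return api * (1 + buffer) + eng_hours * eng_rate

# 2,000 measured tokens/query from the prototype run, 5,000 queries/day:
print(round(forecast_monthly(2_000, 5_000), 2))  # ≈ $3,420/month
```

The point of running 1,000 real queries first is that `avg_tokens_per_query` is measured, not guessed — it is the input most teams get wrong by 2-3x.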
How should I set up budget alerts? Set alerts at 80% of expected monthly cost to catch anomalies early. Track daily spend and alert if any single day exceeds 2x the daily average. Monitor per-feature costs separately — a cost spike is easier to diagnose when you know which feature caused it.
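Both alert rules fit in one small check, run daily against your spend log — a sketch, with thresholds taken from the answer above:

```python
# Daily budget check: 80%-of-budget pace alert plus single-day spike alert.
def check_alerts(daily_spend: list[float], monthly_budget: float) -> list[str]:
    alerts = []
    if sum(daily_spend) >= 0.8 * monthly_budget:
        alerts.append("month-to-date spend passed 80% of budget")
    if len(daily_spend) > 1:
        avg = sum(daily_spend[:-1]) / len(daily_spend[:-1])
        if daily_spend[-1] > 2 * avg:
            alerts.append("today's spend exceeds 2x daily average")
    return alerts

# A spike day against a ~$100/day baseline, well under budget:
print(check_alerts([100.0, 110.0, 95.0, 240.0], 5_000))
```

Run one instance per feature, as the answer suggests — a spike in an aggregate number tells you something is wrong, but not where.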
Fine-tuned models: cheaper or more expensive? Fine-tuned models typically cost 1.5-2x base model pricing per token but require shorter prompts (no lengthy system instructions), which can net out to 20-30% savings. The break-even point is around 5,000 queries/day. Below that, the fine-tuning training costs and maintenance overhead exceed the per-query savings.
If you need help right-sizing your AI budget or optimizing an existing deployment, reach out to our team.