
The True Cost of AI in Production

Empirium Team · 10 min read

The pricing page says $3 per million input tokens. You estimate 10,000 queries per day, do the math, and budget $500 per month. Three months later, you are spending $4,200 per month and cannot figure out where the money went.

API token costs are roughly 40% of total AI expenses. The other 60% is engineering time, infrastructure, monitoring, evaluation, and the debugging sessions that happen at 2 AM when the agent starts hallucinating pricing to your customers.

Here is the real cost breakdown, from projects we have shipped at Empirium.

Beyond Token Pricing

The Full Cost Stack

Most teams track API costs because they appear on the monthly invoice. Everything else gets buried in engineering time or general infrastructure budgets.

| Cost Category | % of Total | Typical Monthly Cost |
| --- | --- | --- |
| LLM API tokens | 35–45% | $500–$15,000 |
| Engineering time (maintenance) | 20–25% | $2,000–$8,000 |
| Infrastructure (servers, DBs, caching) | 10–15% | $200–$2,000 |
| Monitoring and observability | 5–10% | $100–$500 |
| Evaluation and testing | 5–10% | $200–$1,000 |
| Embedding generation | 3–5% | $50–$300 |
| Vector database hosting | 2–5% | $50–$500 |

The engineering time is the hidden killer. Every model update, every prompt change, every edge case that surfaces in production requires someone to investigate, fix, test, and deploy. A "simple" AI chatbot needs 8-15 hours of engineering time per month to keep running well.

Token Math That Matters

Understanding token economics requires knowing your actual usage patterns, not theoretical calculations:

  • System prompt: 500-2,000 tokens, sent with every request. At 10,000 queries/day with a 1,500 token system prompt, that is 15M input tokens/day — $45/day on Claude Sonnet just for the system prompt.
  • Conversation context: Each follow-up message includes the full conversation history. A 5-turn conversation costs 5x more than a single query because of the accumulated context.
  • RAG context injection: Retrieving and injecting 3-5 document chunks adds 1,000-3,000 tokens per query. At $3/1M input tokens, that is $30–$90/day per 10K queries.
  • Output tokens: Typically 2-5x more expensive than input tokens. A verbose agent that generates 500-token responses costs significantly more than one generating 150-token responses.
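
The arithmetic above can be wrapped in a small estimator. A minimal sketch (the prices in the example are illustrative; substitute your provider's current rates):

```python
def monthly_token_cost(
    queries_per_day: int,
    input_tokens_per_query: int,
    output_tokens_per_query: int,
    price_in_per_m: float,   # $ per 1M input tokens
    price_out_per_m: float,  # $ per 1M output tokens
    days: int = 30,
) -> float:
    """Estimate monthly LLM API spend from per-query token counts."""
    daily = (
        input_tokens_per_query * price_in_per_m
        + output_tokens_per_query * price_out_per_m
    ) * queries_per_day / 1_000_000
    return round(daily * days, 2)

# 10K queries/day: 1,500-token system prompt plus ~500 tokens of user input
# and context, 300-token responses, at $3/$15 per 1M tokens
print(monthly_token_cost(10_000, 2_000, 300, 3.00, 15.00))
```

Running your own numbers through a function like this before launch is far cheaper than discovering the conversation-context multiplier on the first invoice.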

Cost Per Use Case

Real cost data from production deployments:

Customer Support Agent

A support agent handling first-line queries, with RAG on your knowledge base and escalation to humans for complex issues.

| Component | Cost (10K queries/day) |
| --- | --- |
| LLM API (Sonnet, avg 2K tokens/query) | $1,800/month |
| Embedding generation (queries + updates) | $45/month |
| Vector database (Qdrant cloud) | $99/month |
| Caching layer (Redis) | $50/month |
| Monitoring (Datadog AI traces) | $150/month |
| Engineering maintenance | $3,000/month |
| Total | $5,144/month |
| Cost per query | $0.017 |

Compare this to a human support agent at $4,000-$6,000/month handling 40-60 tickets per day. The AI handles 10,000 queries for the cost of one human agent.

Content Generation Pipeline

Generating product descriptions, email drafts, or marketing copy at scale.

| Component | Cost (1K documents/day) |
| --- | --- |
| LLM API (GPT-4o, avg 4K tokens/doc) | $3,600/month |
| Quality review sampling (10%, human reviewers) | $800/month |
| Template management infrastructure | $200/month |
| Engineering maintenance | $2,000/month |
| Total | $6,600/month |
| Cost per document | $0.22 |

Voice AI Agent

A phone-based agent handling inbound calls for appointment booking or lead qualification. See our voice AI platform comparison for details.

| Component | Cost (500 calls/day) |
| --- | --- |
| Voice AI platform (Vapi/Retell) | $3,000–$6,000/month |
| LLM API (real-time, low-latency model) | $1,500/month |
| Telephony (SIP trunking) | $500/month |
| Call recording and transcription | $300/month |
| CRM integration infrastructure | $200/month |
| Engineering maintenance | $4,000/month |
| Total | $9,500–$12,500/month |
| Cost per call | $0.63–$0.83 |

Optimization Strategies

Most AI deployments can reduce costs by 60-80% without sacrificing quality. Here are the techniques in order of impact. For a deep dive, see our cost optimization guide.

1. Prompt Optimization (20-40% savings)

Shorter prompts = fewer tokens = lower costs. Most system prompts contain redundant instructions, overly verbose examples, and context that could be cached.

Before: 2,100-token system prompt
After optimization: 890-token system prompt
Savings: 58% reduction in system-prompt tokens on every query

Techniques:

  • Remove "be helpful and professional" type instructions — the model does this by default
  • Replace verbose examples with structured format specifications
  • Use abbreviations and shorthand in system prompts (the model understands them)
  • Move static context to cached prompt prefixes (Anthropic prompt caching cuts cost by 90% on cached tokens)
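
As a sketch of that last technique: with Anthropic's Messages API, a system block can be marked with `cache_control` so the static prefix is billed at the cached rate on repeat requests. A minimal payload builder, with a placeholder model id and prompt (check the current API docs for exact field names and minimum cacheable prompt sizes):

```python
STATIC_SYSTEM_PROMPT = "You are the support agent for Example Corp. ..."  # large, stable prefix

def build_cached_request(user_message: str) -> dict:
    """Build an Anthropic Messages API payload whose system prompt is
    marked for prompt caching, so repeat requests read it at the
    discounted cached-token rate instead of full input price."""
    return {
        "model": "claude-sonnet-4-20250514",  # placeholder model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": STATIC_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # cache this block
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }
```

Only the per-request user message is billed at the full input rate once the prefix is cached.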

2. Semantic Caching (15-30% savings)

Many queries are variations of the same question. "What are your hours?" and "When are you open?" and "What time do you close?" should all return the same cached response.

Implement semantic caching with an embedding similarity threshold of 0.93-0.95. Below 0.93, you risk serving wrong cached responses. Above 0.95, the cache hit rate drops too low to matter.
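
A minimal sketch of this pattern, using plain cosine similarity and toy two-dimensional vectors in place of a real embedding model:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    def __init__(self, threshold: float = 0.94):
        self.threshold = threshold  # inside the 0.93-0.95 band above
        self.entries: list[tuple[list[float], str]] = []

    def get(self, query_vec):
        """Return a cached response if any stored query is similar enough."""
        best = max(self.entries, key=lambda e: cosine(e[0], query_vec), default=None)
        if best and cosine(best[0], query_vec) >= self.threshold:
            return best[1]
        return None

    def put(self, query_vec, response: str):
        self.entries.append((query_vec, response))

cache = SemanticCache()
cache.put([1.0, 0.0], "We are open 9-17, Monday to Friday.")
print(cache.get([0.99, 0.05]))  # near-duplicate query: cache hit
print(cache.get([0.0, 1.0]))    # unrelated query: miss, call the LLM
```

In production the linear scan would be replaced by a vector index, but the threshold logic is the part that decides correctness versus hit rate.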

3. Model Routing (25-40% savings)

Not every query needs Claude Opus or GPT-4o. A lightweight classifier routes queries to the appropriate model:

| Query Complexity | Model | Cost (per 1M input tokens) |
| --- | --- | --- |
| Simple (FAQ, greetings, status) | Haiku / GPT-4o-mini | $0.25 |
| Standard (support, search, summaries) | Sonnet / GPT-4o | $3.00 |
| Complex (analysis, multi-step reasoning) | Opus / o1 | $15.00 |

In practice, 60-70% of queries are simple or standard. Routing them to cheaper models saves 25-40% of total API cost.
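
A crude keyword-based sketch of such a router (production systems typically use a small classifier model, and the hint lists here are invented for illustration):

```python
import re

PRICE_PER_M_INPUT = {"haiku": 0.25, "sonnet": 3.00, "opus": 15.00}

SIMPLE_HINTS = {"hello", "hi", "hours", "status", "price", "thanks"}
COMPLEX_HINTS = {"analyze", "compare", "plan", "why", "debug"}

def route(query: str) -> str:
    """Pick a model tier from keyword hints in the query;
    unmatched queries default to the standard tier."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    if words & COMPLEX_HINTS:
        return "opus"
    if words & SIMPLE_HINTS:
        return "haiku"
    return "sonnet"

def blended_input_cost(queries: list[str], tokens_per_query: int = 2_000) -> float:
    """Total input-token cost in dollars for a batch of routed queries."""
    return sum(
        tokens_per_query * PRICE_PER_M_INPUT[route(q)] / 1_000_000
        for q in queries
    )

print(route("What are your hours?"))           # haiku
print(route("Analyze churn across segments"))  # opus
print(route("Summarize this support ticket"))  # sonnet
```

Misroutes to the cheap tier are the main risk, so log the routing decision with every request and spot-check the simple-tier answers.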

4. Batching (10-20% savings)

For non-real-time workloads (report generation, content creation, data processing), batch API calls offer 50% discounts from most providers. Anthropic's batch API processes requests within 24 hours at half the standard price.

Build vs Buy for Inference

At what scale does self-hosting make sense?

The Crossover Point

Self-hosting a model like Llama 3.1 70B on a single A100 GPU costs approximately $2,000/month (cloud GPU rental). That GPU handles roughly 100-200 requests per minute depending on context length.

At API pricing of $3/1M input tokens and an average query size of 2,000 tokens, sustaining 200 requests/minute costs on the order of $25,000–$50,000/month via API, depending on how close to that peak rate the traffic actually runs. The crossover point is around 30-50 sustained requests per minute; below that, the API is cheaper once you factor in the engineering time of managing GPU infrastructure.
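
The crossover arithmetic can be sketched directly; the GPU cost and the operations overhead below are assumptions taken from the discussion above:

```python
def api_monthly_cost(req_per_min: float, tokens_per_req: int = 2_000,
                     price_per_m: float = 3.00, utilization: float = 1.0) -> float:
    """Monthly API spend for a sustained request rate.
    `utilization` scales down for traffic that does not run at peak 24/7."""
    tokens_per_month = req_per_min * 60 * 24 * 30 * tokens_per_req * utilization
    return tokens_per_month * price_per_m / 1_000_000

GPU_MONTHLY = 2_000.0    # single A100 rental, per the estimate above
OPS_OVERHEAD = 4_000.0   # assumed engineering time for running GPU infrastructure

def self_hosting_cheaper(req_per_min: float) -> bool:
    return api_monthly_cost(req_per_min) > GPU_MONTHLY + OPS_OVERHEAD

print(api_monthly_cost(40))       # sustained 40 req/min, 24/7
print(self_hosting_cheaper(10))   # low volume: stay on the API
print(self_hosting_cheaper(100))  # high volume: self-hosting wins
```

The `utilization` parameter matters: traffic that peaks at 200 req/min but averages half that halves the API bill, which is why the crossover is a range rather than a single number.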

| Scale | Recommendation | Monthly Cost |
| --- | --- | --- |
| < 1,000 queries/day | API provider | $100–$500 |
| 1,000–50,000 queries/day | API with optimization | $500–$5,000 |
| 50,000–500,000 queries/day | Hybrid (API + self-hosted for high-volume routes) | $3,000–$15,000 |
| > 500,000 queries/day | Self-hosted primary, API fallback | $5,000–$20,000 |

For more on self-hosting decisions, see our self-hosted LLM guide.

FAQ

How do I predict costs for a new AI feature? Build a prototype, run it against 1,000 representative queries, and measure actual token usage. Multiply by your expected daily volume. Add 30% buffer for conversation context growth, retry attempts, and edge cases. Then add engineering time: 10-20 hours/month for a simple feature, 40+ hours/month for a complex agent.

How should I set up budget alerts? Set alerts at 80% of expected monthly cost to catch anomalies early. Track daily spend and alert if any single day exceeds 2x the daily average. Monitor per-feature costs separately — a cost spike is easier to diagnose when you know which feature caused it.
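
Those two rules can be sketched as a simple check over daily spend figures:

```python
def budget_alerts(daily_spend: list[float], expected_monthly: float) -> list[str]:
    """Apply the two rules above: alert once month-to-date spend passes
    80% of the expected budget, and on any day exceeding 2x the running
    daily average of the preceding days."""
    alerts = []
    if sum(daily_spend) >= 0.8 * expected_monthly:
        alerts.append("month-to-date spend passed 80% of budget")
    if len(daily_spend) > 1:
        *earlier, today = daily_spend
        if today > 2 * (sum(earlier) / len(earlier)):
            alerts.append("today's spend is more than 2x the daily average")
    return alerts

print(budget_alerts([100, 105, 98, 260], expected_monthly=4_000))
```

Run this per feature, not just per account, so a spike points at the feature that caused it.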

Fine-tuned models: cheaper or more expensive? Fine-tuned models typically cost 1.5-2x base model pricing per token but require shorter prompts (no lengthy system instructions), which can net out to 20-30% savings. The break-even point is around 5,000 queries/day. Below that, the fine-tuning training costs and maintenance overhead exceed the per-query savings.

If you need help right-sizing your AI budget or optimizing an existing deployment, reach out to our team.

Written by Empirium Team
