The Operator's Guide to AI Cost Optimization
Your AI feature works. Users love it. Then you get the monthly invoice: $8,400. Your projected budget was $2,000. You have two options: cut the feature or cut the costs.
Most AI deployments are 3-5x more expensive than they need to be. Not because the models are overpriced — because the architecture is wasteful. Redundant tokens, missed cache opportunities, expensive models handling simple queries, and synchronous processing where batching would suffice.
Here is the optimization stack we apply at Empirium, in order of impact. Combined, these techniques reduce costs by 60-80% without measurable quality loss.
The Cost Optimization Stack
Five layers, each independent, each compounding:
| Layer | Technique | Typical Savings | Implementation Effort |
|---|---|---|---|
| 1 | Prompt optimization | 20-40% | 1-2 days |
| 2 | Semantic caching | 15-35% | 3-5 days |
| 3 | Model routing | 25-45% | 5-10 days |
| 4 | Batching | 10-20% | 2-3 days |
| 5 | Self-hosting (high-volume routes) | 40-70% | 2-4 weeks |
Layers 1-3 are quick wins. Layer 4 applies to async workloads. Layer 5 only makes sense at scale. Start from the top.
Prompt Optimization
The fastest, cheapest cost reduction. Most system prompts are 2-3x longer than they need to be.
Cut Redundant Instructions
Before (2,100 tokens):
You are a helpful, professional, and knowledgeable customer support assistant
for Empirium. You should always be polite, respond accurately, and provide
helpful information. Never make up information. If you don't know something,
say so. Always try to be as helpful as possible while maintaining a
professional tone...
After (380 tokens):
You are Empirium's support assistant. Answer from the provided context only.
If the answer isn't in the context, say "I don't have that information" and
suggest contacting [email protected]. Be concise and direct.
Same behavior. 82% fewer tokens. At 10,000 queries/day, that saves ~17M input tokens/day — $51/day on Claude Sonnet.
Use Prompt Caching
Anthropic and OpenAI both offer prompt caching. Static portions of your prompt (system message, tool definitions, static context) are cached on the provider side; Anthropic bills cache reads at a 90% discount, and OpenAI automatically discounts cached input tokens.
import Anthropic from '@anthropic-ai/sdk';
const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: systemPrompt, // 1,000 tokens, cached after the first request
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [{ role: 'user', content: userQuery }],
});
For a 1,000-token system prompt at 10,000 queries/day:
- Without caching: $30/day input cost
- With caching: $3/day (first request full price, subsequent 90% off)
- Savings: $810/month
Compress Conversation History
Multi-turn conversations accumulate tokens. A 10-message conversation repeats all previous messages with each API call. By message 10, you are paying for the full conversation context.
Instead of sending the full history, summarize it:
Full history (10 messages): ~8,000 tokens per request
Summarized context: ~500 tokens per request
Savings: 93% token reduction on context
Summarize every 5 messages or when context exceeds 4,000 tokens. Use a cheap model (Haiku/GPT-4o-mini) for summarization.
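A minimal sketch of the rolling-summary pattern, assuming hypothetical countTokens and summarizeWithCheapModel helpers (the latter wrapping a Haiku or GPT-4o-mini call); the thresholds mirror the numbers above:
type Message = { role: 'user' | 'assistant'; content: string };
async function compactHistory(
  history: Message[],
  runningSummary: string
): Promise<{ history: Message[]; summary: string }> {
  // Only compact once the conversation is long or heavy enough
  if (history.length < 5 && countTokens(history) < 4000) {
    return { history, summary: runningSummary };
  }
  // Fold older turns into the summary with a cheap model; keep the two
  // most recent messages verbatim for continuity
  const older = history.slice(0, -2);
  const recent = history.slice(-2);
  const summary = await summarizeWithCheapModel(runningSummary, older);
  return { history: recent, summary };
}
Each request then carries the system prompt, the running summary, and the recent messages instead of the full transcript.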
Optimize Output Length
Verbose responses cost more. If your system prompt says "explain in detail," you get 500-token responses. If it says "respond concisely in 1-3 sentences," you get 75-token responses.
Output tokens cost 3-5x more than input tokens. Cutting average response length from 400 to 150 tokens saves 60% on output costs.
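Both levers are worth pulling: ask for brevity in the prompt and cap spend with max_tokens as a backstop. A minimal sketch, reusing the anthropic client from the caching example above:
const concise = await anthropic.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 300, // hard ceiling on output tokens, and therefore on output cost
  system: 'Respond concisely in 1-3 sentences.',
  messages: [{ role: 'user', content: userQuery }],
});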
Semantic Caching
35-45% of customer support queries are semantic duplicates. "What are your hours?" and "When do you open?" and "Are you open on weekends?" all have the same answer.
How It Works
- Convert the incoming query to an embedding vector
- Search your cache for vectors with cosine similarity > 0.93
- If found: return the cached response (no API call)
- If not found: call the API, cache the response with its embedding
// embed, vectorCache, llm, and metrics are application-level helpers,
// not provider SDK calls
async function queryWithCache(query: string): Promise<string> {
  const embedding = await embed(query);
  const cached = await vectorCache.search(embedding, { threshold: 0.93 });
  if (cached) {
    metrics.cacheHit++;
    return cached.response;
  }
  const response = await llm.chat(query);
  await vectorCache.store(embedding, response, { ttl: 3600 });
  metrics.cacheMiss++;
  return response;
}
Cache Configuration
| Parameter | Recommended | Why |
|---|---|---|
| Similarity threshold | 0.93-0.95 | Below 0.93: wrong cached responses. Above 0.95: too few cache hits. |
| TTL | 1-24 hours | Balances freshness vs cache hit rate. Shorter for dynamic data. |
| Cache size | 10,000-50,000 entries | Covers most recurring query patterns. |
| Embedding model | text-embedding-3-small | Cheap ($0.02/1M tokens), good enough for similarity matching. |
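The same defaults as a configuration object (the shape is illustrative and would feed the queryWithCache helper above):
const semanticCacheConfig = {
  similarityThreshold: 0.93, // raise toward 0.95 if wrong answers leak through
  ttlSeconds: 3600, // 1 hour; stretch toward 24 hours for static content
  maxEntries: 25_000,
  embeddingModel: 'text-embedding-3-small',
};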
Cache Hit Rates by Use Case
| Use Case | Typical Cache Hit Rate |
|---|---|
| Customer FAQ | 40-55% |
| Product information | 30-40% |
| Technical support | 15-25% |
| Creative/content generation | 5-10% |
| General conversation | 10-20% |
Model Routing
The most impactful optimization for diverse query workloads. Different queries need different models.
The Router Architecture
Query → Complexity Classifier (Haiku, ~$0.0001/query)
├→ Simple (FAQ, greetings, status checks) → Haiku/GPT-4o-mini
├→ Standard (support, search, summaries) → Sonnet/GPT-4o
└→ Complex (analysis, reasoning, multi-step) → Opus/GPT-4o
The classifier itself is a cheap, fast model call: roughly $0.0001 and 30-50ms of overhead per query, about $1/day at 10,000 queries.
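A minimal sketch of the routing step; classifyComplexity is a hypothetical helper wrapping the Haiku classifier, and the model ids are illustrative:
const MODEL_BY_TIER = {
  simple: 'claude-3-5-haiku-latest',
  standard: 'claude-sonnet-4-20250514',
  complex: 'claude-opus-4-20250514',
} as const;
async function routedChat(query: string): Promise<string> {
  // classifyComplexity returns 'simple' | 'standard' | 'complex'
  const tier = await classifyComplexity(query);
  const response = await anthropic.messages.create({
    model: MODEL_BY_TIER[tier],
    max_tokens: 1024,
    messages: [{ role: 'user', content: query }],
  });
  const block = response.content[0];
  return block.type === 'text' ? block.text : '';
}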
Cost Impact
| Query Distribution | All-Sonnet Cost | Routed Cost | Savings |
|---|---|---|---|
| 50% simple, 35% standard, 15% complex | $3,000/month | $1,200/month | 60% |
| 30% simple, 50% standard, 20% complex | $3,000/month | $1,800/month | 40% |
| 10% simple, 40% standard, 50% complex | $3,000/month | $2,400/month | 20% |
The more queries you can route to cheap models, the more you save. Invest in making your prompts work well on smaller models — it pays for itself immediately.
Quality Guardrails
Model routing introduces a risk: a misclassified query lands on a cheap model that gives a bad answer. Mitigate with the following (a fallback sketch follows the list):
- Confidence scoring: If the cheap model's confidence is below threshold, re-route to the expensive model
- Output validation: Check response format and content against expected patterns
- Sampling: Route 5% of "simple" queries to the expensive model and compare outputs. If divergence is high, the classifier needs tuning.
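A sketch of the validation-and-escalation path; validateOutput stands in for whatever format or content check fits your use case:
async function answerWithGuardrail(query: string): Promise<string> {
  const cheapAnswer = await routedChat(query); // from the routing sketch above
  if (validateOutput(cheapAnswer)) return cheapAnswer;
  // Escalate: re-run the query on the stronger model when the check fails
  const retry = await anthropic.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 1024,
    messages: [{ role: 'user', content: query }],
  });
  const block = retry.content[0];
  return block.type === 'text' ? block.text : '';
}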
Batching
For non-real-time workloads, most providers offer batch processing at a 50% discount.
Anthropic Batch API
Submit requests in bulk; results come back within 24 hours at a 50% discount.
Use cases:
- Nightly content generation (product descriptions, email campaigns)
- Document processing (classification, extraction, summarization)
- Report generation (weekly analytics summaries, lead scoring updates)
- Evaluation pipelines (running test suites against production samples)
Implementation Pattern
// Collect requests during the day. Each entry pairs a custom_id with
// ordinary message-creation params (model, max_tokens, messages).
const batchQueue: { custom_id: string; params: ChatRequest }[] = [];
function queueForBatch(request: ChatRequest) {
  batchQueue.push({
    custom_id: generateId(),
    params: request,
  });
}
// Submit the batch at midnight
async function processBatch() {
  const batch = await anthropic.messages.batches.create({
    requests: batchQueue,
  });
  batchQueue.length = 0;
  // There is no callback: poll the batch and fetch results once it ends (see below)
}
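Retrieving results is a poll-and-fetch loop rather than a callback; a sketch, assuming a hypothetical handleResult sink:
async function collectBatchResults(batchId: string) {
  let batch = await anthropic.messages.batches.retrieve(batchId);
  while (batch.processing_status !== 'ended') {
    await new Promise((resolve) => setTimeout(resolve, 60_000)); // check once a minute
    batch = await anthropic.messages.batches.retrieve(batchId);
  }
  for await (const entry of await anthropic.messages.batches.results(batchId)) {
    if (entry.result.type === 'succeeded') {
      handleResult(entry.custom_id, entry.result.message);
    }
  }
}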
If it does not need to be real-time, it should not be real-time. The 50% discount on batch processing is the easiest cost optimization for async workloads.
Monitoring and Budget Control
Cost Dashboard Essentials
Track these metrics in real time (a per-request cost-logging sketch follows the list):
- Cost per feature: Which AI feature costs the most?
- Cost per user segment: Power users vs casual users
- Cost per query (p50, p95): Catch expensive outliers
- Cache hit rate: Is your caching working?
- Model distribution: What percentage goes to each model?
- Daily spend vs budget: Alert at 80% of daily budget
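A sketch of the per-request cost attribution that makes those breakdowns possible; PRICES and recordMetric are illustrative, and the per-million-token prices should come from your provider's current price sheet:
const PRICES: Record<string, { input: number; output: number }> = {
  'claude-sonnet-4-20250514': { input: 3, output: 15 }, // $ per 1M tokens
  'claude-3-5-haiku-latest': { input: 0.8, output: 4 },
};
function recordCost(req: {
  feature: string;
  userId: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  cacheHit: boolean;
}) {
  const price = PRICES[req.model];
  const costUsd =
    (req.inputTokens / 1e6) * price.input +
    (req.outputTokens / 1e6) * price.output;
  recordMetric('ai_request_cost_usd', costUsd, {
    feature: req.feature,
    user: req.userId,
    model: req.model,
    cache_hit: String(req.cacheHit),
  });
}
Tag every call with feature and user so the per-feature and per-segment breakdowns fall out of the same data.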
Budget Alerts
| Alert Level | Trigger | Action |
|---|---|---|
| Warning | Daily spend > 120% of average | Notify engineering |
| Critical | Daily spend > 200% of average | Investigate immediately |
| Emergency | 80% of monthly budget consumed before mid-month | Rate limit non-critical features |
Per-User Rate Limiting
Prevent individual users from driving up costs. A single power user making 500 queries/day costs as much as 50 typical users. Set per-user daily limits (a sketch follows the list):
- Free tier: 20 AI queries/day
- Standard tier: 100 AI queries/day
- Enterprise: 500 AI queries/day + overage billing
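A minimal sketch of the daily quota check; getUserTier and the counter store (for example a Redis key that expires at midnight) are assumptions:
const DAILY_LIMIT: Record<string, number> = { free: 20, standard: 100, enterprise: 500 };
async function withinAiQuota(userId: string): Promise<boolean> {
  const tier = await getUserTier(userId);
  const day = new Date().toISOString().slice(0, 10);
  const used = await counter.increment(`ai-quota:${userId}:${day}`, { ttlSeconds: 86_400 });
  return used <= DAILY_LIMIT[tier];
}
Enterprise overage billing can hang off the same counter: meter queries past the limit instead of rejecting them.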
FAQ
What monitoring tools do you recommend? For API cost tracking: build custom dashboards using provider API usage data (both OpenAI and Anthropic provide detailed usage logs). For infrastructure monitoring: Datadog with custom AI metrics, or Grafana with Prometheus. The cost of monitoring tools should be under 5% of your AI spend.
How should I allocate AI costs to product teams? Tag every API request with the feature and team that triggered it. Monthly cost reports per feature enable product teams to make cost-conscious decisions. The team that owns the feature should own its AI budget.
What is the ROI of optimization effort? Rule of thumb: 1 week of optimization effort saves 30-50% of monthly AI costs. For a $5,000/month AI spend, that is $1,500-$2,500/month in perpetual savings — the optimization pays for itself within 2 weeks.
Should I optimize before or after scaling? Before. Scaling an unoptimized system multiplies waste. Optimize at your current scale, then scale the optimized system. The cost curves are dramatically different.
Every dollar saved on AI infrastructure is a dollar that goes to product development or the bottom line. We help teams audit and optimize their AI spending.