The Operator's Guide to AI Cost Optimization

Empirium Team · 10 min read

Your AI feature works. Users love it. Then you get the monthly invoice: $8,400. Your projected budget was $2,000. You have two options: cut the feature or cut the costs.

Most AI deployments are 3-5x more expensive than they need to be. Not because the models are overpriced — because the architecture is wasteful. Redundant tokens, missed cache opportunities, expensive models handling simple queries, and synchronous processing where batching would suffice.

Here is the optimization stack we apply at Empirium, in order of impact. Combined, these techniques reduce costs by 60-80% without measurable quality loss.

The Cost Optimization Stack

Five layers, each independent, each compounding:

Layer | Technique | Typical Savings | Implementation Effort
1 | Prompt optimization | 20-40% | 1-2 days
2 | Semantic caching | 15-35% | 3-5 days
3 | Model routing | 25-45% | 5-10 days
4 | Batching | 10-20% | 2-3 days
5 | Self-hosting (high-volume routes) | 40-70% | 2-4 weeks

Layers 1-3 are quick wins. Layer 4 applies to async workloads. Layer 5 only makes sense at scale. Start from the top.

Prompt Optimization

The fastest, cheapest cost reduction. Most system prompts are 2-3x longer than they need to be.

Cut Redundant Instructions

Before (2,100 tokens):

You are a helpful, professional, and knowledgeable customer support assistant 
for Empirium. You should always be polite, respond accurately, and provide 
helpful information. Never make up information. If you don't know something, 
say so. Always try to be as helpful as possible while maintaining a 
professional tone...

After (380 tokens):

You are Empirium's support assistant. Answer from the provided context only. 
If the answer isn't in the context, say "I don't have that information" and 
suggest contacting [email protected]. Be concise and direct.

Same behavior. 82% fewer tokens. At 10,000 queries/day, that saves ~17M input tokens/day — $51/day on Claude Sonnet.

Use Prompt Caching

Anthropic and OpenAI both offer prompt caching. Static portions of your prompt (system message, tool definitions, static context) are cached on the provider side: Anthropic bills cache reads at roughly a 90% discount (with a small surcharge on the initial cache write), and OpenAI's automatic caching discounts cached input by about 50%.

import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// systemPrompt and userQuery are defined elsewhere in your application
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: systemPrompt, // 1,000-token static prompt, cached after the first request
      cache_control: { type: 'ephemeral' }
    }
  ],
  messages: [{ role: 'user', content: userQuery }],
});

For a 1,000-token system prompt at 10,000 queries/day:

  • Without caching: $30/day input cost
  • With caching: $3/day (first request full price, subsequent 90% off)
  • Savings: $810/month

Compress Conversation History

Multi-turn conversations accumulate tokens. A 10-message conversation repeats all previous messages with each API call. By message 10, you are paying for the full conversation context.

Instead of sending the full history, summarize it:

Full history (10 messages): ~8,000 tokens per request
Summarized context: ~500 tokens per request
Savings: 93% token reduction on context

Summarize every 5 messages or when context exceeds 4,000 tokens. Use a cheap model (Haiku/GPT-4o-mini) for summarization.
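
A minimal sketch of that rollup, reusing the anthropic client from the caching example above. It approximates the 4,000-token trigger by character count (an exact count needs a tokenizer), and the Haiku model alias is illustrative:

type Msg = { role: 'user' | 'assistant'; content: string };

// Rough token estimate (~4 characters per token) to decide when to compress.
const approxTokens = (msgs: Msg[]) =>
  Math.ceil(msgs.map((m) => m.content).join(' ').length / 4);

async function compressHistory(history: Msg[]): Promise<Msg[]> {
  if (history.length < 5 && approxTokens(history) < 4000) return history;

  // A cheap model summarizes everything except the most recent exchange.
  const toSummarize = history.slice(0, -2);
  const summary = await anthropic.messages.create({
    model: 'claude-3-5-haiku-latest',
    max_tokens: 300,
    system: 'Summarize this conversation in under 150 words. Keep names, numbers, and open questions.',
    messages: [{
      role: 'user',
      content: toSummarize.map((m) => `${m.role}: ${m.content}`).join('\n'),
    }],
  });

  const summaryText = summary.content[0].type === 'text' ? summary.content[0].text : '';
  // One summary message replaces the old turns; the latest exchange stays verbatim.
  return [{ role: 'user', content: `Conversation so far: ${summaryText}` }, ...history.slice(-2)];
}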

Optimize Output Length

Verbose responses cost more. If your system prompt says "explain in detail," you get 500-token responses. If it says "respond concisely in 1-3 sentences," you get 75-token responses.

Output tokens cost 3-5x more than input tokens. Cutting average response length from 400 to 150 tokens saves 60% on output costs.
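
Concretely, that is two knobs on the same call: a brevity instruction in the system prompt (which does most of the work) and a max_tokens cap as a backstop. A short sketch, again reusing the anthropic client from above:

const concise = await anthropic.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 200, // backstop only; responses that hit this cap read as truncated
  system: "You are Empirium's support assistant. Respond concisely in 1-3 sentences.",
  messages: [{ role: 'user', content: userQuery }],
});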

Semantic Caching

35-45% of customer support queries are semantic duplicates. "What are your hours?" and "When do you open?" and "Are you open on weekends?" all have the same answer.

How It Works

  1. Convert the incoming query to an embedding vector
  2. Search your cache for vectors with cosine similarity > 0.93
  3. If found: return the cached response (no API call)
  4. If not found: call the API, cache the response with its embedding

// embed, vectorCache, llm and metrics are application-level helpers
async function queryWithCache(query: string): Promise<string> {
  const embedding = await embed(query); // e.g. text-embedding-3-small
  const cached = await vectorCache.search(embedding, { threshold: 0.93 });

  if (cached) {
    metrics.cacheHit++;
    return cached.response;
  }

  const response = await llm.chat(query);
  await vectorCache.store(embedding, response, { ttl: 3600 });
  metrics.cacheMiss++;
  return response;
}

Cache Configuration

Parameter | Recommended | Why
Similarity threshold | 0.93-0.95 | Below 0.93: wrong cached responses. Above 0.95: too few cache hits.
TTL | 1-24 hours | Balances freshness vs cache hit rate. Shorter for dynamic data.
Cache size | 10,000-50,000 entries | Covers most recurring query patterns.
Embedding model | text-embedding-3-small | Cheap ($0.02/1M tokens), good enough for similarity matching.

Cache Hit Rates by Use Case

Use Case | Typical Cache Hit Rate
Customer FAQ | 40-55%
Product information | 30-40%
Technical support | 15-25%
Creative/content generation | 5-10%
General conversation | 10-20%

Model Routing

The most impactful optimization for diverse query workloads. Different queries need different models.

The Router Architecture

Query → Complexity Classifier (Haiku, ~50ms, ~$0.0001/query)
  ├→ Simple (FAQ, greetings, status checks) → Haiku/GPT-4o-mini
  ├→ Standard (support, search, summaries) → Sonnet/GPT-4o
  └→ Complex (analysis, reasoning, multi-step) → Opus/GPT-4o

The classifier itself is a cheap, fast model call: at roughly $0.0001 per query it adds about $1/day of overhead at 10,000 queries/day, plus 30-50ms of latency.
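
A sketch of the router, reusing the anthropic client from earlier. The model IDs, the classifier prompt, and the default-to-standard fallback are illustrative choices, not fixed parts of the pattern:

type Tier = 'simple' | 'standard' | 'complex';

const MODEL_FOR_TIER: Record<Tier, string> = {
  simple: 'claude-3-5-haiku-latest',
  standard: 'claude-sonnet-4-20250514',
  complex: 'claude-opus-4-20250514',
};

async function classify(query: string): Promise<Tier> {
  const res = await anthropic.messages.create({
    model: 'claude-3-5-haiku-latest',
    max_tokens: 5,
    system: 'Classify the user query as exactly one word: simple, standard, or complex.',
    messages: [{ role: 'user', content: query }],
  });
  const label = res.content[0].type === 'text' ? res.content[0].text.trim().toLowerCase() : '';
  // Fall back to the middle tier on an unexpected label.
  return (['simple', 'standard', 'complex'] as Tier[]).includes(label as Tier) ? (label as Tier) : 'standard';
}

async function routedQuery(query: string): Promise<string> {
  const tier = await classify(query);
  const res = await anthropic.messages.create({
    model: MODEL_FOR_TIER[tier],
    max_tokens: 1024,
    messages: [{ role: 'user', content: query }],
  });
  return res.content[0].type === 'text' ? res.content[0].text : '';
}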

Cost Impact

Query Distribution | All-Sonnet Cost | Routed Cost | Savings
50% simple, 35% standard, 15% complex | $3,000/month | $1,200/month | 60%
30% simple, 50% standard, 20% complex | $3,000/month | $1,800/month | 40%
10% simple, 40% standard, 50% complex | $3,000/month | $2,400/month | 20%

The more queries you can route to cheap models, the more you save. Invest in making your prompts work well on smaller models — it pays for itself immediately.

Quality Guardrails

Model routing introduces a risk: the cheap model gives a bad answer because the query was misclassified. Mitigate with:

  • Confidence scoring: If the cheap model's confidence is below threshold, re-route to the expensive model
  • Output validation: Check response format and content against expected patterns
  • Sampling: Route 5% of "simple" queries to the expensive model and compare outputs. If divergence is high, the classifier needs tuning.
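
One way to wire the first two guardrails together is a fallback wrapper: validate the cheap answer, and if it fails (or the model signals uncertainty), retry on the expensive model. A sketch reusing routedQuery from above; the validate callback and the uncertainty heuristic stand in for whatever checks your feature actually needs:

async function answerWithFallback(
  query: string,
  validate: (answer: string) => boolean,
): Promise<string> {
  const cheap = await routedQuery(query);
  // Crude uncertainty signal; replace with a real confidence check if you have one.
  const uncertain = /i (don't|do not) know|not sure/i.test(cheap);

  if (validate(cheap) && !uncertain) return cheap;

  // Misclassified or low-quality answer: re-route to the stronger model.
  const res = await anthropic.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 1024,
    messages: [{ role: 'user', content: query }],
  });
  return res.content[0].type === 'text' ? res.content[0].text : cheap;
}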

Batching

For non-real-time workloads, batch processing offers 50% discounts from most providers.

Anthropic Batch API

Submit requests in bulk. Results delivered within 24 hours at 50% discount.

Use cases:

  • Nightly content generation (product descriptions, email campaigns)
  • Document processing (classification, extraction, summarization)
  • Report generation (weekly analytics summaries, lead scoring updates)
  • Evaluation pipelines (running test suites against production samples)

Implementation Pattern

// Collect requests during the day. Each entry pairs an app-generated custom_id
// with standard Messages API params (model, max_tokens, messages, ...).
const batchQueue: { custom_id: string; params: ChatRequest }[] = [];

function queueForBatch(request: ChatRequest) {
  batchQueue.push({
    custom_id: generateId(),
    params: request,
  });
}

// Process the batch at midnight
async function processBatch() {
  const batch = await anthropic.messages.batches.create({
    requests: batchQueue,
  });
  batchQueue.length = 0;
  // The Batch API has no delivery callbacks: poll the batch and fetch
  // results once processing ends (within 24h). See the sketch below.
  return batch.id;
}
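
Because the Batch API is pull-based, a scheduled job polls the batch and collects results once processing ends. A sketch assuming the Anthropic TypeScript SDK's messages.batches methods (retrieve and results); the one-minute poll interval is arbitrary:

async function collectBatchResults(batchId: string): Promise<Map<string, string>> {
  // Wait for the batch to finish (usually well under the 24h window).
  let batch = await anthropic.messages.batches.retrieve(batchId);
  while (batch.processing_status !== 'ended') {
    await new Promise((resolve) => setTimeout(resolve, 60_000));
    batch = await anthropic.messages.batches.retrieve(batchId);
  }

  // Stream individual results and index them by custom_id.
  const answers = new Map<string, string>();
  for await (const entry of await anthropic.messages.batches.results(batchId)) {
    if (entry.result.type === 'succeeded') {
      const block = entry.result.message.content[0];
      answers.set(entry.custom_id, block.type === 'text' ? block.text : '');
    }
  }
  return answers;
}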

If it does not need to be real-time, it should not be real-time. The 50% discount on batch processing is the easiest cost optimization for async workloads.

Monitoring and Budget Control

Cost Dashboard Essentials

Track these metrics in real-time:

  • Cost per feature: Which AI feature costs the most?
  • Cost per user segment: Power users vs casual users
  • Cost per query (p50, p95): Catch expensive outliers
  • Cache hit rate: Is your caching working?
  • Model distribution: What percentage goes to each model?
  • Daily spend vs budget: Alert at 80% of daily budget
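
A minimal sketch of the per-request cost attribution that feeds those metrics: every Anthropic response reports token usage, so tag each call with its feature and convert usage to dollars. The prices and the recordCost sink are placeholders for your own rate card and metrics pipeline:

// USD per million tokens; illustrative figures, keep the real ones in config.
const PRICES: Record<string, { input: number; output: number }> = {
  'claude-sonnet-4-20250514': { input: 3, output: 15 },
  'claude-3-5-haiku-latest': { input: 0.8, output: 4 },
};

function recordCost(metric: { feature: string; model: string; costUsd: number; at: Date }) {
  console.log(JSON.stringify(metric)); // replace with your metrics pipeline
}

async function trackedCall(feature: string, model: string, userQuery: string) {
  const res = await anthropic.messages.create({
    model,
    max_tokens: 1024,
    messages: [{ role: 'user', content: userQuery }],
  });

  // Emit one metric per request, tagged by feature and model, so dashboards
  // can slice spend per feature, per model, and per day.
  const price = PRICES[model];
  const costUsd =
    (res.usage.input_tokens / 1e6) * price.input +
    (res.usage.output_tokens / 1e6) * price.output;
  recordCost({ feature, model, costUsd, at: new Date() });

  return res;
}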

Budget Alerts

Alert Level | Trigger | Action
Warning | Daily spend > 120% of average | Notify engineering
Critical | Daily spend > 200% of average | Investigate immediately
Emergency | 80% of monthly budget consumed before mid-month | Rate limit non-critical features

Per-User Rate Limiting

Prevent individual users from driving up costs. A single power user making 500 queries/day costs as much as 50 typical users making 10 each. Set per-user daily limits:

  • Free tier: 20 AI queries/day
  • Standard tier: 100 AI queries/day
  • Enterprise: 500 AI queries/day + overage billing
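
A sketch of the daily check behind those limits. An in-memory map keeps the example short; in production a shared store such as Redis is the usual choice, so counts survive restarts and apply across instances:

type PlanTier = 'free' | 'standard' | 'enterprise';

const DAILY_LIMITS: Record<PlanTier, number> = { free: 20, standard: 100, enterprise: 500 };

// Counters keyed by "userId:YYYY-MM-DD".
const usage = new Map<string, number>();

function allowQuery(userId: string, tier: PlanTier): boolean {
  const key = `${userId}:${new Date().toISOString().slice(0, 10)}`;
  const used = usage.get(key) ?? 0;
  if (used >= DAILY_LIMITS[tier]) return false; // reject, queue, or bill overage for enterprise
  usage.set(key, used + 1);
  return true;
}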

FAQ

What monitoring tools do you recommend? For API cost tracking: build custom dashboards using provider API usage data (both OpenAI and Anthropic provide detailed usage logs). For infrastructure monitoring: Datadog with custom AI metrics, or Grafana with Prometheus. The cost of monitoring tools should be under 5% of your AI spend.

How should I allocate AI costs to product teams? Tag every API request with the feature and team that triggered it. Monthly cost reports per feature enable product teams to make cost-conscious decisions. The team that owns the feature should own its AI budget.

What is the ROI of optimization effort? Rule of thumb: 1 week of optimization effort saves 30-50% of monthly AI costs. For a $5,000/month AI spend, that is $1,500-$2,500/month in perpetual savings — the optimization pays for itself within 2 weeks.

Should I optimize before or after scaling? Before. Scaling an unoptimized system multiplies waste. Optimize at your current scale, then scale the optimized system. The cost curves are dramatically different.

Every dollar saved on AI infrastructure is a dollar that goes to product development or the bottom line. We help teams audit and optimize their AI spending.

Written by Empirium Team
