The Operator's Guide to AI Cost Optimization

Empirium Team · 10 min read

Your AI feature works. Users love it. Then you get the monthly invoice: $8,400. Your projected budget was $2,000. You have two options: cut the feature or cut the costs.

Most AI deployments are 3-5x more expensive than they need to be. Not because the models are overpriced — because the architecture is wasteful. Redundant tokens, missed cache opportunities, expensive models handling simple queries, and synchronous processing where batching would suffice.

Here is the optimization stack we apply at Empirium, in order of impact. Combined, these techniques reduce costs by 60-80% without measurable quality loss.

The Cost Optimization Stack

Five layers, each independent, each compounding:

Layer | Technique | Typical Savings | Implementation Effort
1 | Prompt optimization | 20-40% | 1-2 days
2 | Semantic caching | 15-35% | 3-5 days
3 | Model routing | 25-45% | 5-10 days
4 | Batching | 10-20% | 2-3 days
5 | Self-hosting (high-volume routes) | 40-70% | 2-4 weeks

Layers 1-3 are quick wins. Layer 4 applies to async workloads. Layer 5 only makes sense at scale. Start from the top.

Prompt Optimization

The fastest, cheapest cost reduction. Most system prompts are 2-3x longer than they need to be.

Cut Redundant Instructions

Before (2,100 tokens):

You are a helpful, professional, and knowledgeable customer support assistant 
for Empirium. You should always be polite, respond accurately, and provide 
helpful information. Never make up information. If you don't know something, 
say so. Always try to be as helpful as possible while maintaining a 
professional tone...

After (380 tokens):

You are Empirium's support assistant. Answer from the provided context only. 
If the answer isn't in the context, say "I don't have that information" and 
suggest contacting [email protected]. Be concise and direct.

Same behavior. 82% fewer tokens. At 10,000 queries/day, that saves ~17M input tokens/day — $51/day on Claude Sonnet.

Use Prompt Caching

Anthropic and OpenAI both offer prompt caching. Static portions of your prompt (system message, tool definitions, static context) are cached on the provider side: Anthropic bills cache reads at roughly a 90% discount (with a small surcharge on the initial cache write), and OpenAI's automatic caching discounts cached input by about 50%.

import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// systemPrompt and userQuery are defined elsewhere in your application
const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: systemPrompt, // 1,000-token static prompt, cached after the first request
      cache_control: { type: 'ephemeral' }
    }
  ],
  messages: [{ role: 'user', content: userQuery }],
});

For a 1,000-token system prompt at 10,000 queries/day:

  • Without caching: $30/day input cost
  • With caching: $3/day (first request full price, subsequent 90% off)
  • Savings: $810/month

Compress Conversation History

Multi-turn conversations accumulate tokens. A 10-message conversation repeats all previous messages with each API call. By message 10, you are paying for the full conversation context.

Instead of sending the full history, summarize it:

Full history (10 messages): ~8,000 tokens per request
Summarized context: ~500 tokens per request
Savings: 93% token reduction on context

Summarize every 5 messages or when context exceeds 4,000 tokens. Use a cheap model (Haiku/GPT-4o-mini) for summarization.
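
A minimal sketch of that rollup, reusing the anthropic client from the caching example above. It approximates the 4,000-token trigger by character count (an exact count needs a tokenizer), and the Haiku model alias is illustrative:

type Msg = { role: 'user' | 'assistant'; content: string };

// Rough token estimate (~4 characters per token) to decide when to compress.
const approxTokens = (msgs: Msg[]) =>
  Math.ceil(msgs.map((m) => m.content).join(' ').length / 4);

async function compressHistory(history: Msg[]): Promise<Msg[]> {
  if (history.length < 5 && approxTokens(history) < 4000) return history;

  // A cheap model summarizes everything except the most recent exchange.
  const toSummarize = history.slice(0, -2);
  const summary = await anthropic.messages.create({
    model: 'claude-3-5-haiku-latest',
    max_tokens: 300,
    system: 'Summarize this conversation in under 150 words. Keep names, numbers, and open questions.',
    messages: [{
      role: 'user',
      content: toSummarize.map((m) => `${m.role}: ${m.content}`).join('\n'),
    }],
  });

  const summaryText = summary.content[0].type === 'text' ? summary.content[0].text : '';
  // One summary message replaces the old turns; the latest exchange stays verbatim.
  return [{ role: 'user', content: `Conversation so far: ${summaryText}` }, ...history.slice(-2)];
}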

Optimize Output Length

Verbose responses cost more. If your system prompt says "explain in detail," you get 500-token responses. If it says "respond concisely in 1-3 sentences," you get 75-token responses.

Output tokens cost 3-5x more than input tokens. Cutting average response length from 400 to 150 tokens saves 60% on output costs.
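
Concretely, that is two knobs on the same call: a brevity instruction in the system prompt (which does most of the work) and a max_tokens cap as a backstop. A short sketch, again reusing the anthropic client from above:

const concise = await anthropic.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 200, // backstop only; responses that hit this cap read as truncated
  system: "You are Empirium's support assistant. Respond concisely in 1-3 sentences.",
  messages: [{ role: 'user', content: userQuery }],
});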

Semantic Caching

35-45% of customer support queries are semantic duplicates. "What are your hours?" and "When do you open?" and "Are you open on weekends?" all have the same answer.

How It Works

  1. Convert the incoming query to an embedding vector
  2. Search your cache for vectors with cosine similarity > 0.93
  3. If found: return the cached response (no API call)
  4. If not found: call the API, cache the response with its embedding

// embed, vectorCache, llm and metrics are application-level helpers
async function queryWithCache(query: string): Promise<string> {
  const embedding = await embed(query); // e.g. text-embedding-3-small
  const cached = await vectorCache.search(embedding, { threshold: 0.93 });

  if (cached) {
    metrics.cacheHit++;
    return cached.response;
  }

  const response = await llm.chat(query);
  await vectorCache.store(embedding, response, { ttl: 3600 });
  metrics.cacheMiss++;
  return response;
}

Cache Configuration

Parameter | Recommended | Why
Similarity threshold | 0.93-0.95 | Below 0.93: wrong cached responses. Above 0.95: too few cache hits.
TTL | 1-24 hours | Balances freshness vs cache hit rate. Shorter for dynamic data.
Cache size | 10,000-50,000 entries | Covers most recurring query patterns.
Embedding model | text-embedding-3-small | Cheap ($0.02/1M tokens), good enough for similarity matching.

Cache Hit Rates by Use Case

Use Case | Typical Cache Hit Rate
Customer FAQ | 40-55%
Product information | 30-40%
Technical support | 15-25%
Creative/content generation | 5-10%
General conversation | 10-20%

Model Routing

The most impactful optimization for diverse query workloads. Different queries need different models.

The Router Architecture

Query → Complexity Classifier (Haiku, ~50ms, ~$0.0001/query)
  ├→ Simple (FAQ, greetings, status checks) → Haiku/GPT-4o-mini
  ├→ Standard (support, search, summaries) → Sonnet/GPT-4o
  └→ Complex (analysis, reasoning, multi-step) → Opus/GPT-4o

The classifier itself is a cheap, fast model call: at roughly $0.0001 per query it adds about $1/day of overhead at 10,000 queries/day, plus 30-50ms of latency.
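
A sketch of the router, reusing the anthropic client from earlier. The model IDs, the classifier prompt, and the default-to-standard fallback are illustrative choices, not fixed parts of the pattern:

type Tier = 'simple' | 'standard' | 'complex';

const MODEL_FOR_TIER: Record<Tier, string> = {
  simple: 'claude-3-5-haiku-latest',
  standard: 'claude-sonnet-4-20250514',
  complex: 'claude-opus-4-20250514',
};

async function classify(query: string): Promise<Tier> {
  const res = await anthropic.messages.create({
    model: 'claude-3-5-haiku-latest',
    max_tokens: 5,
    system: 'Classify the user query as exactly one word: simple, standard, or complex.',
    messages: [{ role: 'user', content: query }],
  });
  const label = res.content[0].type === 'text' ? res.content[0].text.trim().toLowerCase() : '';
  // Fall back to the middle tier on an unexpected label.
  return (['simple', 'standard', 'complex'] as Tier[]).includes(label as Tier) ? (label as Tier) : 'standard';
}

async function routedQuery(query: string): Promise<string> {
  const tier = await classify(query);
  const res = await anthropic.messages.create({
    model: MODEL_FOR_TIER[tier],
    max_tokens: 1024,
    messages: [{ role: 'user', content: query }],
  });
  return res.content[0].type === 'text' ? res.content[0].text : '';
}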

Cost Impact

Query Distribution | All-Sonnet Cost | Routed Cost | Savings
50% simple, 35% standard, 15% complex | $3,000/month | $1,200/month | 60%
30% simple, 50% standard, 20% complex | $3,000/month | $1,800/month | 40%
10% simple, 40% standard, 50% complex | $3,000/month | $2,400/month | 20%

The more queries you can route to cheap models, the more you save. Invest in making your prompts work well on smaller models — it pays for itself immediately.

Quality Guardrails

Model routing introduces a risk: the cheap model gives a bad answer because the query was misclassified. Mitigate with:

  • Confidence scoring: If the cheap model's confidence is below threshold, re-route to the expensive model
  • Output validation: Check response format and content against expected patterns
  • Sampling: Route 5% of "simple" queries to the expensive model and compare outputs. If divergence is high, the classifier needs tuning.
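
One way to wire the first two guardrails together is a fallback wrapper: validate the cheap answer, and if it fails (or the model signals uncertainty), retry on the expensive model. A sketch reusing routedQuery from above; the validate callback and the uncertainty heuristic stand in for whatever checks your feature actually needs:

async function answerWithFallback(
  query: string,
  validate: (answer: string) => boolean,
): Promise<string> {
  const cheap = await routedQuery(query);
  // Crude uncertainty signal; replace with a real confidence check if you have one.
  const uncertain = /i (don't|do not) know|not sure/i.test(cheap);

  if (validate(cheap) && !uncertain) return cheap;

  // Misclassified or low-quality answer: re-route to the stronger model.
  const res = await anthropic.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 1024,
    messages: [{ role: 'user', content: query }],
  });
  return res.content[0].type === 'text' ? res.content[0].text : cheap;
}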

Batching

For non-real-time workloads, batch processing offers 50% discounts from most providers.

Anthropic Batch API

Submit requests in bulk. Results delivered within 24 hours at 50% discount.

Use cases:

  • Nightly content generation (product descriptions, email campaigns)
  • Document processing (classification, extraction, summarization)
  • Report generation (weekly analytics summaries, lead scoring updates)
  • Evaluation pipelines (running test suites against production samples)

Implementation Pattern

// Collect requests during the day. Each entry pairs an app-generated custom_id
// with standard Messages API params (model, max_tokens, messages, ...).
const batchQueue: { custom_id: string; params: ChatRequest }[] = [];

function queueForBatch(request: ChatRequest) {
  batchQueue.push({
    custom_id: generateId(),
    params: request,
  });
}

// Process the batch at midnight
async function processBatch() {
  const batch = await anthropic.messages.batches.create({
    requests: batchQueue,
  });
  batchQueue.length = 0;
  // The Batch API has no delivery callbacks: poll the batch and fetch
  // results once processing ends (within 24h). See the sketch below.
  return batch.id;
}
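
Because the Batch API is pull-based, a scheduled job polls the batch and collects results once processing ends. A sketch assuming the Anthropic TypeScript SDK's messages.batches methods (retrieve and results); the one-minute poll interval is arbitrary:

async function collectBatchResults(batchId: string): Promise<Map<string, string>> {
  // Wait for the batch to finish (usually well under the 24h window).
  let batch = await anthropic.messages.batches.retrieve(batchId);
  while (batch.processing_status !== 'ended') {
    await new Promise((resolve) => setTimeout(resolve, 60_000));
    batch = await anthropic.messages.batches.retrieve(batchId);
  }

  // Stream individual results and index them by custom_id.
  const answers = new Map<string, string>();
  for await (const entry of await anthropic.messages.batches.results(batchId)) {
    if (entry.result.type === 'succeeded') {
      const block = entry.result.message.content[0];
      answers.set(entry.custom_id, block.type === 'text' ? block.text : '');
    }
  }
  return answers;
}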

If it does not need to be real-time, it should not be real-time. The 50% discount on batch processing is the easiest cost optimization for async workloads.

Monitoring and Budget Control

Cost Dashboard Essentials

Track these metrics in real-time:

  • Cost per feature: Which AI feature costs the most?
  • Cost per user segment: Power users vs casual users
  • Cost per query (p50, p95): Catch expensive outliers
  • Cache hit rate: Is your caching working?
  • Model distribution: What percentage goes to each model?
  • Daily spend vs budget: Alert at 80% of daily budget
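
A minimal sketch of the per-request cost attribution that feeds those metrics: every Anthropic response reports token usage, so tag each call with its feature and convert usage to dollars. The prices and the recordCost sink are placeholders for your own rate card and metrics pipeline:

// USD per million tokens; illustrative figures, keep the real ones in config.
const PRICES: Record<string, { input: number; output: number }> = {
  'claude-sonnet-4-20250514': { input: 3, output: 15 },
  'claude-3-5-haiku-latest': { input: 0.8, output: 4 },
};

function recordCost(metric: { feature: string; model: string; costUsd: number; at: Date }) {
  console.log(JSON.stringify(metric)); // replace with your metrics pipeline
}

async function trackedCall(feature: string, model: string, userQuery: string) {
  const res = await anthropic.messages.create({
    model,
    max_tokens: 1024,
    messages: [{ role: 'user', content: userQuery }],
  });

  // Emit one metric per request, tagged by feature and model, so dashboards
  // can slice spend per feature, per model, and per day.
  const price = PRICES[model];
  const costUsd =
    (res.usage.input_tokens / 1e6) * price.input +
    (res.usage.output_tokens / 1e6) * price.output;
  recordCost({ feature, model, costUsd, at: new Date() });

  return res;
}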

Budget Alerts

Alert Level | Trigger | Action
Warning | Daily spend > 120% of average | Notify engineering
Critical | Daily spend > 200% of average | Investigate immediately
Emergency | 80% of monthly budget consumed before mid-month | Rate limit non-critical features

Per-User Rate Limiting

Prevent individual users from driving up costs. A single power user making 500 queries/day costs as much as 50 typical users making 10 each. Set per-user daily limits:

  • Free tier: 20 AI queries/day
  • Standard tier: 100 AI queries/day
  • Enterprise: 500 AI queries/day + overage billing
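
A sketch of the daily check behind those limits. An in-memory map keeps the example short; in production a shared store such as Redis is the usual choice, so counts survive restarts and apply across instances:

type PlanTier = 'free' | 'standard' | 'enterprise';

const DAILY_LIMITS: Record<PlanTier, number> = { free: 20, standard: 100, enterprise: 500 };

// Counters keyed by "userId:YYYY-MM-DD".
const usage = new Map<string, number>();

function allowQuery(userId: string, tier: PlanTier): boolean {
  const key = `${userId}:${new Date().toISOString().slice(0, 10)}`;
  const used = usage.get(key) ?? 0;
  if (used >= DAILY_LIMITS[tier]) return false; // reject, queue, or bill overage for enterprise
  usage.set(key, used + 1);
  return true;
}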

FAQ

What monitoring tools do you recommend? For API cost tracking: build custom dashboards using provider API usage data (both OpenAI and Anthropic provide detailed usage logs). For infrastructure monitoring: Datadog with custom AI metrics, or Grafana with Prometheus. The cost of monitoring tools should be under 5% of your AI spend.

How should I allocate AI costs to product teams? Tag every API request with the feature and team that triggered it. Monthly cost reports per feature enable product teams to make cost-conscious decisions. The team that owns the feature should own its AI budget.

What is the ROI of optimization effort? Rule of thumb: 1 week of optimization effort saves 30-50% of monthly AI costs. For a $5,000/month AI spend, that is $1,500-$2,500/month in perpetual savings — the optimization pays for itself within 2 weeks.

Should I optimize before or after scaling? Before. Scaling an unoptimized system multiplies waste. Optimize at your current scale, then scale the optimized system. The cost curves are dramatically different.

Every dollar saved on AI infrastructure is a dollar that goes to product development or the bottom line. We help teams audit and optimize their AI spending.

Written by Empirium Team
