
Anatomy of a Production AI Agent

Empirium Team · 12 min read

Every AI demo looks impressive. The agent books a meeting, summarizes a document, queries a database — all in a neat 90-second screen recording. Then you deploy the same agent for 500 real users and everything breaks.

The gap between demo and production is not about capability. The models are capable enough. The gap is about reliability, cost control, monitoring, and the hundred edge cases that demos never show. At Empirium, we have shipped AI agents handling tens of thousands of interactions per month. Here is what we learned about building agents that survive production.

Demo vs Production: The Gap

A demo agent operates in a controlled environment. The inputs are predictable, the context is clean, the evaluator is the builder. Production is the opposite.

| Dimension | Demo | Production |
|---|---|---|
| Input quality | Clean, well-formed | Typos, ambiguity, injection attempts |
| Error rate tolerance | "It mostly works" | 99.5%+ success rate required |
| Latency expectation | "Look, it responded" | Under 2 seconds for interactive use |
| Cost visibility | Free tier / personal API key | $2,000–$50,000/month at scale |
| Monitoring | Print statements | Structured logging, alerting, dashboards |
| Failure handling | Crash and restart | Graceful degradation, fallback chains |
| Context management | Single conversation | Thousands of concurrent sessions |

The first production agent we shipped had a 73% success rate in week one. Not because the model was bad — because we had not anticipated the diversity of real user inputs. A user typed "wht r ur prices???" and the agent parsed it as a weather question. Another user pasted an entire email thread into the chat and the agent hallucinated responses from the email signatures.

These are not model failures. They are architecture failures.

Production Agent Architecture

A production agent is not a prompt wrapper. It is a system with distinct components, each handling a specific concern.

Input Validation Layer

Every user input passes through validation before reaching the model:

User Input → Language Detection → Content Filtering → Intent Classification → Agent Router

Language detection catches inputs in unexpected languages and routes them appropriately. Content filtering blocks prompt injection attempts, PII in inputs that should not contain it, and gibberish. Intent classification determines which agent or tool chain handles the request.

This layer rejects approximately 8% of inputs before the model sees them. That saves 8% of your token budget and eliminates 8% of hallucination opportunities.
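As a minimal sketch, the layer might look like the following. The heuristics here — a regex-based injection filter and keyword intent matching — are illustrative stand-ins; the function names and patterns are ours, not from any particular library, and a real deployment would use dedicated classifiers at each stage.

```typescript
type Intent = "pricing" | "support" | "other";

interface ValidationResult {
  ok: boolean;
  reason?: string;
  intent?: Intent;
}

// Stand-in content filter: block obvious prompt-injection phrasing.
const INJECTION_PATTERNS = [
  /ignore (all )?previous instructions/i,
  /reveal .*system prompt/i,
];

function filterContent(input: string): string | null {
  if (input.trim().length === 0) return "empty input";
  for (const pattern of INJECTION_PATTERNS) {
    if (pattern.test(input)) return "possible prompt injection";
  }
  return null; // passed
}

// Stand-in intent classifier: keyword matching instead of a real model.
function classifyIntent(input: string): Intent {
  const text = input.toLowerCase();
  if (/\b(price|prices|pricing|cost)\b/.test(text)) return "pricing";
  if (/\b(help|broken|error|refund)\b/.test(text)) return "support";
  return "other";
}

function validateInput(input: string): ValidationResult {
  const rejection = filterContent(input);
  if (rejection) return { ok: false, reason: rejection };
  return { ok: true, intent: classifyIntent(input) };
}
```

Note that even this crude keyword matcher routes "wht r ur prices???" to pricing rather than weather — much of the win is simply having a routing stage at all.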

Tool Orchestration

Production agents use tools — API calls, database queries, file operations. The tool layer needs:

  • Timeout management: Every external call has a timeout. 5 seconds for database queries, 10 seconds for API calls, 30 seconds for complex operations. No exceptions.
  • Retry logic with backoff: Transient failures get retried with exponential backoff. Permanent failures (4xx errors) do not.
  • Result validation: Tool outputs are validated before being passed to the model. A database query that returns 10,000 rows does not get injected into the context window.
  • Permission boundaries: The agent can read from the CRM but cannot delete records. Tool permissions are enforced at the infrastructure level, not the prompt level.
const toolResult = await executeWithGuardrails({
  tool: 'crm_query',
  params: validatedParams,
  timeout: 5000,
  maxRetries: 2,
  validateOutput: (result) => result.rows.length < 100,
  permissions: ['read'],
});
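The retry policy behind `maxRetries` can be sketched as two pure functions — one deciding whether a failure is transient, one computing the backoff schedule. The specific base delay and cap are illustrative choices, not fixed rules:

```typescript
// Transient failures (429 rate limits, 5xx server errors) are retryable;
// other 4xx client errors are permanent and fail fast.
function isRetryable(httpStatus: number): boolean {
  if (httpStatus === 429) return true;
  if (httpStatus >= 500 && httpStatus < 600) return true;
  return false;
}

// Exponential backoff: 500ms, 1000ms, 2000ms, ... capped at 10s.
function backoffMs(attempt: number, baseMs = 500, capMs = 10_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```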

State Management

Demo agents are stateless. Production agents need state across conversation turns, across sessions, and sometimes across users. We use a three-tier state model:

  1. Conversation state: Current context, tool results, partial workflows. Stored in Redis with a 30-minute TTL.
  2. Session state: User preferences, authentication context, conversation history summary. Stored in PostgreSQL.
  3. Global state: Knowledge base, configuration, shared context. Stored in the vector database and updated asynchronously.
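Tier 1 can be sketched as an in-memory store with a TTL, mirroring the 30-minute Redis expiry. The injected clock exists only to make expiry testable; in production Redis handles TTL natively and this class would be a thin wrapper over a Redis client:

```typescript
interface ConversationState {
  context: string[];
  toolResults: Record<string, unknown>;
}

class ConversationStore {
  private entries = new Map<
    string,
    { state: ConversationState; expiresAt: number }
  >();

  constructor(private ttlMs: number, private now: () => number = Date.now) {}

  set(sessionId: string, state: ConversationState): void {
    this.entries.set(sessionId, { state, expiresAt: this.now() + this.ttlMs });
  }

  get(sessionId: string): ConversationState | undefined {
    const entry = this.entries.get(sessionId);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(sessionId); // expired, like a Redis TTL eviction
      return undefined;
    }
    return entry.state;
  }
}
```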

Fallback Chains

When the primary model fails — timeout, rate limit, content filter — the agent needs a fallback path:

  1. Primary: Claude Sonnet via Anthropic API (best quality-to-cost ratio)
  2. Secondary: GPT-4o via OpenAI API (different provider for resilience)
  3. Tertiary: Pre-computed response templates for common queries
  4. Final: "I am experiencing technical difficulties. Let me connect you with a human." plus automatic ticket creation

Primary model availability sits around 99.7% over 12 months. At a moderate scale of 600,000 requests per month, that 0.3% represents roughly 1,800 failed requests. Without fallbacks, those are 1,800 frustrated users.
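The chain reduces to a simple loop. Providers are modeled here as functions that either return a response or throw; in a real system they would be the Anthropic and OpenAI SDK calls, a template store, and a ticketing hook — the shape below is an illustrative sketch:

```typescript
type Provider = (query: string) => string; // throws on failure

function respondWithFallbacks(
  query: string,
  providers: Provider[],
  templates: Map<string, string>,
  createTicket: (query: string) => void,
): string {
  // Tiers 1–2: try each model provider in order.
  for (const provider of providers) {
    try {
      return provider(query);
    } catch {
      continue; // timeout / rate limit / filter: fall through to next tier
    }
  }
  // Tier 3: pre-computed response templates for common queries.
  const template = templates.get(query.toLowerCase().trim());
  if (template) return template;
  // Tier 4: hand off to a human and open a ticket.
  createTicket(query);
  return "I am experiencing technical difficulties. Let me connect you with a human.";
}
```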

Monitoring and Observability

You cannot improve what you cannot measure, and AI agents have more failure modes than traditional software.

Token Usage Tracking

Every request logs input tokens, output tokens, total tokens, model used (primary or fallback), cache hit rate for prompt caching, and cost in dollars attributed to the specific feature.

We build dashboards that show token consumption by feature, user segment, and time of day. Anomalies — a sudden spike in token usage — trigger alerts. A bug in a content summarization agent once caused it to request the same context twice per query. Without token monitoring, that would have doubled costs for weeks before anyone noticed.
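The attribution itself is arithmetic. A sketch, with placeholder per-million-token rates (check your provider's current pricing — these numbers and model keys are illustrative):

```typescript
interface Usage {
  feature: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
}

// Placeholder rates in USD per 1M tokens.
const RATES_PER_MTOK: Record<string, { input: number; output: number }> = {
  "claude-sonnet": { input: 3.0, output: 15.0 },
  "gpt-4o": { input: 2.5, output: 10.0 },
};

function costUSD(u: Usage): number {
  const rate = RATES_PER_MTOK[u.model];
  if (!rate) throw new Error(`unknown model: ${u.model}`);
  return (u.inputTokens * rate.input + u.outputTokens * rate.output) / 1_000_000;
}

// Aggregate by feature for the dashboard view described above.
function costByFeature(log: Usage[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const u of log) {
    totals.set(u.feature, (totals.get(u.feature) ?? 0) + costUSD(u));
  }
  return totals;
}
```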

Response Quality Scoring

Every agent response gets an automated quality score:

  • Relevance score: Does the response address the user's question? Measured by semantic similarity.
  • Hallucination check: Does the response contain claims not supported by the retrieved context? Measured by NLI models.
  • Format compliance: Is JSON valid, are links real, are dates plausible?
  • Sentiment alignment: Is the tone appropriate? A complaint gets empathy. A product question gets directness.

Responses scoring below threshold get flagged for human review. Reviewing approximately 2% of all responses gives a statistically significant quality signal while keeping the review burden manageable.
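Relevance and hallucination scoring need models, but format compliance is plain code. A sketch of the JSON and date checks — the five-year "plausibility" window is our assumption, not a standard:

```typescript
function isValidJSON(text: string): boolean {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

// "Plausible" here means: parseable, and within a sane window around today.
function isPlausibleDate(iso: string, now: Date = new Date()): boolean {
  const d = new Date(iso);
  if (Number.isNaN(d.getTime())) return false;
  const yearDiff = Math.abs(d.getFullYear() - now.getFullYear());
  return yearDiff <= 5; // assumption: agent responses rarely need dates 5+ years out
}
```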

Error Classification

Not all errors are equal:

  • Recoverable: Rate limits, transient API failures. The system retries automatically.
  • Degraded: Model returned low-quality output. The system serves the response with a confidence indicator and logs it for review.
  • Fatal: The agent cannot complete the task. The system escalates to a human and creates a support ticket.

Each category has different alerting thresholds. Recoverable errors alert at 5% rate. Degraded at 2%. Fatal errors alert immediately.
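The thresholds above reduce to a small lookup. A sketch (the error classes and rates match the article; the function name is ours):

```typescript
type ErrorClass = "recoverable" | "degraded" | "fatal";

const ALERT_THRESHOLD: Record<ErrorClass, number> = {
  recoverable: 0.05, // alert at a 5% error rate
  degraded: 0.02,    // alert at 2%
  fatal: 0,          // alert immediately on any occurrence
};

function shouldAlert(
  cls: ErrorClass,
  errorCount: number,
  totalRequests: number,
): boolean {
  if (cls === "fatal") return errorCount > 0;
  return totalRequests > 0 && errorCount / totalRequests > ALERT_THRESHOLD[cls];
}
```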

Cost Control at Scale

AI costs scale with usage, and usage is unpredictable. A viral social media post can 10x your traffic overnight. Without cost controls, that means a 10x bill.

Token Optimization

The single biggest cost reduction comes from prompt optimization. We routinely cut system prompts by 40-60% without affecting output quality:

  • Remove redundant instructions
  • Use structured formats instead of verbose descriptions
  • Move static context into cached prompt prefixes
  • Summarize conversation history instead of including full transcripts

Semantic Caching

Caching stores responses for similar queries. When a new query is semantically close to a cached query (cosine similarity > 0.95), the cached response is served. This reduces model API calls by 35-45% for customer support agents.
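The lookup can be sketched as a cosine-similarity scan. Embeddings are plain number arrays here; in production they come from an embedding model, and the linear scan would be a vector index:

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface CacheEntry {
  embedding: number[];
  response: string;
}

// Serve the closest cached response, but only above the 0.95 threshold.
function lookupSemanticCache(
  queryEmbedding: number[],
  cache: CacheEntry[],
  threshold = 0.95,
): string | null {
  let best: CacheEntry | null = null;
  let bestSim = threshold;
  for (const entry of cache) {
    const sim = cosineSimilarity(queryEmbedding, entry.embedding);
    if (sim >= bestSim) {
      bestSim = sim;
      best = entry;
    }
  }
  return best ? best.response : null;
}
```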

Model Routing

Not every query needs the most expensive model:

| Query type | Model | Cost per 1M tokens (input) |
|---|---|---|
| Simple FAQ | Haiku / GPT-4o-mini | $0.25 |
| Standard conversation | Sonnet / GPT-4o | $3.00 |
| Complex reasoning | Opus / o1 | $15.00 |

The router itself is a lightweight classifier adding less than 50ms of latency. Cost savings: 55-70% compared to routing everything through the best model.
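As a toy sketch, the router could be as cheap as a heuristic over query length and reasoning cues — in production it would be a small trained classifier, but the three tiers match the table above and the cue words and word-count cutoffs here are illustrative guesses:

```typescript
type Tier = "small" | "standard" | "reasoning";

function routeQuery(query: string): Tier {
  const words = query.trim().split(/\s+/).length;
  const reasoningCues = /\b(why|compare|analyze|plan|step[- ]by[- ]step)\b/i;
  if (reasoningCues.test(query) || words > 80) return "reasoning"; // Opus / o1
  if (words > 15) return "standard"; // Sonnet / GPT-4o
  return "small"; // Haiku / GPT-4o-mini for short FAQ-style queries
}
```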

Testing AI Agents

Traditional software testing assumes deterministic outputs. AI agents are non-deterministic. You need a different testing approach.

Evaluation Datasets

We maintain 200-500 test cases per agent, covering:

  • Golden path: Common queries with known-good responses
  • Edge cases: Unusual inputs, multilingual queries, very long inputs
  • Adversarial: Prompt injection attempts, off-topic questions, system prompt extraction attempts
  • Regression: Previously failed queries that have been fixed

Each test case has an expected outcome and an evaluation rubric scored automatically using an LLM-as-judge approach with human review for ambiguous cases.

Continuous Evaluation

Every deployment runs the evaluation suite. A quality drop of more than 2% blocks the deployment. Nightly evaluations against production traffic samples catch gradual quality degradation — model updates, data drift, or slow prompt rot.
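The gate itself is one comparison. A sketch, assuming the 2% is a relative drop against the current baseline (the article does not specify relative vs. absolute):

```typescript
// Block the deploy when the candidate's eval score drops more than
// maxDrop (relative) below the baseline. Improvements always pass.
function deploymentAllowed(
  baselineScore: number,
  candidateScore: number,
  maxDrop = 0.02,
): boolean {
  if (baselineScore <= 0) return true; // no baseline yet: nothing to compare
  const relativeDrop = (baselineScore - candidateScore) / baselineScore;
  return relativeDrop <= maxDrop;
}
```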

FAQ

Which agent framework should I use? For simple agents, no framework — just direct API calls with your own orchestration. For complex multi-step agents, LangGraph offers the best balance of flexibility and structure. Avoid frameworks that abstract away too much; you need visibility into every step for production debugging.

What latency budget should I target? For interactive chat: under 2 seconds to first token with streaming. For background processing: under 30 seconds total. For voice agents: under 800ms to first token. These are user experience thresholds — exceeding them causes measurable drop-off.

When should I add human-in-the-loop? When the cost of a wrong answer exceeds the cost of human review. For customer support, that threshold is usually financial transactions, account changes, or legal commitments. Start with more human review and reduce it as confidence grows.

How do I convince stakeholders to invest in production infrastructure? Show them the failure rate of the demo version on real data. Run 100 real user queries through the demo agent and document every failure. The gap between demo quality and production requirements makes the case better than any slide deck.

Building production AI agents is infrastructure work, not prompt engineering. The model is 20% of the system. The other 80% is everything in this article. If you need help architecting AI agents that survive real users, talk to our team.

Written by Empirium Team
