
Anatomy of a Production AI Agent

Empirium Team · 12 min read

Every AI demo looks impressive. The agent books a meeting, summarizes a document, queries a database — all in a neat 90-second screen recording. Then you deploy the same agent for 500 real users and everything breaks.

The gap between demo and production is not about capability. The models are capable enough. The gap is about reliability, cost control, monitoring, and the hundred edge cases that demos never show. At Empirium, we have shipped AI agents handling tens of thousands of interactions per month. Here is what we learned about building agents that survive production.

Demo vs Production: The Gap

A demo agent operates in a controlled environment. The inputs are predictable, the context is clean, the evaluator is the builder. Production is the opposite.

| Dimension | Demo | Production |
|---|---|---|
| Input quality | Clean, well-formed | Typos, ambiguity, injection attempts |
| Error rate tolerance | "It mostly works" | 99.5%+ success rate required |
| Latency expectation | "Look, it responded" | Under 2 seconds for interactive use |
| Cost visibility | Free tier / personal API key | $2,000–$50,000/month at scale |
| Monitoring | Print statements | Structured logging, alerting, dashboards |
| Failure handling | Crash and restart | Graceful degradation, fallback chains |
| Context management | Single conversation | Thousands of concurrent sessions |

The first production agent we shipped had a 73% success rate in week one. Not because the model was bad — because we had not anticipated the diversity of real user inputs. A user typed "wht r ur prices???" and the agent parsed it as a weather question. Another user pasted an entire email thread into the chat and the agent hallucinated responses from the email signatures.

These are not model failures. They are architecture failures.

Production Agent Architecture

A production agent is not a prompt wrapper. It is a system with distinct components, each handling a specific concern.

Input Validation Layer

Every user input passes through validation before reaching the model:

User Input → Language Detection → Content Filtering → Intent Classification → Agent Router

Language detection catches inputs in unexpected languages and routes them appropriately. Content filtering blocks prompt injection attempts, PII in inputs that should not contain it, and gibberish. Intent classification determines which agent or tool chain handles the request.

This layer rejects approximately 8% of inputs before the model sees them. That saves 8% of your token budget and eliminates 8% of hallucination opportunities.
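As a minimal sketch, the layer might look like the following. The heuristics here — a regex-based injection filter and keyword intent matching — are illustrative stand-ins; the function names and patterns are ours, not from any particular library, and a real deployment would use dedicated classifiers at each stage.

```typescript
type Intent = "pricing" | "support" | "other";

interface ValidationResult {
  ok: boolean;
  reason?: string;
  intent?: Intent;
}

// Stand-in content filter: block obvious prompt-injection phrasing.
const INJECTION_PATTERNS = [
  /ignore (all )?previous instructions/i,
  /reveal .*system prompt/i,
];

function filterContent(input: string): string | null {
  if (input.trim().length === 0) return "empty input";
  for (const pattern of INJECTION_PATTERNS) {
    if (pattern.test(input)) return "possible prompt injection";
  }
  return null; // passed
}

// Stand-in intent classifier: keyword matching instead of a real model.
function classifyIntent(input: string): Intent {
  const text = input.toLowerCase();
  if (/\b(price|prices|pricing|cost)\b/.test(text)) return "pricing";
  if (/\b(help|broken|error|refund)\b/.test(text)) return "support";
  return "other";
}

function validateInput(input: string): ValidationResult {
  const rejection = filterContent(input);
  if (rejection) return { ok: false, reason: rejection };
  return { ok: true, intent: classifyIntent(input) };
}
```

Note that even this crude keyword matcher routes "wht r ur prices???" to pricing rather than weather — much of the win is simply having a routing stage at all.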

Tool Orchestration

Production agents use tools — API calls, database queries, file operations. The tool layer needs:

  • Timeout management: Every external call has a timeout. 5 seconds for database queries, 10 seconds for API calls, 30 seconds for complex operations. No exceptions.
  • Retry logic with backoff: Transient failures get retried with exponential backoff. Permanent failures (4xx errors) do not.
  • Result validation: Tool outputs are validated before being passed to the model. A database query that returns 10,000 rows does not get injected into the context window.
  • Permission boundaries: The agent can read from the CRM but cannot delete records. Tool permissions are enforced at the infrastructure level, not the prompt level.
const toolResult = await executeWithGuardrails({
  tool: 'crm_query',
  params: validatedParams,
  timeout: 5000,
  maxRetries: 2,
  validateOutput: (result) => result.rows.length < 100,
  permissions: ['read'],
});
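The retry policy behind `maxRetries` can be sketched as two pure functions — one deciding whether a failure is transient, one computing the backoff schedule. The specific base delay and cap are illustrative choices, not fixed rules:

```typescript
// Transient failures (429 rate limits, 5xx server errors) are retryable;
// other 4xx client errors are permanent and fail fast.
function isRetryable(httpStatus: number): boolean {
  if (httpStatus === 429) return true;
  if (httpStatus >= 500 && httpStatus < 600) return true;
  return false;
}

// Exponential backoff: 500ms, 1000ms, 2000ms, ... capped at 10s.
function backoffMs(attempt: number, baseMs = 500, capMs = 10_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```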

State Management

Demo agents are stateless. Production agents need state across conversation turns, across sessions, and sometimes across users. We use a three-tier state model:

  1. Conversation state: Current context, tool results, partial workflows. Stored in Redis with a 30-minute TTL.
  2. Session state: User preferences, authentication context, conversation history summary. Stored in PostgreSQL.
  3. Global state: Knowledge base, configuration, shared context. Stored in the vector database and updated asynchronously.
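Tier 1 can be sketched as an in-memory store with a TTL, mirroring the 30-minute Redis expiry. The injected clock exists only to make expiry testable; in production Redis handles TTL natively and this class would be a thin wrapper over a Redis client:

```typescript
interface ConversationState {
  context: string[];
  toolResults: Record<string, unknown>;
}

class ConversationStore {
  private entries = new Map<
    string,
    { state: ConversationState; expiresAt: number }
  >();

  constructor(private ttlMs: number, private now: () => number = Date.now) {}

  set(sessionId: string, state: ConversationState): void {
    this.entries.set(sessionId, { state, expiresAt: this.now() + this.ttlMs });
  }

  get(sessionId: string): ConversationState | undefined {
    const entry = this.entries.get(sessionId);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(sessionId); // expired, like a Redis TTL eviction
      return undefined;
    }
    return entry.state;
  }
}
```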

Fallback Chains

When the primary model fails — timeout, rate limit, content filter — the agent needs a fallback path:

  1. Primary: Claude Sonnet via Anthropic API (best quality-to-cost ratio)
  2. Secondary: GPT-4o via OpenAI API (different provider for resilience)
  3. Tertiary: Pre-computed response templates for common queries
  4. Final: "I am experiencing technical difficulties. Let me connect you with a human." plus automatic ticket creation

Primary model availability sits around 99.7% over 12 months. At a moderate scale of 600,000 requests per month, that 0.3% represents roughly 1,800 failed requests. Without fallbacks, those are 1,800 frustrated users.
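The chain reduces to a simple loop. Providers are modeled here as functions that either return a response or throw; in a real system they would be the Anthropic and OpenAI SDK calls, a template store, and a ticketing hook — the shape below is an illustrative sketch:

```typescript
type Provider = (query: string) => string; // throws on failure

function respondWithFallbacks(
  query: string,
  providers: Provider[],
  templates: Map<string, string>,
  createTicket: (query: string) => void,
): string {
  // Tiers 1–2: try each model provider in order.
  for (const provider of providers) {
    try {
      return provider(query);
    } catch {
      continue; // timeout / rate limit / filter: fall through to next tier
    }
  }
  // Tier 3: pre-computed response templates for common queries.
  const template = templates.get(query.toLowerCase().trim());
  if (template) return template;
  // Tier 4: hand off to a human and open a ticket.
  createTicket(query);
  return "I am experiencing technical difficulties. Let me connect you with a human.";
}
```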

Monitoring and Observability

You cannot improve what you cannot measure, and AI agents have more failure modes than traditional software.

Token Usage Tracking

Every request logs input tokens, output tokens, total tokens, model used (primary or fallback), cache hit rate for prompt caching, and cost in dollars attributed to the specific feature.

We build dashboards that show token consumption by feature, user segment, and time of day. Anomalies — a sudden spike in token usage — trigger alerts. A bug in a content summarization agent once caused it to request the same context twice per query. Without token monitoring, that would have doubled costs for weeks before anyone noticed.
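The attribution itself is arithmetic. A sketch, with placeholder per-million-token rates (check your provider's current pricing — these numbers and model keys are illustrative):

```typescript
interface Usage {
  feature: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
}

// Placeholder rates in USD per 1M tokens.
const RATES_PER_MTOK: Record<string, { input: number; output: number }> = {
  "claude-sonnet": { input: 3.0, output: 15.0 },
  "gpt-4o": { input: 2.5, output: 10.0 },
};

function costUSD(u: Usage): number {
  const rate = RATES_PER_MTOK[u.model];
  if (!rate) throw new Error(`unknown model: ${u.model}`);
  return (u.inputTokens * rate.input + u.outputTokens * rate.output) / 1_000_000;
}

// Aggregate by feature for the dashboard view described above.
function costByFeature(log: Usage[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const u of log) {
    totals.set(u.feature, (totals.get(u.feature) ?? 0) + costUSD(u));
  }
  return totals;
}
```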

Response Quality Scoring

Every agent response gets an automated quality score:

  • Relevance score: Does the response address the user's question? Measured by semantic similarity.
  • Hallucination check: Does the response contain claims not supported by the retrieved context? Measured by NLI models.
  • Format compliance: Is JSON valid, are links real, are dates plausible?
  • Sentiment alignment: Is the tone appropriate? A complaint gets empathy. A product question gets directness.

Responses scoring below threshold get flagged for human review. Reviewing approximately 2% of all responses gives a statistically significant quality signal while keeping the review burden manageable.
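Relevance and hallucination scoring need models, but format compliance is plain code. A sketch of the JSON and date checks — the five-year "plausibility" window is our assumption, not a standard:

```typescript
function isValidJSON(text: string): boolean {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

// "Plausible" here means: parseable, and within a sane window around today.
function isPlausibleDate(iso: string, now: Date = new Date()): boolean {
  const d = new Date(iso);
  if (Number.isNaN(d.getTime())) return false;
  const yearDiff = Math.abs(d.getFullYear() - now.getFullYear());
  return yearDiff <= 5; // assumption: agent responses rarely need dates 5+ years out
}
```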

Error Classification

Not all errors are equal:

  • Recoverable: Rate limits, transient API failures. The system retries automatically.
  • Degraded: Model returned low-quality output. The system serves the response with a confidence indicator and logs it for review.
  • Fatal: The agent cannot complete the task. The system escalates to a human and creates a support ticket.

Each category has different alerting thresholds. Recoverable errors alert at 5% rate. Degraded at 2%. Fatal errors alert immediately.
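The thresholds above reduce to a small lookup. A sketch (the error classes and rates match the article; the function name is ours):

```typescript
type ErrorClass = "recoverable" | "degraded" | "fatal";

const ALERT_THRESHOLD: Record<ErrorClass, number> = {
  recoverable: 0.05, // alert at a 5% error rate
  degraded: 0.02,    // alert at 2%
  fatal: 0,          // alert immediately on any occurrence
};

function shouldAlert(
  cls: ErrorClass,
  errorCount: number,
  totalRequests: number,
): boolean {
  if (cls === "fatal") return errorCount > 0;
  return totalRequests > 0 && errorCount / totalRequests > ALERT_THRESHOLD[cls];
}
```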

Cost Control at Scale

AI costs scale with usage, and usage is unpredictable. A viral social media post can 10x your traffic overnight. Without cost controls, that means a 10x bill.

Token Optimization

The single biggest cost reduction comes from prompt optimization. We routinely cut system prompts by 40-60% without affecting output quality:

  • Remove redundant instructions
  • Use structured formats instead of verbose descriptions
  • Move static context into cached prompt prefixes
  • Summarize conversation history instead of including full transcripts

Semantic Caching

Caching stores responses for similar queries. When a new query is semantically close to a cached query (cosine similarity > 0.95), the cached response is served. This reduces model API calls by 35-45% for customer support agents.
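The lookup can be sketched as a cosine-similarity scan. Embeddings are plain number arrays here; in production they come from an embedding model, and the linear scan would be a vector index:

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface CacheEntry {
  embedding: number[];
  response: string;
}

// Serve the closest cached response, but only above the 0.95 threshold.
function lookupSemanticCache(
  queryEmbedding: number[],
  cache: CacheEntry[],
  threshold = 0.95,
): string | null {
  let best: CacheEntry | null = null;
  let bestSim = threshold;
  for (const entry of cache) {
    const sim = cosineSimilarity(queryEmbedding, entry.embedding);
    if (sim >= bestSim) {
      bestSim = sim;
      best = entry;
    }
  }
  return best ? best.response : null;
}
```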

Model Routing

Not every query needs the most expensive model:

| Query type | Model | Cost per 1M tokens (input) |
|---|---|---|
| Simple FAQ | Haiku / GPT-4o-mini | $0.25 |
| Standard conversation | Sonnet / GPT-4o | $3.00 |
| Complex reasoning | Opus / o1 | $15.00 |

The router itself is a lightweight classifier adding less than 50ms of latency. Cost savings: 55-70% compared to routing everything through the best model.
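As a toy sketch, the router could be as cheap as a heuristic over query length and reasoning cues — in production it would be a small trained classifier, but the three tiers match the table above and the cue words and word-count cutoffs here are illustrative guesses:

```typescript
type Tier = "small" | "standard" | "reasoning";

function routeQuery(query: string): Tier {
  const words = query.trim().split(/\s+/).length;
  const reasoningCues = /\b(why|compare|analyze|plan|step[- ]by[- ]step)\b/i;
  if (reasoningCues.test(query) || words > 80) return "reasoning"; // Opus / o1
  if (words > 15) return "standard"; // Sonnet / GPT-4o
  return "small"; // Haiku / GPT-4o-mini for short FAQ-style queries
}
```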

Testing AI Agents

Traditional software testing assumes deterministic outputs. AI agents are non-deterministic. You need a different testing approach.

Evaluation Datasets

We maintain 200-500 test cases per agent, covering:

  • Golden path: Common queries with known-good responses
  • Edge cases: Unusual inputs, multilingual queries, very long inputs
  • Adversarial: Prompt injection attempts, off-topic questions, system prompt extraction attempts
  • Regression: Previously failed queries that have been fixed

Each test case has an expected outcome and an evaluation rubric scored automatically using an LLM-as-judge approach with human review for ambiguous cases.

Continuous Evaluation

Every deployment runs the evaluation suite. A quality drop of more than 2% blocks the deployment. Nightly evaluations against production traffic samples catch gradual quality degradation — model updates, data drift, or slow prompt rot.
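The gate itself is one comparison. A sketch, assuming the 2% is a relative drop against the current baseline (the article does not specify relative vs. absolute):

```typescript
// Block the deploy when the candidate's eval score drops more than
// maxDrop (relative) below the baseline. Improvements always pass.
function deploymentAllowed(
  baselineScore: number,
  candidateScore: number,
  maxDrop = 0.02,
): boolean {
  if (baselineScore <= 0) return true; // no baseline yet: nothing to compare
  const relativeDrop = (baselineScore - candidateScore) / baselineScore;
  return relativeDrop <= maxDrop;
}
```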

FAQ

Which agent framework should I use? For simple agents, no framework — just direct API calls with your own orchestration. For complex multi-step agents, LangGraph offers the best balance of flexibility and structure. Avoid frameworks that abstract away too much; you need visibility into every step for production debugging.

What latency budget should I target? For interactive chat: under 2 seconds to first token with streaming. For background processing: under 30 seconds total. For voice agents: under 800ms to first token. These are user experience thresholds — exceeding them causes measurable drop-off.

When should I add human-in-the-loop? When the cost of a wrong answer exceeds the cost of human review. For customer support, that threshold is usually financial transactions, account changes, or legal commitments. Start with more human review and reduce it as confidence grows.

How do I convince stakeholders to invest in production infrastructure? Show them the failure rate of the demo version on real data. Run 100 real user queries through the demo agent and document every failure. The gap between demo quality and production requirements makes the case better than any slide deck.

Building production AI agents is infrastructure work, not prompt engineering. The model is 20% of the system. The other 80% is everything in this article. If you need help architecting AI agents that survive real users, talk to our team.

Written by Empirium Team
