
LLM Evaluation: Is Your AI Actually Good?

Empirium Team · 11 min read

"It seems to work pretty well" is not an evaluation. It is a feeling. And feelings do not survive the first week of production when a customer screenshots a wrong answer and posts it on social media.

Evaluating LLM outputs is the hardest part of building AI systems because the outputs are non-deterministic, quality is partially subjective, and the failure modes are diverse. But evaluation is not optional — it is what separates production AI from demos.

Here is the evaluation framework we use at Empirium across every AI system we ship.

Why Evaluation Is the Hardest Part

Traditional software has deterministic outputs. Input A always produces output B. If it does not, that is a bug. Fix it, write a test, move on.

LLMs are different:

  • Non-deterministic: The same input can produce different outputs on different runs (even at temperature 0, due to batching and hardware variations).
  • Subjectively correct: "Summarize this article" has no single correct answer. Multiple summaries can all be good. Some bad summaries can look good at first glance.
  • Context-dependent quality: A response that is perfect for an expert user might be terrible for a novice. Quality depends on who is asking and why.
  • Diverse failure modes: The model can be wrong about facts, wrong about tone, wrong about format, or right about everything but miss the point entirely.

These challenges do not make evaluation impossible. They make it require a different methodology.

Evaluation Metrics for Business Use Cases

Generic benchmarks (MMLU, HumanEval) tell you about the model. They tell you nothing about your application. You need application-specific metrics.

Core Metrics

Metric | What It Measures | How to Measure | Target
Task accuracy | Does the output correctly complete the task? | Automated checks + human sampling | > 95%
Consistency | Same question → same quality answer? | Run 50 queries 3x each, compare | Variance < 5%
Latency (p50/p99) | How fast? | Instrumentation | p50 < 2s, p99 < 5s
Cost per query | How expensive? | Token logging | Within budget
Hallucination rate | How often does it make things up? | NLI check against source docs | < 3%
Format compliance | Does the output match expected format? | Schema validation | > 99%
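
Of these, consistency is the least obvious to operationalize. A minimal sketch of one approach, assuming hypothetical generate and scoreOutput helpers standing in for your model call and your quality scorer: run each query several times and report the average score spread.

// Hypothetical helpers: generate() calls your model, scoreOutput() returns a 0-1 quality score.
declare function generate(query: string): Promise<string>;
declare function scoreOutput(query: string, output: string): Promise<number>;

// Run each query several times and report the average score spread across runs.
// A large spread means the same question gets noticeably different-quality answers.
async function consistency(queries: string[], runs = 3): Promise<number> {
  let totalSpread = 0;
  for (const query of queries) {
    const scores: number[] = [];
    for (let i = 0; i < runs; i++) {
      scores.push(await scoreOutput(query, await generate(query)));
    }
    totalSpread += Math.max(...scores) - Math.min(...scores);
  }
  // e.g. target: < 0.05, matching the "variance < 5%" row in the table
  return totalSpread / queries.length;
}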

Domain-Specific Metrics

Add metrics specific to your use case:

  • Customer support: Resolution rate (query resolved without human), escalation rate, customer satisfaction score
  • Content generation: Readability score, brand voice compliance, factual accuracy
  • Classification: Precision, recall, F1 per class, confusion matrix
  • RAG systems: Retrieval relevance (are the right documents retrieved?), answer groundedness (is the answer supported by retrieved docs?)
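
For RAG, retrieval relevance is usually measured against a labeled set of relevant document IDs per query. A minimal precision@k sketch, where relevantIds is an assumed input from your evaluation dataset:

// Fraction of the top-k retrieved documents that are actually relevant.
// relevantIds comes from the labeled evaluation dataset for this query.
function precisionAtK(retrievedIds: string[], relevantIds: Set<string>, k: number): number {
  const topK = retrievedIds.slice(0, k);
  if (topK.length === 0) return 0;
  const hits = topK.filter(id => relevantIds.has(id)).length;
  return hits / topK.length;
}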

The Metric Hierarchy

Not all metrics are equal. Prioritize:

  1. Safety: Does the AI ever produce harmful, offensive, or legally risky outputs? This is a hard zero-tolerance requirement.
  2. Accuracy: Does the AI give correct answers? This determines whether the system delivers value.
  3. Consistency: Does the AI perform reliably across inputs? This determines whether users can trust it.
  4. Latency and cost: Is the system fast and affordable enough? This determines sustainability.

Optimizing cost at the expense of accuracy is always wrong. Optimizing latency at the expense of safety is always wrong. The hierarchy matters.

Building an Evaluation Pipeline

Step 1: Create an Evaluation Dataset

Your evaluation dataset is the ground truth. It needs:

  • 50-100 test cases minimum for initial development. 200-500 for production evaluation.
  • Representative distribution: If 60% of your real queries are FAQ-type, 60% of your test cases should be FAQ-type.
  • Edge cases: At least 20% of test cases should be edge cases, adversarial inputs, and unusual queries.
  • Expected outputs: Not exact expected responses — expected outcomes. "The response should mention the return policy and include a link."
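
A single test case in this format might look like: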
{
  "id": "eval_042",
  "input": "I want to return my order but I lost the receipt",
  "context": "Order #12345, purchased 2 weeks ago, return policy: 30 days with receipt, 14 days without",
  "expected": {
    "must_contain": ["14 days", "without receipt"],
    "must_not_contain": ["30 days with receipt"],
    "tone": "empathetic",
    "action": "initiate_return OR escalate_to_human",
    "format": "natural_language"
  }
}

Step 2: Automated Scoring

Each test case is scored automatically across multiple dimensions:

// Score one response against the expectations defined in the test case.
// validateFormat and checkGroundedness are separate helpers; groundedness
// is an async LLM/NLI call, so the function is async.
async function scoreResponse(
  response: string,
  expected: Expected,
  context: string
): Promise<Score> {
  const lower = response.toLowerCase();
  return {
    // Every required phrase must appear (case-insensitive)
    containsRequired: expected.must_contain.every(
      phrase => lower.includes(phrase.toLowerCase())
    ),
    // No forbidden phrase may appear
    excludesForbidden: expected.must_not_contain.every(
      phrase => !lower.includes(phrase.toLowerCase())
    ),
    formatValid: validateFormat(response, expected.format),
    lengthAppropriate: response.length > 50 && response.length < 2000,
    // Are the response's claims supported by the retrieved context?
    groundedness: await checkGroundedness(response, context),
  };
}

Automated scoring catches 70-80% of issues. The remaining 20-30% require human judgment or LLM-as-judge.
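
The checkGroundedness call above is left undefined. One way to implement it is an LLM-based entailment check over the response's claims; a minimal sketch, assuming a hypothetical callModel helper standing in for whichever LLM client (or NLI model) you actually use:

// Hypothetical helper: sends a prompt to your LLM provider and returns its text output.
declare function callModel(prompt: string): Promise<string>;

// Ask a model whether every claim in the response is supported by the context.
// Returns a 0-1 score: the fraction of claims judged as supported.
async function checkGroundedness(response: string, context: string): Promise<number> {
  const prompt = [
    "List each factual claim in the RESPONSE.",
    "For each claim, answer SUPPORTED or UNSUPPORTED based only on the CONTEXT.",
    `CONTEXT:\n${context}`,
    `RESPONSE:\n${response}`,
    'Output JSON: {"claims": [{"claim": "...", "verdict": "SUPPORTED"}]}',
  ].join("\n\n");

  const raw = await callModel(prompt);
  const parsed = JSON.parse(raw) as { claims: { verdict: string }[] };
  if (parsed.claims.length === 0) return 1; // nothing to verify
  const supported = parsed.claims.filter(c => c.verdict === "SUPPORTED").length;
  return supported / parsed.claims.length;
}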

Step 3: Human Review Sampling

Not every response needs human review. Sample strategically:

  • All low-confidence responses: Where automated scoring is uncertain
  • Random 5% sample: For unbiased quality estimation
  • All new failure patterns: First occurrence of a previously unseen error type
  • Periodic full review: Monthly review of 100 randomly sampled production responses

Human reviewers score on a 1-5 scale across relevance, accuracy, tone, and helpfulness. Inter-rater reliability (Cohen's kappa) should be above 0.7 — if reviewers disagree too often, the rubric needs improvement.
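
Cohen's kappa for two reviewers can be computed directly from their ratings. A minimal sketch, where raterA and raterB are aligned arrays of 1-5 scores for the same responses:

// Cohen's kappa for two raters scoring the same items on a discrete scale (e.g. 1-5).
// kappa > 0.7 suggests the rubric is interpreted consistently; lower means revise it.
function cohensKappa(raterA: number[], raterB: number[]): number {
  const n = raterA.length;
  const categories = [...new Set([...raterA, ...raterB])];

  // Observed agreement: fraction of items where both raters gave the same score.
  const observed = raterA.filter((a, i) => a === raterB[i]).length / n;

  // Expected agreement by chance, from each rater's marginal score distribution.
  let expected = 0;
  for (const c of categories) {
    const pA = raterA.filter(a => a === c).length / n;
    const pB = raterB.filter(b => b === c).length / n;
    expected += pA * pB;
  }

  return (observed - expected) / (1 - expected);
}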

Step 4: Continuous Monitoring

Evaluation is not a one-time event. Quality drifts over time due to:

  • Model updates (provider-side changes)
  • Data drift (user queries evolve)
  • Knowledge base staleness
  • Prompt rot (prompts optimized for old model behavior)

Run your evaluation suite:

  • On every deployment: Blocks deployment if quality drops > 2%
  • Nightly: Against a sample of that day's production traffic
  • Weekly: Full evaluation suite against the complete test dataset
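
The deployment gate itself can be a simple comparison of the current pass rate against the last accepted baseline. A minimal sketch, assuming a hypothetical runEvalSuite helper that returns a 0-1 pass rate:

// Hypothetical helper: runs the full evaluation suite and returns the pass rate (0-1).
declare function runEvalSuite(): Promise<number>;

// Block the deployment if quality drops more than 2 percentage points below baseline.
async function deploymentGate(baselinePassRate: number): Promise<void> {
  const current = await runEvalSuite();
  const drop = baselinePassRate - current;
  if (drop > 0.02) {
    throw new Error(
      `Quality regression: pass rate ${(current * 100).toFixed(1)}% vs baseline ${(baselinePassRate * 100).toFixed(1)}%`
    );
  }
  console.log(`Eval gate passed: ${(current * 100).toFixed(1)}%`);
}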

LLM-as-Judge

Using one LLM to evaluate another's output is increasingly common and surprisingly effective — with the right calibration.

How It Works

A "judge" model receives the original query, the response being evaluated, and the evaluation criteria. It scores the response.

Judge prompt:
You are evaluating a customer support AI response.

Query: {user_query}
Context: {retrieved_context}
Response: {ai_response}

Score the response on these criteria (1-5 each):
1. Relevance: Does it address the query?
2. Accuracy: Are all claims supported by the context?
3. Helpfulness: Does it move the user toward resolution?
4. Tone: Is it appropriate for a support interaction?

For each score, provide a one-sentence justification.
Output as JSON: {"relevance": N, "accuracy": N, "helpfulness": N, "tone": N, "justifications": {...}}
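
Wiring the judge into code is mostly template filling and JSON parsing. A minimal sketch, where JUDGE_PROMPT is the prompt above with its placeholders and callModel is a hypothetical stand-in for your LLM client:

// Hypothetical stand-ins: the judge prompt template and an LLM client call.
declare const JUDGE_PROMPT: string;
declare function callModel(prompt: string): Promise<string>;

interface JudgeScores {
  relevance: number;
  accuracy: number;
  helpfulness: number;
  tone: number;
  justifications: Record<string, string>;
}

// Fill the template, call the judge model, and parse its JSON verdict.
async function judgeResponse(
  userQuery: string,
  retrievedContext: string,
  aiResponse: string
): Promise<JudgeScores> {
  const prompt = JUDGE_PROMPT
    .replace("{user_query}", userQuery)
    .replace("{retrieved_context}", retrievedContext)
    .replace("{ai_response}", aiResponse);

  const raw = await callModel(prompt);
  return JSON.parse(raw) as JudgeScores;
}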

When LLM-as-Judge Works

  • Comparative evaluation: "Is response A better than response B?" LLMs are surprisingly good at relative comparisons.
  • Multi-dimensional scoring: Scoring across 4-6 dimensions simultaneously. Faster and cheaper than human reviewers.
  • Scale: Evaluating thousands of responses where human review is impractical.

When LLM-as-Judge Fails

  • Factual verification: The judge model cannot verify facts it does not know. If the response claims "our product supports 24 languages" and the judge does not have your product data, it cannot verify this.
  • Self-bias: Models tend to rate their own outputs higher than competitors'. Use a different model family for judging (Claude to judge GPT outputs, or vice versa).
  • Subtle errors: Confidently wrong answers often receive high scores because the judge evaluates fluency and structure, not factual accuracy.

Calibration

Before trusting LLM-as-judge, calibrate against human ratings:

  1. Have humans score 100 responses
  2. Have the LLM judge score the same 100 responses
  3. Calculate correlation (Pearson or Spearman)
  4. If correlation > 0.8, the judge is reliable for your use case
  5. If correlation < 0.7, adjust the judge prompt or switch to human review

We typically see 0.75-0.85 correlation after prompt tuning, which is sufficient for production monitoring but not for high-stakes decisions.
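
Step 3 is a plain correlation between the two score series. A minimal Pearson sketch, where human and judge are aligned arrays of scores for the same responses:

// Pearson correlation between human scores and judge scores for the same responses.
// > 0.8: the judge tracks human judgment well; < 0.7: fix the judge prompt or use humans.
function pearson(human: number[], judge: number[]): number {
  const n = human.length;
  const meanH = human.reduce((s, x) => s + x, 0) / n;
  const meanJ = judge.reduce((s, x) => s + x, 0) / n;

  let cov = 0, varH = 0, varJ = 0;
  for (let i = 0; i < n; i++) {
    const dh = human[i] - meanH;
    const dj = judge[i] - meanJ;
    cov += dh * dj;
    varH += dh * dh;
    varJ += dj * dj;
  }
  return cov / Math.sqrt(varH * varJ);
}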

Evaluation Anti-Patterns

Vibes-Based Evaluation

"I tried a few queries and it seemed good." This catches zero systematic issues. Always use a structured evaluation dataset.

Overfitting to the Test Set

If you tune your prompt until it scores 100% on your test set, you have overfit. The prompt will fail on queries the test set does not cover. Keep a held-out test set (20% of cases) that you never optimize against.
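
A simple safeguard is to carve off the held-out set once and never touch it while tuning. A minimal sketch:

// Split test cases into a tuning set and a held-out set that is never used while
// iterating on the prompt. Fisher-Yates shuffle; in practice use a seeded RNG so
// the split is reproducible.
function splitEvalSet<T>(cases: T[], heldOutFraction = 0.2): { tuning: T[]; heldOut: T[] } {
  const shuffled = [...cases];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const cut = Math.floor(shuffled.length * (1 - heldOutFraction));
  return { tuning: shuffled.slice(0, cut), heldOut: shuffled.slice(cut) };
}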

Evaluating Only the Happy Path

If your test set is all well-formed, polite, English-language queries, your evaluation is meaningless. Real users send typos, ambiguous questions, multi-turn conversations, and adversarial inputs. Your test set must reflect this diversity.

One-Time Evaluation

Evaluating once and deploying is like testing software once and never again. Models change, data drifts, and users evolve. Continuous evaluation is the only evaluation that matters.

FAQ

How large should my evaluation dataset be? Start with 50-100 cases for development. Scale to 200-500 for production. For high-stakes applications (regulated industries), 1,000+ cases with quarterly updates. The marginal value of additional cases diminishes above 500 for most applications.

How often should I run regression tests? On every deployment (automated, blocks deployment on regression). Nightly against production samples. Full suite weekly. This cadence catches issues within 24 hours while keeping evaluation costs manageable.

How do I evaluate generative tasks vs classification tasks? Classification is straightforward — measure precision, recall, F1. Generative tasks need multi-dimensional scoring (relevance, accuracy, tone, format). Use LLM-as-judge for the subjective dimensions and automated checks for the objective ones (format, length, required content).
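
For the classification case, the per-class metrics fall out of counting true positives, false positives, and false negatives per label. A minimal sketch, where gold and predicted are aligned label arrays:

// Per-class precision, recall, and F1 from parallel arrays of gold labels and predictions.
function perClassF1(gold: string[], predicted: string[]) {
  const classes = new Set([...gold, ...predicted]);
  const result: Record<string, { precision: number; recall: number; f1: number }> = {};

  for (const c of classes) {
    let tp = 0, fp = 0, fn = 0;
    for (let i = 0; i < gold.length; i++) {
      if (predicted[i] === c && gold[i] === c) tp++;
      else if (predicted[i] === c) fp++;
      else if (gold[i] === c) fn++;
    }
    const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
    const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
    const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
    result[c] = { precision, recall, f1 };
  }
  return result;
}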

What is the ROI of building an evaluation pipeline? The first production incident that evaluation prevents pays for the entire pipeline. A wrong answer reaching a customer costs 10-100x more than the evaluation infrastructure. Budget 10-15% of your AI project cost for evaluation — it is the highest-ROI investment in the project.

Evaluation is the foundation of trustworthy AI. If you are building AI systems that need to be reliable, we can help design your evaluation framework.

Written by Empirium Team
