
B2B Prompt Engineering

Empirium Team · 10 min read

Consumer AI can be approximate. A creative writing assistant that occasionally produces a mediocre paragraph is fine — the user edits it. A B2B AI system that occasionally miscategorizes a $200,000 deal in your CRM is a disaster.

B2B prompt engineering is a different discipline than the "write me a poem about cats" style guides that dominate the internet. The stakes are higher, the outputs need to be structured, and the margin for error is near zero.

Here are the patterns we use at Empirium when building production AI systems for business applications.

Why B2B Prompting Is Different

Four factors separate B2B prompting from consumer prompting:

Accuracy requirements: A consumer chatbot that is 90% accurate is useful. A B2B system that misquotes pricing, miscalculates commissions, or misclassifies support tickets 10% of the time creates more problems than it solves.

Structured outputs: Business systems consume JSON, not prose. Your AI's output feeds into CRMs, databases, and dashboards. If the JSON is malformed, the pipeline breaks. If a field contains an unexpected value, downstream systems fail silently.

Compliance constraints: In regulated industries, AI outputs may be audited. Every response needs to be traceable to its source data, and the reasoning path must be explainable.

Consistency: A consumer assistant can vary its style. A business system must produce the same output format, the same terminology, and the same level of detail every single time. Inconsistency breaks automation pipelines and confuses users.

Structured Output Patterns

JSON Mode

Most LLM providers offer a JSON mode that constrains the model to emit syntactically valid JSON — this alone eliminates most parsing failures. Anthropic's equivalent is forcing a tool call against a schema:

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024,
  messages: [{ role: 'user', content: classificationPrompt }],
  // Anthropic: use tool_choice to force structured output
  tools: [{
    name: 'classify_ticket',
    description: 'Classify the support ticket',
    input_schema: {
      type: 'object',
      properties: {
        category: { 
          type: 'string', 
          enum: ['billing', 'technical', 'feature_request', 'complaint'] 
        },
        priority: { 
          type: 'string', 
          enum: ['low', 'medium', 'high', 'critical'] 
        },
        summary: { type: 'string', maxLength: 200 },
        suggested_action: { type: 'string' }
      },
      required: ['category', 'priority', 'summary']
    }
  }],
  tool_choice: { type: 'tool', name: 'classify_ticket' }
});

Using tool/function calling for structured outputs is more reliable than asking the model to "respond in JSON format." The schema acts as a contract — the model cannot return a field that is not in the schema, and required fields are always present.
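On the Anthropic API, the structured result comes back as a `tool_use` content block rather than plain text, so the parsing side reduces to pulling out that block's `input`. A minimal sketch — the response shape below is a simplified mirror of the Messages API content format, and the mock payload is illustrative:

```typescript
// Simplified shape of the content blocks returned by the Messages API.
type ContentBlock =
  | { type: 'text'; text: string }
  | { type: 'tool_use'; name: string; input: Record<string, unknown> };

// Pull the structured input out of the forced tool call.
// Throws rather than returning a partial object, so a malformed
// response fails loudly instead of corrupting downstream data.
function extractToolInput(
  content: ContentBlock[],
  toolName: string
): Record<string, unknown> {
  const block = content.find(
    (b): b is Extract<ContentBlock, { type: 'tool_use' }> =>
      b.type === 'tool_use' && b.name === toolName
  );
  if (!block) {
    throw new Error(`Expected a tool_use block named "${toolName}"`);
  }
  return block.input;
}

// Example with a mocked response payload:
const mockContent: ContentBlock[] = [
  {
    type: 'tool_use',
    name: 'classify_ticket',
    input: { category: 'billing', priority: 'high', summary: 'Duplicate charge' },
  },
];
const result = extractToolInput(mockContent, 'classify_ticket');
```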

XML Tags for Section Control

When you need the model to produce multiple distinct sections in a single response, XML tags provide clear boundaries:

Analyze the following sales call transcript and produce:

<analysis>
  <sentiment>Overall sentiment of the prospect (positive/neutral/negative)</sentiment>
  <objections>List each objection raised, one per line</objections>
  <next_steps>Recommended follow-up actions</next_steps>
  <deal_probability>Estimated close probability as a percentage</deal_probability>
</analysis>

XML tags are more reliable than markdown headers or numbered lists for parsing. They nest cleanly, they are unambiguous, and every programming language has an XML parser.
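The parsing side can be sketched in a few lines, assuming the model's output contains the tags exactly as requested — a regex is enough for flat, non-repeating tags like these; reach for a real XML parser once sections nest or repeat:

```typescript
// Extract the text between <tag> and </tag> in a model response.
// Returns null when the tag is missing so callers can fail explicitly.
function extractSection(response: string, tag: string): string | null {
  const match = response.match(new RegExp(`<${tag}>([\\s\\S]*?)</${tag}>`));
  return match ? match[1].trim() : null;
}

// Illustrative model output following the prompt above:
const modelOutput = `
<analysis>
  <sentiment>positive</sentiment>
  <objections>Price is above budget</objections>
  <next_steps>Send ROI calculator</next_steps>
  <deal_probability>65%</deal_probability>
</analysis>`;

const sentiment = extractSection(modelOutput, 'sentiment');          // "positive"
const probability = extractSection(modelOutput, 'deal_probability'); // "65%"
```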

Enum Constraints

For classification tasks, explicitly enumerate all valid values:

Classify this email into EXACTLY ONE of these categories:
- SALES_INQUIRY: prospect asking about products or pricing
- SUPPORT_REQUEST: existing customer needing help
- PARTNERSHIP: business collaboration proposal  
- SPAM: irrelevant or unsolicited content
- INTERNAL: internal team communication

Respond with only the category name. No explanation.

The explicit enumeration with descriptions prevents the model from inventing categories. Without descriptions, the model may misunderstand the boundary between categories — is a pricing question from an existing customer a SALES_INQUIRY or SUPPORT_REQUEST? The descriptions resolve this.
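Even with the enumeration in the prompt, validate the model's answer before it enters your pipeline. A minimal sketch using the category names above — anything outside the enum goes to manual review instead of being trusted:

```typescript
const CATEGORIES = [
  'SALES_INQUIRY',
  'SUPPORT_REQUEST',
  'PARTNERSHIP',
  'SPAM',
  'INTERNAL',
] as const;
type Category = (typeof CATEGORIES)[number];

// Normalize whitespace and casing, then check membership.
// An out-of-enum answer is flagged rather than passed through.
function parseCategory(raw: string): Category | 'NEEDS_REVIEW' {
  const cleaned = raw.trim().toUpperCase();
  return (CATEGORIES as readonly string[]).includes(cleaned)
    ? (cleaned as Category)
    : 'NEEDS_REVIEW';
}
```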

Chain-of-Thought for Business Logic

Complex business rules are where models make the most mistakes. A prompt that says "apply our discount policy" will produce inconsistent results because the model interprets "discount policy" differently each time.

The Step-by-Step Pattern

Break complex rules into explicit reasoning steps:

Calculate the total price for this order using these rules IN ORDER:

Step 1 - Base price: Look up the unit price for each product in the catalog below.
Step 2 - Volume discount: If quantity > 100, apply 10% discount. If > 500, apply 20%.
Step 3 - Customer tier: If customer is "Enterprise," apply additional 5% discount.
Step 4 - Minimum margin: If final price per unit is below $X (cost + 15% margin), 
         flag as "NEEDS_APPROVAL" instead of calculating.
Step 5 - Tax: Apply the tax rate for the customer's state.

Show your work for each step before giving the final number.

Forcing the model to show intermediate steps:

  1. Makes errors visible and debuggable
  2. Improves accuracy by 15-25% on multi-step calculations
  3. Creates an audit trail for compliance
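The same rules can be mirrored in deterministic code and used to cross-check the model's arithmetic. A sketch with illustrative field names — the margin floor uses the prompt's "cost + 15%" definition, since the exact $X threshold is account-specific:

```typescript
interface Order {
  unitPrice: number;   // Step 1: base price from the catalog
  unitCost: number;
  quantity: number;
  enterprise: boolean; // Step 3: customer tier
  taxRate: number;     // Step 5: e.g. 0.1 for 10%
}

type Quote = { total: number } | { flag: 'NEEDS_APPROVAL' };

function quote(order: Order): Quote {
  // Step 2: volume discount
  let discount = 0;
  if (order.quantity > 500) discount = 0.2;
  else if (order.quantity > 100) discount = 0.1;
  // Step 3: Enterprise tier stacks an additional 5%
  if (order.enterprise) discount += 0.05;
  const perUnit = order.unitPrice * (1 - discount);
  // Step 4: minimum margin check (cost + 15%)
  if (perUnit < order.unitCost * 1.15) return { flag: 'NEEDS_APPROVAL' };
  // Step 5: apply tax, round to cents
  const total = perUnit * order.quantity * (1 + order.taxRate);
  return { total: Math.round(total * 100) / 100 };
}
```

Running both the model and this function and comparing the results turns silent arithmetic errors into visible disagreements.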

The Guardrail Pattern

For decisions with consequences, add explicit sanity checks:

Before finalizing your response, verify:
- [ ] All prices are positive numbers
- [ ] The total equals the sum of line items
- [ ] No discount exceeds 30% (our maximum allowed)
- [ ] The customer name matches the order

If any check fails, output "ERROR: [specific check that failed]" instead of the result.

This catches the model's own mistakes before they reach production. We have seen this pattern reduce errors from 8% to under 2% on financial calculation tasks.
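The same checklist can also run as a deterministic post-check on the parsed output, so a bad response is rejected even when the model's self-verification misses it. A minimal sketch with an illustrative result shape:

```typescript
interface QuoteResult {
  lineItems: { description: string; price: number }[];
  total: number;
  discountPercent: number;
}

// Returns the first failed check, or null when all checks pass.
function verifyQuote(q: QuoteResult): string | null {
  if (q.lineItems.some((item) => item.price <= 0)) {
    return 'ERROR: all prices must be positive';
  }
  const sum = q.lineItems.reduce((acc, item) => acc + item.price, 0);
  if (Math.abs(sum - q.total) > 0.01) {
    return 'ERROR: total does not equal sum of line items';
  }
  if (q.discountPercent > 30) {
    return 'ERROR: discount exceeds 30% maximum';
  }
  return null;
}
```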

Evaluation and Iteration

Prompt engineering without evaluation is just guessing. You need a systematic approach to measure quality and track improvements.

Building a Prompt Test Suite

Create a dataset of 50-100 representative inputs with expected outputs:

{
  "test_cases": [
    {
      "input": "I've been charged twice for my subscription",
      "expected_category": "billing",
      "expected_priority": "high",
      "expected_contains": ["refund", "duplicate charge"]
    },
    {
      "input": "Can you add dark mode to the dashboard?",
      "expected_category": "feature_request", 
      "expected_priority": "low"
    }
  ]
}

Run every prompt change against this test suite before deploying. Automated scoring compares the model's output against expected values. Any change that drops accuracy below threshold gets rejected.
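Scoring can be as simple as exact-match on the categorical fields plus substring checks on the free text. A sketch against the test-case shape above — the `ModelOutput` field names mirror the classification schema from earlier, and `expected_contains` is checked case-insensitively against the suggested action:

```typescript
interface TestCase {
  input: string;
  expected_category: string;
  expected_priority: string;
  expected_contains?: string[];
}

interface ModelOutput {
  category: string;
  priority: string;
  suggested_action: string;
}

// Score one case: 1 if every expectation holds, 0 otherwise.
function scoreCase(tc: TestCase, out: ModelOutput): number {
  if (out.category !== tc.expected_category) return 0;
  if (out.priority !== tc.expected_priority) return 0;
  const text = out.suggested_action.toLowerCase();
  for (const phrase of tc.expected_contains ?? []) {
    if (!text.includes(phrase.toLowerCase())) return 0;
  }
  return 1;
}

// Overall accuracy across the suite, outputs aligned by index.
function accuracy(cases: TestCase[], outputs: ModelOutput[]): number {
  const total = cases.reduce((acc, tc, i) => acc + scoreCase(tc, outputs[i]), 0);
  return total / cases.length;
}
```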

Regression Testing

Every prompt change can improve one area while breaking another. The test suite catches regressions:

  1. Run the current prompt against all test cases. Record scores.
  2. Make the prompt change.
  3. Run the new prompt against all test cases.
  4. Compare scores. Accept only if overall accuracy improves AND no category drops more than 5%.
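The acceptance rule in step 4 can be encoded directly. A sketch comparing per-category accuracy maps, where "drops more than 5%" is read as five percentage points:

```typescript
type Scores = Record<string, number>; // per-category accuracy, 0..1

// Accept a new prompt only if overall accuracy improves AND
// no single category drops by more than 5 percentage points.
function acceptPromptChange(oldScores: Scores, newScores: Scores): boolean {
  const categories = Object.keys(oldScores);
  const avg = (s: Scores) =>
    categories.reduce((acc, c) => acc + s[c], 0) / categories.length;
  if (avg(newScores) <= avg(oldScores)) return false;
  return categories.every((c) => newScores[c] >= oldScores[c] - 0.05);
}
```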

The Evaluation Loop

Week 1: Deploy prompt v1 → Monitor production outputs
Week 2: Sample 50 production outputs → Human review → Add failures to test suite
Week 3: Iterate on prompt v2 → Run test suite → Deploy if improved
Week 4: Repeat

Each cycle, the test suite grows with real-world failures. After 2-3 months, you have a comprehensive test suite that catches most failure modes before deployment.

Advanced Patterns

Few-Shot with Edge Cases

Instead of many examples of the happy path, include examples of the tricky cases:

Examples:

Input: "Your product is garbage" 
→ Category: COMPLAINT (not SUPPORT_REQUEST — no specific issue mentioned)

Input: "How much for 10,000 units?"
→ Category: SALES_INQUIRY (even though it mentions a number, it's asking for pricing)

Input: "The API returns 500 errors when I upload files over 10MB"
→ Category: SUPPORT_REQUEST (specific technical issue, not a complaint)

Edge case examples teach the model the decision boundaries that matter most.

Temperature and Sampling

For B2B use cases, set temperature to 0 or 0.1. Creativity is the enemy of consistency. You want the same input to produce the same output every time. The only exception is content generation tasks where variety is desired.

Prompt Versioning

Treat prompts like code:

  • Store prompts in version control (Git)
  • Tag each version with a test suite score
  • Roll back to the previous version if quality drops
  • Include the prompt version in your monitoring logs
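A minimal shape for this, assuming prompts live as tagged objects in the repo (names and values below are illustrative):

```typescript
interface PromptVersion {
  id: string;          // stable identifier, e.g. 'ticket-classifier'
  version: string;     // git tag or semver
  suiteScore: number;  // accuracy from the last test-suite run
  template: string;
}

const ticketClassifier: PromptVersion = {
  id: 'ticket-classifier',
  version: '2.3.0',
  suiteScore: 0.94,
  template: 'Classify this email into EXACTLY ONE of these categories: ...',
};

// Stamp every model call's log entry with the prompt version, so any
// production output can be traced back to the exact prompt that made it.
function logEntry(prompt: PromptVersion, requestId: string): string {
  return JSON.stringify({
    requestId,
    promptId: prompt.id,
    promptVersion: prompt.version,
  });
}
```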

FAQ

How do I handle prompt injection in B2B contexts? Input sanitization is the first defense — strip instructions, system prompt fragments, and unusual formatting before the input reaches the model. The second defense is output validation — check that the model's output conforms to expected formats and value ranges. The third is monitoring — flag responses that deviate from expected patterns. No single defense is sufficient.
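The first two layers can be sketched as follows — the deny-list patterns are illustrative only; a production sanitizer needs a broader set tuned to your actual inputs:

```typescript
// Layer 1: strip common injection phrasings before the input
// reaches the model. Illustrative patterns, not an exhaustive list.
const INJECTION_PATTERNS = [
  /ignore (all )?(previous|prior) instructions/gi,
  /you are now\b/gi,
  /system prompt/gi,
];

function sanitize(input: string): string {
  return INJECTION_PATTERNS.reduce(
    (text, pattern) => text.replace(pattern, '[removed]'),
    input
  );
}

// Layer 2: validate the model's output against the expected contract,
// e.g. a classification must be one of the allowed labels.
function validateOutput(raw: string, allowed: string[]): boolean {
  return allowed.includes(raw.trim());
}
```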

Should we version prompts and who should own them? Yes. Prompts should live in your codebase, reviewed like code, and deployed through your CI/CD pipeline. Ownership depends on the use case — product managers own the intent (what the prompt should achieve), engineers own the implementation (how it achieves it), and domain experts review accuracy.

What happens when we switch models? Every model change requires re-running your evaluation suite. Different models interpret the same prompt differently. A prompt optimized for GPT-4o may underperform on Claude, and vice versa. Budget 1-2 days of prompt tuning per model migration.

How do I reduce costs without hurting quality? Start by removing unnecessary instructions from the system prompt. Then test whether a smaller, cheaper model achieves acceptable accuracy for your specific task. For classification tasks, GPT-4o-mini or Haiku often match larger models at 10x lower cost.

Effective prompt engineering is what separates AI features that deliver ROI from ones that get turned off. If your business needs reliable, production-grade prompts, our team builds them.

Written by Empirium Team
