AI Phone Agents: Vapi vs Retell vs Bland vs Custom

Empirium Team · 12 min read

Phone-based AI agents are the fastest-growing category in business automation. A well-built voice agent handles appointment booking, lead qualification, and first-line support at a fraction of the cost of a human call center.

The challenge is latency. Humans expect conversational response times under 800ms. Anything slower feels like talking to a machine. The platform you choose determines whether your agent sounds natural or robotic.

We have tested every major platform at Empirium while building voice agents for clients. Here is the comparison.

The Voice AI Platform Landscape

Four approaches dominate the market:

  • Vapi: Developer-focused platform with the most flexibility. Bring your own LLM, TTS, and STT.
  • Retell: End-to-end platform optimized for low latency. Integrated LLM, voice, and telephony.
  • Bland: Sales-focused platform with pre-built campaigns, call routing, and CRM integrations.
  • Custom build: Assemble your own stack from STT, LLM, and TTS APIs connected via WebSockets.

Each approach trades off between ease of setup, customization depth, and operational control.

Technical Comparison

Latency Benchmarks

The critical metric is time to first speech byte — the delay between the user finishing their sentence and the agent starting its response. We measured across 100 test calls per platform:

| Platform | p50 Latency | p99 Latency | Interruption Handling |
|---|---|---|---|
| Vapi (with Claude Sonnet) | 850ms | 1,400ms | Good: detects and stops within 200ms |
| Vapi (with GPT-4o) | 720ms | 1,200ms | Good |
| Retell | 600ms | 1,000ms | Excellent: native interruption detection |
| Bland | 900ms | 1,600ms | Adequate: 300ms detection delay |
| Custom (Deepgram + Claude + ElevenLabs) | 750ms | 1,300ms | Depends on implementation |
| Custom (Deepgram + GPT-4o-realtime) | 450ms | 800ms | Excellent: native multimodal |

Retell achieves the lowest latency through tight integration between their STT, LLM, and TTS components. Custom builds with GPT-4o's realtime API achieve even lower latency by eliminating the STT/TTS pipeline entirely — the model processes audio directly.
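For readers who want to reproduce this kind of measurement, the following is a minimal sketch of computing p50 and p99 time-to-first-speech from per-turn timestamps. The timestamp pairs are made-up illustration data, not the benchmark results above, and the nearest-rank percentile definition is an assumption; other definitions interpolate between samples.

```python
import math

def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical data: (user_stopped_speaking_ms, agent_first_audio_ms) per turn.
turns = [(1000, 1620), (5000, 5710), (9000, 9580), (13000, 14150), (17000, 17640)]

# Time to first speech is the gap between the two timestamps.
latencies = [first_audio - end_of_speech for end_of_speech, first_audio in turns]

p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
```

With small sample counts the p99 is effectively the worst observed turn, so it is worth collecting many turns per platform before comparing tail latency.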

Voice Quality

| Platform | Voice Options | Custom Voice Cloning | Emotion/Prosody |
|---|---|---|---|
| Vapi | ElevenLabs, PlayHT, Deepgram, Azure | Yes (via ElevenLabs) | Provider-dependent |
| Retell | Proprietary + ElevenLabs | Yes | Good: natural intonation |
| Bland | Proprietary | Limited | Basic |
| Custom | Any TTS provider | Yes | Full control |

ElevenLabs voices remain the gold standard for natural-sounding speech. Retell's proprietary voices are competitive and avoid the additional ElevenLabs cost.

Telephony Integration

| Platform | SIP Trunking | Phone Numbers | Call Transfer | Recording |
|---|---|---|---|---|
| Vapi | Twilio, Vonage | Included (US/CA/UK) | Warm + cold transfer | Built-in |
| Retell | Twilio | Included (US) | Warm transfer | Built-in |
| Bland | Proprietary | Included (US/CA) | Advanced routing | Built-in |
| Custom | Any SIP provider | Self-managed | Self-implemented | Self-managed |

Bland excels at telephony — their call routing, campaign management, and multi-line dialing are purpose-built for sales operations. Vapi offers the broadest telephony provider support.

Pricing Models

Voice AI pricing is notoriously complex. The "per minute" rate advertised rarely reflects the total cost.

Cost Breakdown Per 10-Minute Call

| Component | Vapi | Retell | Bland | Custom |
|---|---|---|---|---|
| Platform fee | $0.05/min = $0.50 | $0.07/min = $0.70 | $0.09/min = $0.90 | $0.00 |
| LLM (Claude Sonnet) | ~$0.30 | Included | Included | ~$0.30 |
| TTS (ElevenLabs) | ~$0.15 | Included (if using built-in) | Included | ~$0.15 |
| STT (Deepgram) | ~$0.10 | Included | Included | ~$0.10 |
| Telephony | ~$0.08 | ~$0.08 | Included | ~$0.10 |
| Total per 10-min call | $1.13 | $0.78 | $0.90 | $0.65 |
| Cost per minute | $0.113 | $0.078 | $0.090 | $0.065 |

At 500 calls per day averaging 5 minutes each (75,000 minutes over a 30-day month):

| Platform | Monthly Cost |
|---|---|
| Vapi | ~$8,475 |
| Retell | ~$5,850 |
| Bland | ~$6,750 |
| Custom | ~$4,875 |

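The custom approach is cheapest per minute but requires 4-8 weeks of engineering to build, plus ongoing maintenance. Factor in $3,000-$5,000/month of engineering time and the cost advantage disappears at lower volumes: at the per-minute rates above and 5-minute calls, custom breaks even against Vapi at roughly 420-700 calls/day and against Retell at roughly 1,500-2,600 calls/day.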
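To rerun this comparison with your own volumes, here is a small sketch of the arithmetic. The per-minute rates are the totals from the cost breakdown table; the 30-day month and the function names are assumptions made for illustration.

```python
# Per-minute totals taken from the cost breakdown table above.
COST_PER_MIN = {"Vapi": 0.113, "Retell": 0.078, "Bland": 0.090, "Custom": 0.065}
DAYS_PER_MONTH = 30  # assumption; matches the monthly figures above

def monthly_cost(platform, calls_per_day, avg_minutes):
    """Usage cost per month, excluding engineering time."""
    minutes = calls_per_day * avg_minutes * DAYS_PER_MONTH
    return COST_PER_MIN[platform] * minutes

def break_even_calls_per_day(platform, engineering_per_month, avg_minutes):
    """Daily call volume at which the custom stack's per-minute savings
    over `platform` equal the monthly engineering cost."""
    saving_per_min = COST_PER_MIN[platform] - COST_PER_MIN["Custom"]
    return engineering_per_month / (saving_per_min * avg_minutes * DAYS_PER_MONTH)

vapi_monthly = monthly_cost("Vapi", calls_per_day=500, avg_minutes=5)
vapi_break_even = break_even_calls_per_day("Vapi", 3000, avg_minutes=5)
```

Calling `monthly_cost` with 500 calls/day and 5-minute calls reproduces the monthly table, and `break_even_calls_per_day` shows how sensitive the build-versus-buy decision is to the engineering budget you assume.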

Integration and Customization

CRM Integration

| Platform | Native CRM | Custom Webhooks | Tool Calling |
|---|---|---|---|
| Vapi | HubSpot, Salesforce (via tools) | | Full LLM tool calling |
| Retell | Limited native | | Custom functions |
| Bland | HubSpot, Salesforce, GoHighLevel | | Pre-built actions |
| Custom | Whatever you build | | Full control |

For sales operations, Bland's native CRM integrations save significant setup time. For custom workflows, Vapi's tool calling gives the most flexibility — your agent can call any API during a conversation.
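As a rough illustration of what tool calling involves on your side, the sketch below registers a handler and dispatches a tool call emitted by the LLM. The call shape (`name` plus a JSON string of `arguments`), the `book_appointment` tool, and the registry are all hypothetical; each platform defines its own payload format, so check its documentation before relying on this shape.

```python
import json

TOOLS = {}  # hypothetical registry: tool name -> handler function

def tool(fn):
    """Decorator that registers a function under its own name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def book_appointment(name, slot):
    # A real handler would call your CRM or calendar API here.
    return {"status": "booked", "name": name, "slot": slot}

def dispatch(tool_call):
    """Run one tool call of the assumed shape
    {"name": str, "arguments": JSON-encoded string}."""
    fn = TOOLS.get(tool_call["name"])
    if fn is None:
        return {"error": f"unknown tool: {tool_call['name']}"}
    try:
        args = json.loads(tool_call["arguments"])
        return fn(**args)
    except (json.JSONDecodeError, TypeError) as exc:
        # Bad JSON or wrong argument names: report back instead of crashing the call.
        return {"error": str(exc)}

result = dispatch({
    "name": "book_appointment",
    "arguments": '{"name": "Dana", "slot": "2025-03-04T10:00"}',
})
```

Returning errors as data rather than raising lets the LLM apologize or retry mid-conversation instead of dropping the call.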

Transfer-to-Human Flows

Transfer to a human is the most critical integration for production voice agents. When the AI cannot handle a call, it must hand off smoothly to a human agent.

Vapi and Retell support warm transfers — the AI briefs the human agent on the conversation context before connecting. Bland supports advanced routing rules that can transfer based on caller intent, time of day, or agent availability.
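If you implement this logic yourself, a transfer decision can be sketched as a small rule function. Everything here is an assumption for illustration: the intent labels, the business hours, the two-failed-turns rule, and the returned action names are hypothetical and do not correspond to any platform's API.

```python
from dataclasses import dataclass
from datetime import time

@dataclass
class CallState:
    intent: str        # caller intent as classified upstream, e.g. by the LLM
    failed_turns: int  # consecutive turns the agent could not resolve
    summary: str       # running summary of the conversation so far

ESCALATION_INTENTS = {"billing_dispute", "cancel_account"}  # hypothetical labels
BUSINESS_HOURS = (time(9, 0), time(17, 0))                  # hypothetical hours

def decide_transfer(state, now, agents_available):
    """Pick the next action from intent, time of day, and agent availability."""
    needs_human = state.intent in ESCALATION_INTENTS or state.failed_turns >= 2
    if not needs_human:
        return {"action": "continue"}
    in_hours = BUSINESS_HOURS[0] <= now < BUSINESS_HOURS[1]
    if in_hours and agents_available > 0:
        # Warm transfer: the human receives context before being connected.
        return {"action": "warm_transfer",
                "briefing": f"Intent: {state.intent}. {state.summary}"}
    return {"action": "schedule_callback"}

state = CallState("billing_dispute", 0, "Caller disputes the March invoice.")
decision = decide_transfer(state, time(10, 30), agents_available=2)
```

The fallback branch matters as much as the transfer itself: outside business hours or with no agents free, the caller should get a callback rather than a dead-end queue.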

Multilingual Support

| Platform | Languages | Real-time Translation |
|---|---|---|
| Vapi | 20+ (via provider) | Via LLM |
| Retell | 10+ | Native |
| Bland | English primary, 5+ limited | No |
| Custom | Unlimited | Via LLM |

For international operations, Vapi's provider flexibility gives the broadest language support. Retell's native multilingual handling is smoother but covers fewer languages.

Building Custom Voice Agents

When platform limitations become blockers — latency requirements, custom audio processing, or unique telephony needs — building custom is the answer.

The Custom Stack

Phone Network → SIP Trunk (Twilio/Telnyx)
    → WebSocket → STT (Deepgram streaming)
        → LLM (Claude/GPT-4o streaming)
            → TTS (ElevenLabs streaming)
                → Audio Buffer → WebSocket → Phone

Each component streams its output to the next. The key is maintaining a streaming pipeline end-to-end — if any component buffers the full response before passing it along, latency spikes.
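The streaming idea can be sketched with chained async generators, where each stage hands over output as soon as it has any. The STT, LLM, and TTS stages below are stubs with invented behavior, standing in for real provider SDK calls; only the chaining pattern is the point.

```python
import asyncio

events = []  # records the order in which stages produce output

async def caller_audio():
    """Stub audio source: yields chunks as they arrive from the phone."""
    for chunk in (b"book", b"an", b"appointment"):
        await asyncio.sleep(0)
        yield chunk

async def stt(audio_chunks):
    """Stub STT: emits one transcript when the utterance ends."""
    words = []
    async for chunk in audio_chunks:
        words.append(chunk.decode())
    yield " ".join(words)

async def llm(transcripts):
    """Stub LLM: streams reply tokens one at a time instead of one full string."""
    async for transcript in transcripts:
        events.append(f"heard:{transcript}")
        for token in ("Sure,", "booking", "now."):
            await asyncio.sleep(0)  # stand-in for the wait between tokens
            events.append(f"llm:{token}")
            yield token

async def tts(tokens):
    """Stub TTS: synthesizes each token as soon as it arrives."""
    async for token in tokens:
        events.append(f"tts:{token}")
        yield token.encode()  # stand-in for audio bytes

async def run_pipeline():
    events.clear()
    return [audio async for audio in tts(llm(stt(caller_audio())))]
```

Running `asyncio.run(run_pipeline())` and inspecting `events` shows TTS output interleaved with LLM tokens: the first audio is produced after the first token, not after the full reply.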

Critical Implementation Details

Voice Activity Detection (VAD): Detecting when the user stops speaking is surprisingly hard. Background noise, pauses mid-sentence, and "um" sounds all create false triggers. Use Deepgram's or Silero's VAD with a 300-500ms silence threshold.
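A minimal sketch of the silence-threshold logic, assuming an upstream VAD already labels each 20ms frame as speech or non-speech. The frame size and class name are assumptions; the per-frame decisions would come from a model such as Silero or from your STT provider's endpointing.

```python
FRAME_MS = 20  # assumed frame duration

class EndOfSpeechDetector:
    """Declares end of turn after a run of silence following speech."""

    def __init__(self, silence_threshold_ms=400):
        self.silence_threshold_ms = silence_threshold_ms
        self.silence_ms = 0
        self.heard_speech = False

    def push(self, is_speech):
        """Feed one per-frame VAD decision; returns True once when the turn ends."""
        if is_speech:
            self.heard_speech = True
            self.silence_ms = 0  # any speech resets the silence timer
            return False
        if not self.heard_speech:
            return False         # ignore silence before the caller has spoken
        self.silence_ms += FRAME_MS
        if self.silence_ms >= self.silence_threshold_ms:
            self.heard_speech = False
            self.silence_ms = 0
            return True
        return False

# Speech, a 200ms mid-sentence pause, more speech, then 400ms of silence.
frames = [True] * 10 + [False] * 10 + [True] * 10 + [False] * 20
detector = EndOfSpeechDetector(silence_threshold_ms=400)
end_frames = [i for i, is_speech in enumerate(frames) if detector.push(is_speech)]
```

In this example the 200ms pause does not end the turn; only the final 400ms of silence does. Lowering the threshold cuts latency but makes the agent more likely to jump in during natural pauses.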

Interruption Handling: When the user speaks while the agent is still talking, the agent must stop immediately. This requires canceling the current TTS stream, discarding any buffered audio, and processing the user's new input. Most platform-specific bugs happen here.
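The cancel-and-discard sequence can be sketched with asyncio as follows. The `AgentSpeech` class and the simulated TTS delay are hypothetical; a real implementation would also close the provider's streaming connection and clear any audio already queued at the telephony layer.

```python
import asyncio

class AgentSpeech:
    """Holds the in-flight TTS task and the audio buffered for playback."""

    def __init__(self):
        self.buffer = asyncio.Queue()
        self.synthesized = 0
        self._task = None

    async def _stream_tts(self, sentences):
        for sentence in sentences:
            await asyncio.sleep(0.05)  # stand-in for streaming TTS latency
            self.buffer.put_nowait(sentence.encode())
            self.synthesized += 1

    def speak(self, sentences):
        self._task = asyncio.create_task(self._stream_tts(sentences))

    async def interrupt(self):
        """Barge-in: cancel the TTS stream, then discard audio not yet played."""
        if self._task is not None and not self._task.done():
            self._task.cancel()
            # Wait for the cancellation to finish without re-raising it here.
            await asyncio.gather(self._task, return_exceptions=True)
        while not self.buffer.empty():
            self.buffer.get_nowait()

    @property
    def cancelled(self):
        return self._task is not None and self._task.cancelled()

async def demo():
    speech = AgentSpeech()
    speech.speak(["One.", "Two.", "Three.", "Four.", "Five."])
    await asyncio.sleep(0.12)  # caller starts talking mid-reply
    await speech.interrupt()
    return speech.synthesized, speech.buffer.empty(), speech.cancelled
```

The order matters: cancel the producer first, then drain the buffer. Draining first leaves a window in which the TTS task can enqueue more audio after you thought the buffer was empty.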

Audio Buffering: Keep a 100-200ms audio buffer to smooth out network jitter. Larger buffers increase latency. Smaller buffers cause audio glitches on unstable connections.
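One way to picture this trade-off is a small playout (jitter) buffer that prefills to a target depth before releasing audio. This is a simplified sketch with an assumed 20ms frame size and hypothetical class name; production buffers usually also adapt their depth and handle out-of-order packets.

```python
from collections import deque

FRAME_MS = 20  # assumed frame duration

class JitterBuffer:
    """Prefills to a target depth, then releases one frame per playout tick."""

    def __init__(self, target_ms=120):
        self.target_frames = target_ms // FRAME_MS
        self.frames = deque()
        self.started = False

    def push(self, frame):
        """Called whenever a frame arrives from the network."""
        self.frames.append(frame)
        if len(self.frames) >= self.target_frames:
            self.started = True

    def pop(self):
        """Called every FRAME_MS by the playout clock.
        Returns a frame, or None when the caller should play silence."""
        if not self.started:
            return None           # still prefilling
        if not self.frames:
            self.started = False  # underrun: prefill again before resuming
            return None
        return self.frames.popleft()

buf = JitterBuffer(target_ms=100)  # 5 frames of headroom
for i in range(4):
    buf.push(i)
before_prefill = buf.pop()         # not enough audio buffered yet
buf.push(4)
first_frame = buf.pop()            # prefill reached, playback starts
```

The `target_ms` value is the latency you pay on every turn, which is why the 100-200ms range is a compromise rather than a free choice.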

When to Go Custom

  • You need sub-500ms latency (possible with GPT-4o realtime API)
  • You need custom audio processing (noise cancellation, accent adaptation)
  • You are making 1,000+ calls/day (cost savings justify the engineering investment)
  • You need telephony features that platforms do not support (custom IVR, conference calls)

FAQ

What call quality metrics should I track? Track latency (time to first speech), interruption rate (how often the agent talks over the user), task completion rate (did the call achieve its goal), and caller satisfaction (post-call survey or sentiment analysis). Aim for: latency under 1 second, interruption rate under 5%, task completion above 80%.
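A sketch of computing these metrics from call logs and checking them against the targets above. The `CallRecord` fields are a hypothetical log schema, the sample calls are invented, and measuring interruption rate per agent turn is an assumption about how to define it.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class CallRecord:
    turn_latencies_ms: list   # time to first speech for each agent turn
    agent_turns: int
    agent_interruptions: int  # turns where the agent talked over the caller
    task_completed: bool

def summarize(calls):
    latencies = [ms for call in calls for ms in call.turn_latencies_ms]
    turns = sum(call.agent_turns for call in calls)
    return {
        "median_latency_ms": median(latencies),
        "interruption_rate": sum(c.agent_interruptions for c in calls) / turns,
        "task_completion_rate": sum(c.task_completed for c in calls) / len(calls),
    }

def meets_targets(summary):
    """Targets from the text: under 1 second, under 5%, above 80%."""
    return {
        "latency": summary["median_latency_ms"] < 1000,
        "interruptions": summary["interruption_rate"] < 0.05,
        "completion": summary["task_completion_rate"] > 0.80,
    }

calls = [
    CallRecord([600, 700, 800], agent_turns=3, agent_interruptions=0, task_completed=True),
    CallRecord([900, 1200], agent_turns=2, agent_interruptions=1, task_completed=False),
]
summary = summarize(calls)
report = meets_targets(summary)
```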

Do I need to comply with call recording laws? Yes. Two-party consent states (California, Illinois, and others) require every party's consent to recording, so tell the caller at the start that the call is recorded. Disclosing that they are speaking with an AI is a separate obligation that varies by jurisdiction and use case, and it is good practice regardless. Most platforms include a configurable greeting for this purpose. For regulated industries, consult legal counsel on additional disclosure requirements.

How well do voice agents handle accents and noisy environments? STT accuracy drops 10-20% with heavy accents and 15-30% in noisy environments. Deepgram's Nova model handles accents best. For noisy environments, pre-processing the audio with noise suppression (Krisp or RNNoise) before STT significantly improves accuracy.

Can I scale from 10 to 10,000 calls per day? Platforms handle scaling automatically. Custom builds require horizontal scaling of your WebSocket servers and careful management of concurrent connections. Plan for 1 server per 50-100 concurrent calls.
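For a back-of-the-envelope capacity estimate, the sketch below converts daily call volume into concurrent calls and then into a server count. The 8 busy hours and the 2x peak factor are assumptions you should replace with your own traffic data; the 50 calls per server is the conservative end of the range above.

```python
import math

def concurrent_calls(calls_per_day, avg_minutes, busy_hours=8, peak_factor=2.0):
    """Estimate peak concurrency as arrival rate times call duration
    (Little's law), scaled by an assumed peak-to-average factor."""
    arrivals_per_min = calls_per_day / (busy_hours * 60)
    return arrivals_per_min * avg_minutes * peak_factor

def servers_needed(concurrent, calls_per_server=50):
    """Round up, and always keep at least one server running."""
    return max(1, math.ceil(concurrent / calls_per_server))

peak = concurrent_calls(10_000, avg_minutes=5)
conservative = servers_needed(peak, calls_per_server=50)
optimistic = servers_needed(peak, calls_per_server=100)
```

Under these assumptions, 10,000 five-minute calls per day come to a little over 200 concurrent calls at peak, so the server count stays in single digits; connection handling and failover tend to be harder than raw capacity.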

Voice AI is the highest-impact AI application for businesses that handle inbound calls. If you need help choosing a platform or building a custom voice agent, we can help.
