Overview
Chat interfaces have dominated how we interact with AI, but recent breakthroughs in multimodal AI are opening up exciting new possibilities. High-quality generative models and expressive text-to-speech (TTS) systems now make it possible to build agents that feel less like tools and more like conversational partners. Voice agents are one example of this. Instead of relying on a keyboard and mouse to type inputs into an agent, you can speak to it directly. This can be a more natural and engaging way to interact with AI, and is especially useful in contexts where typing is impractical, such as hands-free interfaces.

What are voice agents?
Voice agents are agents that can engage in natural spoken conversations with users. These agents combine speech recognition, natural language processing, generative AI, and text-to-speech technologies to create seamless, natural conversations. They're suited to a variety of use cases, including:
- Customer support
- Personal assistants
- Hands-free interfaces
- Coaching and training
How do voice agents work?
At a high level, every voice agent needs to handle three tasks:
- Listen - capture audio and transcribe it
- Think - interpret intent, reason, plan
- Speak - generate audio and stream it back to the user
1. STT > Agent > TTS Architecture (The "Sandwich")
The Sandwich architecture composes three distinct components: speech-to-text (STT), a text-based LangChain agent, and text-to-speech (TTS). A minimal sketch of this composition follows the lists below.

Pros:
- Full control over each component (swap STT/TTS providers as needed)
- Access to the latest capabilities from modern text-modality models
- Transparent behavior with clear boundaries between components

Cons:
- Requires orchestrating multiple services
- Additional complexity in managing the pipeline
- Conversion from speech to text loses information (e.g., tone, emotion)
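To make the composition concrete, here is a minimal sketch of the sandwich as three chained async generators. The stage names and the fake microphone are illustrative placeholders rather than the demo's actual code; the real stages are built out in the walkthrough below.

```python
import asyncio
from typing import AsyncIterator


async def stt_stage(audio: AsyncIterator[bytes]) -> AsyncIterator[str]:
    """Placeholder STT stage: stands in for a streaming provider such as AssemblyAI."""
    async for chunk in audio:
        yield f"transcript({len(chunk)} bytes)"


async def agent_stage(transcripts: AsyncIterator[str]) -> AsyncIterator[str]:
    """Placeholder agent stage: stands in for a streaming LangChain agent."""
    async for text in transcripts:
        yield f"response to {text!r}"


async def tts_stage(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Placeholder TTS stage: stands in for a streaming provider such as ElevenLabs."""
    async for token in tokens:
        yield token.encode()


async def microphone() -> AsyncIterator[bytes]:
    """Fake microphone that emits a few PCM-like chunks."""
    for _ in range(3):
        yield b"\x00" * 320


async def main() -> None:
    # The "sandwich": audio in -> text in the middle -> audio out.
    async for audio_out in tts_stage(agent_stage(stt_stage(microphone()))):
        print(f"{len(audio_out)} bytes of synthesized audio")


asyncio.run(main())
```

Because each stage is an async generator that consumes the previous one, audio, transcripts, and tokens all flow through the pipeline as they are produced rather than in batches.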
2. Speech-to-Speech Architecture (S2S)
Speech-to-speech uses a multimodal model that processes audio input and generates audio output natively.

Pros:
- Simpler architecture with fewer moving parts
- Typically lower latency for simple interactions
- Direct audio processing captures tone and other nuances of speech

Cons:
- Limited model options and greater risk of provider lock-in
- Features may lag behind text-modality models
- Less transparency in how audio is processed
- Reduced controllability and customization options
Demo application overview
We'll walk through building a voice-based agent using the sandwich architecture. The agent will manage orders for a sandwich shop. The application will demonstrate all three components of the sandwich architecture, using AssemblyAI for STT and ElevenLabs for TTS (although adapters can be built for most providers). An end-to-end reference application is available in the voice-sandwich-demo repository. We will walk through that application here. The demo uses WebSockets for real-time bidirectional communication between the browser and server. The same architecture can be adapted for other transports like telephony systems (Twilio, Vonage) or WebRTC connections.

Architecture
The demo implements a streaming pipeline where each stage processes data asynchronously:

Client (Browser)
- Captures microphone audio and encodes it as PCM
- Establishes a WebSocket connection to the backend server
- Streams audio chunks to the server in real time
- Receives and plays back synthesized speech audio

Server
- Accepts WebSocket connections from clients
- Orchestrates the three-step pipeline (a sketch follows this list):
  - Speech-to-text (STT): Forwards audio to the STT provider (e.g., AssemblyAI), receives transcript events
  - Agent: Processes transcripts with the LangChain agent, streams response tokens
  - Text-to-speech (TTS): Sends agent responses to the TTS provider (e.g., ElevenLabs), receives audio chunks
- Returns synthesized audio to the client for playback
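As a rough illustration of the server side, the sketch below wires the three stages to a WebSocket endpoint. It assumes FastAPI for the transport and reuses the single-argument placeholder stages from the earlier sketch; the demo's actual transport code and identifiers may differ.

```python
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

# stt_stage, agent_stage, tts_stage: the placeholder stage generators from the
# sketch above; in the real demo these are the STT, agent, and TTS components
# described in the sections that follow.
app = FastAPI()


async def client_audio(websocket: WebSocket):
    """Yield raw PCM chunks as the browser streams them over the WebSocket."""
    try:
        while True:
            yield await websocket.receive_bytes()
    except WebSocketDisconnect:
        return


@app.websocket("/ws")
async def voice_session(websocket: WebSocket) -> None:
    await websocket.accept()
    # STT -> agent -> TTS, each stage consuming the previous stage's stream.
    transcripts = stt_stage(client_audio(websocket))
    tokens = agent_stage(transcripts)
    async for audio_chunk in tts_stage(tokens):
        # Return synthesized audio to the client for playback.
        await websocket.send_bytes(audio_chunk)
```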
Setup
For detailed installation instructions and setup, see the repository README.

1. Speech-to-text
The STT stage transforms an incoming audio stream into text transcripts. The implementation uses a producer-consumer pattern to handle audio streaming and transcript reception concurrently.

Key Concepts
Producer-Consumer Pattern: Audio chunks are sent to the STT service concurrently with receiving transcript events. This allows transcription to begin before all audio has arrived.

Event Types:
- stt_chunk: Partial transcripts provided as the STT service processes audio
- stt_output: Final, formatted transcripts that trigger agent processing
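The exact event schema isn't reproduced here; as an assumed illustration, the two transcript events could be modeled as tagged dictionaries like these (field names are hypothetical):

```python
from typing import Literal, TypedDict


class STTChunk(TypedDict):
    """Hypothetical shape of a partial transcript event."""
    type: Literal["stt_chunk"]
    text: str


class STTOutput(TypedDict):
    """Hypothetical shape of a final transcript event."""
    type: Literal["stt_output"]
    text: str
```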
Implementation
AssemblyAI Client
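Below is a minimal sketch of the producer-consumer pattern for this stage. SttConnection is a hypothetical interface standing in for the provider's streaming client (for example, AssemblyAI's realtime WebSocket API); the real client's method names and event shapes will differ.

```python
import asyncio
from typing import AsyncIterator, Protocol


class SttConnection(Protocol):
    """Assumed interface over the provider's streaming transcription API."""

    async def send_audio(self, chunk: bytes) -> None: ...
    async def finish(self) -> None: ...
    def events(self) -> AsyncIterator[dict]: ...


async def stt_stage(
    audio: AsyncIterator[bytes], conn: SttConnection
) -> AsyncIterator[dict]:
    async def produce() -> None:
        # Producer: forward audio chunks to the STT service as they arrive.
        async for chunk in audio:
            await conn.send_audio(chunk)
        await conn.finish()

    producer = asyncio.create_task(produce())
    try:
        # Consumer: yield transcript events while audio is still being sent,
        # so transcription begins before all audio has arrived.
        # Partial transcripts -> stt_chunk events; finals -> stt_output events.
        async for event in conn.events():
            yield event
    finally:
        producer.cancel()
```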
2. LangChain agent
The agent stage processes text transcripts through a LangChain agent and streams the response tokens. In this case, we stream all text content blocks generated by the agent.

Key Concepts
Streaming Responses: The agent uses stream_mode="messages" to emit response tokens as they're generated, rather than waiting for the complete response. This enables the TTS stage to begin synthesis immediately.
Conversation Memory: A checkpointer maintains conversation state across turns using a unique thread ID. This allows the agent to reference previous exchanges in the conversation.
Implementation
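A minimal sketch of the agent stage is shown below. It assumes a LangGraph create_react_agent with an in-memory checkpointer; the model choice, the add_to_order tool, and the content handling are illustrative rather than the demo's exact code.

```python
from typing import AsyncIterator

from langchain.chat_models import init_chat_model
from langchain_core.messages import AIMessageChunk
from langgraph.checkpoint.memory import MemorySaver
from langgraph.prebuilt import create_react_agent


def add_to_order(item: str) -> str:
    """Add a sandwich to the customer's order (illustrative tool)."""
    return f"Added {item} to the order."


agent = create_react_agent(
    init_chat_model("openai:gpt-4o-mini"),  # illustrative model choice
    tools=[add_to_order],
    checkpointer=MemorySaver(),  # keeps conversation state across turns
)


async def agent_stage(
    transcripts: AsyncIterator[str], thread_id: str
) -> AsyncIterator[str]:
    # The thread ID ties successive turns to the same conversation state.
    config = {"configurable": {"thread_id": thread_id}}
    async for transcript in transcripts:
        # stream_mode="messages" yields (message_chunk, metadata) tuples as
        # tokens are generated, so TTS can start before the reply is complete.
        async for chunk, _meta in agent.astream(
            {"messages": [{"role": "user", "content": transcript}]},
            config,
            stream_mode="messages",
        ):
            # Simplified content handling: assumes string content on AI chunks.
            if isinstance(chunk, AIMessageChunk) and isinstance(chunk.content, str):
                if chunk.content:
                    yield chunk.content
```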
3. Text-to-speech
The TTS stage synthesizes agent response text into audio and streams it back to the client. Like the STT stage, it uses a producer-consumer pattern to handle concurrent text sending and audio reception.

Key Concepts
Concurrent Processing: The implementation merges two async streams:
- Upstream processing: Passes through all events and sends agent text chunks to the TTS provider
- Audio reception: Receives synthesized audio chunks from the TTS provider
Implementation
ElevenLabs Client
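As with the STT stage, here is a hedged sketch of the producer-consumer pattern. TtsConnection is a hypothetical interface standing in for the provider's streaming client (for example, ElevenLabs' WebSocket streaming input), and the sketch simplifies away the pass-through of non-text events.

```python
import asyncio
from typing import AsyncIterator, Protocol


class TtsConnection(Protocol):
    """Assumed interface over the provider's streaming synthesis API."""

    async def send_text(self, text: str) -> None: ...
    async def flush(self) -> None: ...
    def audio_chunks(self) -> AsyncIterator[bytes]: ...


async def tts_stage(
    tokens: AsyncIterator[str], conn: TtsConnection
) -> AsyncIterator[bytes]:
    async def produce() -> None:
        # Upstream processing: forward agent text chunks as they stream in.
        async for text in tokens:
            await conn.send_text(text)
        await conn.flush()  # signal that no more text is coming

    producer = asyncio.create_task(produce())
    try:
        # Audio reception: yield synthesized audio chunks as they arrive,
        # concurrently with the text still being sent above.
        async for chunk in conn.audio_chunks():
            yield chunk
    finally:
        producer.cancel()
```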