Production Voice AI Orchestration: Why Two-Level Architecture Wins

Introduction

Voice agents that work in production almost always separate two concerns: the audio layer (real-time, transport-bound, latency-critical) and the cognitive layer (turn-based, transport-agnostic, reasoning-heavy). Mixing them is the fastest path to production failure.

This post shares the orchestration architecture we built for our Voice AI Platform — what each layer owns, why the separation matters, how the handoff actually works, and the practical engineering wins we get from keeping them apart.

If you're building voice agents — whether for support, lead qualification, intake, or outbound — this architecture will save you from the most common production failure modes we see in the field.

Why two levels: the fundamental constraint

Voice agents have to satisfy two fundamentally incompatible engineering requirements simultaneously.

Requirement 1: Hard real-time audio

Phone audio runs at 8kHz μ-law over WebSocket. Every 20ms, you get a 160-byte chunk. The system has to receive these chunks, transcribe them, detect when the speaker stops talking, send the transcript to the LLM, get a response, synthesize speech, and stream audio back, all while keeping end-to-end latency under 1.5 seconds so the conversation feels natural.

This is a hard-real-time problem. Latency budgets are tight. State management is per-call. Failure modes are immediate and audible.
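
The 160-byte figure follows directly from the audio format. A minimal sketch of the arithmetic, assuming 8-bit μ-law samples (one byte each); the constant and function names are illustrative, not part of any provider's API:

```typescript
// Assumed parameters for narrowband telephony audio; adjust for your transport.
const SAMPLE_RATE_HZ = 8000; // 8 kHz
const BYTES_PER_SAMPLE = 1;  // μ-law encodes each sample in 8 bits
const FRAME_MS = 20;         // one chunk every 20 ms

// Bytes in one audio frame of the given duration.
function frameBytes(sampleRateHz: number, frameMs: number, bytesPerSample: number): number {
  return (sampleRateHz * frameMs * bytesPerSample) / 1000;
}

// Frames arriving per second at the given frame duration.
function framesPerSecond(frameMs: number): number {
  return 1000 / frameMs;
}

const bytesPerFrame = frameBytes(SAMPLE_RATE_HZ, FRAME_MS, BYTES_PER_SAMPLE); // 160
const fps = framesPerSecond(FRAME_MS); // 50
```

At 50 frames per second, a handler that blocks for longer than 20ms falls behind the incoming stream, which is why the audio path cannot wait on cognitive work.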

Requirement 2: Per-turn cognitive reasoning

Once you have the transcript, the LLM has to: classify intent, decide what to do, optionally call tools, commit state, and return a response. This is a stateful multi-step process that might take 500-2000ms depending on tool calls and model latency.

This is a transport-agnostic problem. The reasoning is the same whether the input came from a phone call, a browser voice test, or a typed message. Failure modes are recoverable (you can retry) and don't require sub-second response.

Mixing these two requirements in one process has predictable costs: real-time audio code gets interleaved with cognitive code, latency budgets blur, debugging becomes far harder, and you can't swap either layer without breaking the other.

The audio layer (CallOrchestrator)

The audio layer owns everything that's transport-bound and real-time. In our architecture, this is the CallOrchestrator — one instance per active phone call.

Responsibilities

Transport management: Twilio, Plivo, Exotel WebSockets for telephony; LiveKit WebRTC for browser-based voice tests. The CallOrchestrator handles connection lifecycle, audio I/O, and per-provider quirks.
STT streaming: Deepgram or Sarvam streaming transcription. Receive interim (partial) transcripts as they update and detect when a transcript is final.
VAD + barge-in: Detect when the user stops speaking (commit a turn), detect when the user interrupts the bot mid-response (cancel TTS, commit a new turn).
TTS streaming: Sarvam, ElevenLabs, or Deepgram TTS streaming back to the audio channel. Manage backpressure when TTS is faster than playback.
State machine: LISTENING → PROCESSING → SPEAKING transitions, with proper handling of interrupts and overlapping events.
Memory: Per-call turn history for context that flows into the cognitive layer.
Cost and latency tracking: Per-call billing data, latency budgets per stage.
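
The state machine above can be sketched as a transition table. This is an illustrative sketch, not our production code: the event names, the ENDED state, and the choice to return to LISTENING on barge-in are assumptions made for the example.

```typescript
type CallState = 'LISTENING' | 'PROCESSING' | 'SPEAKING' | 'ENDED';

// Hypothetical event names, chosen for illustration only.
type CallEvent =
  | 'turn_committed' // VAD decided the user finished speaking
  | 'response_ready' // cognitive layer returned actions
  | 'playback_done'  // TTS audio finished playing
  | 'barge_in'       // user started speaking over the bot
  | 'hangup';        // caller hung up or the transport closed

// Allowed transitions per state. Anything not listed is not a valid move.
const transitions: Record<CallState, Partial<Record<CallEvent, CallState>>> = {
  LISTENING: { turn_committed: 'PROCESSING', hangup: 'ENDED' },
  PROCESSING: { response_ready: 'SPEAKING', barge_in: 'LISTENING', hangup: 'ENDED' },
  SPEAKING: { playback_done: 'LISTENING', barge_in: 'LISTENING', hangup: 'ENDED' },
  ENDED: {},
};

// Returns the next state; events that are not valid in the current state are ignored.
function nextState(state: CallState, event: CallEvent): CallState {
  return transitions[state][event] ?? state;
}
```

Keeping the transitions in one table makes it easier to see which overlapping events (for example, a late `playback_done` after a barge-in) are silently ignored rather than handled by accident.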

Implementation details that matter

Some lessons we've learned the hard way about the audio layer:

Barge-in detection has to be aggressive. Users hate getting talked over. We detect barge-in within 200ms of the user starting to speak and immediately cancel in-flight TTS.
The state machine has to handle edge cases: user starts speaking while bot is loading first response; user hangs up mid-TTS; STT stream disconnects mid-utterance. Each of these has bitten us in production.
Audio backpressure matters. If TTS generates audio faster than the WebSocket can drain it, you either drop audio (bad) or buffer indefinitely (worse). Implement explicit backpressure.
The transport is not actually reliable. Telephony WebSockets disconnect, reconnect, drop packets. The state machine has to assume the transport can fail at any moment.
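
One way to implement the explicit backpressure mentioned above is a bounded queue between the TTS producer and the socket writer. The sketch below is a simplified illustration under that assumption; `BoundedAudioQueue` and its methods are hypothetical names, and a real implementation would also need cancellation for barge-in and hangup.

```typescript
// A bounded queue: push() waits while the queue is full, so a fast producer
// (TTS) is slowed to the pace of the consumer (the socket writer).
class BoundedAudioQueue {
  private chunks: Uint8Array[] = [];
  private waitingProducers: Array<() => void> = [];

  constructor(private readonly maxChunks: number) {}

  get size(): number {
    return this.chunks.length;
  }

  // Resolves once there is room and the chunk has been enqueued.
  async push(chunk: Uint8Array): Promise<void> {
    while (this.chunks.length >= this.maxChunks) {
      await new Promise<void>((resolve) => this.waitingProducers.push(resolve));
    }
    this.chunks.push(chunk);
  }

  // Removes the oldest chunk and wakes one waiting producer, if any.
  shift(): Uint8Array | undefined {
    const chunk = this.chunks.shift();
    this.waitingProducers.shift()?.();
    return chunk;
  }
}
```

The TTS side awaits `push()` for every chunk, and the writer calls `shift()` each time the socket is ready for more. Memory stays bounded by `maxChunks` and no audio is dropped; the cost is that TTS generation pauses while the queue is full.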

The cognitive layer (FlowOrchestrator)

The cognitive layer owns the per-turn reasoning. In our architecture, this is the FlowOrchestrator — invoked once per user utterance.

Responsibilities

Dispatch: LLM-based intent classification. Given the user's utterance and conversation context, what do they want?
Precedence policy: Multiple intents can match a single utterance. The precedence policy picks the winner (safety > explicit goal > implicit context > clarification > retry).
Chain execution: Walk through the conversation flow graph. Execute nodes (speak, decide, fetch, transition).
Tool calls: HTTP fetches to external systems with structured outputs and retry logic.
State commit: Persist conversation state to the database. Idempotent by turnId so retries don't corrupt state.
Return actions: Output is a list of actions (text to speak, transitions to make, state updates to commit). The audio layer executes these.
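
The precedence policy can be sketched as a ranked selection over matched intents. This is a minimal illustration: the class ordering comes from the list above, while the `IntentMatch` shape, the class identifiers, and the confidence tie-break are assumptions made for the example.

```typescript
// Precedence order from the text, highest priority first.
const PRECEDENCE = ['safety', 'explicit_goal', 'implicit_context', 'clarification', 'retry'] as const;
type IntentClass = (typeof PRECEDENCE)[number];

// Hypothetical shape of a matched intent; field names are illustrative.
interface IntentMatch {
  intent: string;
  cls: IntentClass;
  confidence: number; // 0..1, as reported by the classifier
}

// Picks the match in the highest-precedence class.
// Within the same class, the higher confidence wins (an assumed tie-break).
function pickWinner(matches: IntentMatch[]): IntentMatch | undefined {
  const rank = (m: IntentMatch): number => PRECEDENCE.indexOf(m.cls);
  return [...matches].sort((a, b) => rank(a) - rank(b) || b.confidence - a.confidence)[0];
}
```

Note that class rank is compared before confidence, so a low-confidence safety match still beats a high-confidence explicit goal. Whether that is the right trade-off depends on how noisy the safety classifier is.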

Why transport-agnostic matters

The cognitive layer doesn't know whether it's being called from a phone call, a browser voice test, an editor dry-run, or an integration test. This is the architectural win that pays off in every direction:

Browser voice tests run the same engine as production calls. Bug repros are exact. New flows can be tested without provisioning a phone number.
Dry-run mode in the flow editor uses the same engine. Designers test flows without audio. Engineers test changes without telephony.
Integration tests are trivial. No mocking the audio layer; just call the cognitive engine directly with synthetic transcripts.
Swapping STT/TTS providers is a config change. The cognitive layer doesn't care.
Adding a new transport (a different telephony provider) is bounded work. Implement the audio layer; the cognitive layer is reused as-is.
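
To make the integration-test point concrete, here is a sketch of driving an engine with synthetic transcripts and no audio. The `stubEngine` below is a stand-in invented for this example, the types are trimmed to the fields the example needs, and the call is synchronous for brevity (a real engine call would be awaited).

```typescript
type Action =
  | { kind: 'tts'; text: string }
  | { kind: 'end_call'; reason: string };

interface TurnInput {
  callSid: string;
  turnId: string;
  transcript: string;
}

interface TurnOutput {
  actions: Action[];
}

type Engine = (input: TurnInput) => TurnOutput;

// Stand-in engine for illustration only; a real engine would walk the flow graph.
const stubEngine: Engine = ({ transcript }) =>
  /\b(bye|goodbye)\b/i.test(transcript)
    ? { actions: [{ kind: 'tts', text: 'Goodbye!' }, { kind: 'end_call', reason: 'user_goodbye' }] }
    : { actions: [{ kind: 'tts', text: 'How can I help?' }] };

// Feeds synthetic transcripts through the engine, one turn each, with no audio involved.
function runScript(engine: Engine, transcripts: string[]): TurnOutput[] {
  return transcripts.map((transcript, i) =>
    engine({ callSid: 'test-call', turnId: `turn-${i}`, transcript }),
  );
}

const outputs = runScript(stubEngine, ['hi there', 'ok bye']);
```

A test then asserts on the returned actions (for example, that the second turn ends the call) without a phone number, an STT stream, or a TTS provider in the loop.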

The handoff between layers

The two layers communicate through a narrow interface. The CallOrchestrator hands the FlowOrchestrator a turn; the FlowOrchestrator returns a list of actions.

```typescript
// The handoff interface
interface TurnInput {
  callSid: string;
  turnId: string;
  transcript: string;
  agentFlowId: string;
  variables: Record<string, unknown>;
}

interface TurnOutput {
  actions: Action[];
  state: ConversationState;
}

type Action =
  | { kind: 'tts'; text: string }
  | { kind: 'transition'; nodeId: string }
  | { kind: 'end_call'; reason: string }
  | { kind: 'transfer'; to: string };

// The audio layer calls the cognitive layer like this.
// The turnId is generated once per turn, outside the call, so that a retry
// of runTurn reuses it and the idempotent state commit can deduplicate.
const turnId = generateTurnId();
const output = await flowOrchestrator.runTurn({
  callSid,
  turnId,
  transcript: finalizedTranscript,
  agentFlowId,
  variables: callContext,
});

// Then executes the returned actions in order:
for (const action of output.actions) {
  if (action.kind === 'tts') {
    await ttsStream(action.text);
  } else if (action.kind === 'end_call') {
    await endCall(action.reason);
  }
  // ...
}
```

The narrowness of this interface is the architectural feature. Both layers can evolve independently. The cognitive layer doesn't know what TTS provider is being used; the audio layer doesn't know what LLM is making decisions.

Production wins from this architecture

Concrete wins we've gotten from this two-level split in production:

Provider swapping without downtime. We've swapped STT providers (Deepgram ↔ Sarvam) and TTS providers (Sarvam ↔ ElevenLabs) without touching cognitive logic.
Browser voice testing for designers. Non-engineers can test new flows in the browser. Same engine, no telephony required.
Eval harnesses that don't need audio. Regression tests run the cognitive engine against thousands of synthetic transcripts in CI. No audio infrastructure needed.
Incident debugging via replay. Every production turn is logged with input transcript and output actions. We can replay any turn against the current engine to diagnose issues.
Independent scaling. Audio layer scales with concurrent calls; cognitive layer scales with utterance rate. They're different workloads with different infrastructure needs.
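
The replay workflow can be sketched as a diff between logged outputs and what the current engine returns for the same inputs. This is an illustrative sketch: `LoggedTurn` and `findRegressions` are hypothetical names, the engine is treated as synchronous, and the comparison assumes the engine is deterministic for a given input, which an LLM-backed engine may only approximate.

```typescript
// One logged production turn: what went in and what came out.
interface LoggedTurn<I, O> {
  input: I;
  output: O;
}

interface ReplayDiff<I, O> {
  input: I;
  expected: O; // what production returned at the time
  actual: O;   // what the current engine returns now
}

// Replays each logged input against the engine and collects the turns
// whose output no longer matches the log.
function findRegressions<I, O>(
  log: LoggedTurn<I, O>[],
  engine: (input: I) => O,
): ReplayDiff<I, O>[] {
  const diffs: ReplayDiff<I, O>[] = [];
  for (const { input, output } of log) {
    const actual = engine(input);
    // JSON comparison is a simplification: it is sensitive to key order.
    if (JSON.stringify(actual) !== JSON.stringify(output)) {
      diffs.push({ input, expected: output, actual });
    }
  }
  return diffs;
}
```

An empty result means the current engine reproduces the logged behavior for those turns; a non-empty one points at the exact inputs to investigate.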

The two-level architecture isn't about elegance; it's about operational pain we've felt and prevented. In our experience, teams that build voice agents with audio and cognition in one process tend to end up rebuilding with the split.

When you don't need this architecture

This pattern is overkill for some use cases:

Single-purpose voice bots with no expected reuse across transports.
Proof-of-concept demos where production reliability isn't the goal.
Teams without engineering capacity to build and maintain two layers.
Use cases where you're using a managed voice platform (Vapi, Retell, Synthflow) that abstracts this for you.

For production voice AI that you're building yourself and operating at any meaningful scale, though, the two-level split pays for itself within the first month.

Conclusion

If you're building production voice AI, design the two-level orchestration from day one. Mixing audio I/O and cognitive reasoning into one process is the fastest path to a system that breaks when you swap providers, add transports, or change conversation logic.

The architecture isn't hard to implement — but it's much harder to retrofit later, once you have existing flows, existing integrations, and existing production traffic. Build it right at the start.

If you're evaluating voice AI architectures for a project, we've shipped this pattern across multiple production deployments and are happy to share what we've learned. The specifics depend on your transport, model providers, and scale — but the two-level split is the right starting point for nearly every serious voice AI build.
