
Turn Detection in Voice AI?

Deepesh Agrawal
Published on 23 Sep 2025

The Problem That Started It All

Our voice agents were functionally solid, but the conversational flow felt unnatural. Users would pause to think or search for the right words (many of our users switch between Hindi and English), and our agent would jump in prematurely. These interruptions weren't just annoying – they were breaking the natural flow of conversation and hurting the user experience.

Going Under the Hood: LiveKit's Turn Detection Architecture

Determined to solve this, our team dove deep into LiveKit's SDK source code. Here's what we found lurking in the plugins directory – the turn detection system that powers millions of voice interactions.

LiveKit provides two main turn detection models:

  • English-only model: Optimized for English conversations
  • Multilingual model: Supports multiple languages (crucial for our Indian users who often mix Hindi and English)
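In the Python SDK, choosing between the two looks roughly like the snippet below. Treat it as a minimal sketch based on the livekit-agents 1.x turn-detector plugin we were using: the import path, the min_endpointing_delay / max_endpointing_delay parameter names, and the delay values are from our setup and may differ in other releases (STT/LLM/TTS wiring is omitted).

from livekit.agents import AgentSession
from livekit.plugins.turn_detector.multilingual import MultilingualModel

# Minimal sketch: pick the turn-detection model and set the two endpointing
# delays discussed later in this post. STT/LLM/TTS plugins omitted for brevity.
session = AgentSession(
    turn_detection=MultilingualModel(),  # better for Hindi-English code-switching
    min_endpointing_delay=0.5,  # seconds to wait when the model is confident
    max_endpointing_delay=1.2,  # seconds to wait when it is not
)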

The Algorithm Revealed

After analyzing the code, we uncovered how LiveKit's turn detection actually works:

  1. Context Analysis: The model takes recent conversation context along with the current utterance
  2. EOU Probability Calculation: It generates an "End of Utterance" probability score
  3. Threshold-Based Decision Making: This is where it gets interesting...
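Before getting to that, here's a hypothetical sketch of steps 1 and 2 plus the comparison that step 3 hinges on. fake_eou_model and EOU_THRESHOLD are placeholders invented for illustration, not LiveKit's actual API.

EOU_THRESHOLD = 0.5  # assumed value, purely for illustration


def fake_eou_model(chat_context: list[tuple[str, str]]) -> float:
    """Stand-in for the real end-of-utterance model."""
    last_user_text = chat_context[-1][1].rstrip()
    # Crude heuristic for the demo: a trailing conjunction looks incomplete.
    return 0.2 if last_user_text.endswith(("aur", "and", "but")) else 0.9


# 1. Context analysis: recent turns plus the current (possibly partial) utterance.
chat_context = [
    ("assistant", "Which symptoms have been bothering you?"),
    ("user", "Mujhe headache hai aur"),  # code-switched and likely incomplete
]

# 2. EOU probability: a single score for "the user is done speaking".
eou_probability = fake_eou_model(chat_context)

# 3. Threshold-based decision: the score is compared against a threshold
#    to choose a delay (see "The Two-Delay System" below).
print(eou_probability > EOU_THRESHOLD)  # False -> give the user more time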

The Two-Delay System

LiveKit uses a clever dual-delay mechanism:

When EOU probability > threshold:

  • Triggers min_endpoint_delay
  • The system is confident the user is done speaking
  • Shorter wait time for snappy, responsive conversations

When EOU probability < threshold:

  • Triggers max_endpoint_delay (usually 800-1200ms)
  • The system is unsure, so it gives the user more thinking time
  • Longer wait time to avoid awkward interruptions

Think of it like a careful conversational partner - when they're pretty sure you're done talking, they jump in quickly. When they're not sure, they wait a bit longer to see if you'll continue.

The fascinating (and problematic) part? LiveKit uses surprisingly low thresholds for triggering the faster min_endpoint_delay. This aggressive approach explains why interruptions happen so frequently - the system is too eager to classify utterances as "complete" and jumps in with the shorter delay.
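In code, the decision boils down to something like the sketch below. The parameter names mirror the description above rather than LiveKit's exact API, and the threshold and delay values are illustrative.

def choose_endpoint_delay(
    eou_probability: float,
    threshold: float = 0.5,  # illustrative; as noted, LiveKit's real thresholds are lower
    min_endpoint_delay: float = 0.4,  # seconds: confident, snappy path
    max_endpoint_delay: float = 1.0,  # seconds: unsure, give thinking time
) -> float:
    """Return how long to wait after speech stops before the agent replies."""
    if eou_probability > threshold:
        # High confidence the utterance is complete: respond quickly.
        return min_endpoint_delay
    # Low confidence: hold back so the user can keep talking.
    return max_endpoint_delay


print(choose_endpoint_delay(0.9))  # 0.4 -> agent jumps in fast
print(choose_endpoint_delay(0.3))  # 1.0 -> agent waits about a second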

The Missing Pieces

Here's where it gets interesting (and frustrating). While LiveKit provides model weights on GitHub, they don't share the complete training pipeline or architecture details. This black-box approach makes it difficult for developers to truly understand and optimize the system for their specific use cases.

My Key Findings

1. Punctuation Paradox

Initially, we suspected that STT-generated punctuation was confusing the turn detection model. However, our investigation revealed that LiveKit actually strips out punctuation before processing! This means the model is purely relying on semantic understanding of context and meaning – which is both impressive and explains some of the challenges.
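As a rough illustration of that preprocessing (our approximation, not LiveKit's exact normalization code):

import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text: str) -> str:
    """Approximate the punctuation stripping described above."""
    return text.translate(_PUNCT_TABLE)


# Whatever punctuation the STT adds, the turn-detection model never sees it.
print(strip_punctuation("I was thinking, um... maybe we should"))
# -> I was thinking um maybe we should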

2. Context is King

The model doesn't just look at the current sentence in isolation. It considers conversation context to make turn detection decisions. This explains why it generally works well for standard conversations but struggles with domain-specific interactions like healthcare consultations.

3. Ongoing Evolution

LiveKit is actively improving these models. The system is clearly a work in progress, with regular updates to improve accuracy.

A Simple But Effective Workaround

While we wait for better turn detection models, there's a practical prompt-engineering workaround that significantly reduces interruptions:

CRITICAL: IF USER SPEECH IS INCOMPLETE BASED ON CONTEXT, 
RESPOND WITH ONLY A SINGLE SPACE (NO QUOTES, NO OTHER 
CHARACTERS, JUST THE SPACE) - THIS IS THE HIGHEST 
PRIORITY RULE

Why This Works

When your LLM detects incomplete speech and returns a single space:

  • ElevenLabs TTS doesn't generate any audible sound
  • The conversation flow remains natural
  • Users get the thinking time they need

But here's the thing - it's still an LLM making this decision, so nobody really knows where it might break or fail. LLMs can be unpredictable, and sometimes they might misinterpret complete sentences as incomplete or vice versa. It's not a perfect solution, but it significantly improved our user experience.
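For completeness, here's roughly how we wire the rule in. call_llm and synthesize_speech below are stand-ins for our actual LLM and ElevenLabs TTS calls, not a real SDK API; the early return is a belt-and-braces check on top of the fact that a lone space produces no audible output anyway.

TURN_HOLD_RULE = (
    "CRITICAL: IF USER SPEECH IS INCOMPLETE BASED ON CONTEXT, RESPOND WITH ONLY "
    "A SINGLE SPACE (NO QUOTES, NO OTHER CHARACTERS, JUST THE SPACE) - THIS IS "
    "THE HIGHEST PRIORITY RULE"
)


def call_llm(system_prompt: str, transcript: str) -> str:
    """Placeholder: imagine this hits the real LLM."""
    return " "  # pretend the model judged the utterance incomplete


def synthesize_speech(text: str) -> None:
    """Placeholder for the TTS call (ElevenLabs in our stack)."""
    print(f"TTS: {text!r}")


def handle_user_turn(base_prompt: str, transcript: str) -> None:
    reply = call_llm(base_prompt + "\n\n" + TURN_HOLD_RULE, transcript)
    # A lone space means "the user is not done yet" - stay silent instead of speaking.
    if not reply.strip():
        return
    synthesize_speech(reply)


handle_user_turn("You are a helpful healthcare voice assistant.", "Mujhe lagta hai ki")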

The Indian Context Challenge

Working with Indian users adds another layer of complexity to turn detection. Our users frequently:

  • Switch between Hindi and English mid-conversation
  • Take longer pauses when translating thoughts
  • Use different speech patterns and rhythms

Current turn detection models, primarily trained on Western conversation patterns, struggle with these nuances. This highlights the need for more culturally and linguistically diverse training data in voice AI systems.

The Bigger Picture

This deep dive into LiveKit's turn detection reveals a broader truth about voice AI: the technology is incredibly sophisticated, yet still has significant room for improvement. The difference between a good voice AI and a great one often lies in these subtle interaction details.

For developers building voice applications, especially for diverse markets like India, understanding these internals isn't just academic – it's essential for creating truly natural conversational experiences.

Wrapping Up

LiveKit's turn detection system is an impressive piece of engineering, but like all AI systems, it has limitations. By understanding how it works under the hood, we can better optimize our applications and create more natural voice interactions.

The journey continues, and I'm excited to share more findings as we push the boundaries of what's possible with voice AI in healthcare and beyond.

Have you encountered similar challenges with turn detection in your voice AI projects? I'd love to hear about your experiences and solutions. Connect with me to continue the conversation about building better voice experiences for Indian users and beyond.