
Our voice agents were working well functionally, but the conversational flow felt unnatural. Users would pause to think or search for the right words (many of our users switch between Hindi and English), and our agent would jump in prematurely. These interruptions weren't just annoying – they were breaking the natural flow of conversation and impacting user experience.
Determined to solve this, our team dove deep into LiveKit's SDK source code. Here's what we found lurking in the plugins directory – the turn detection system that powers millions of voice interactions.
LiveKit provides two main turn detection models: an English-only model and a multilingual model.
After analyzing the code, we uncovered how LiveKit's turn detection actually works.
LiveKit uses a clever dual-delay mechanism:

- When the EOU (end-of-utterance) probability is above the threshold, the agent waits only min_endpoint_delay before responding.
- When the EOU probability is below the threshold, it waits for the longer max_endpoint_delay (usually 800-1200ms).

Think of it like a careful conversational partner: when they're pretty sure you're done talking, they jump in quickly; when they're not sure, they wait a bit longer to see if you'll continue.
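To make the mechanism concrete, here's a minimal sketch of the decision logic as we understand it. The threshold and delay values are illustrative placeholders, not LiveKit's actual defaults, and the function is our own simplification rather than the plugin's code:

def pick_endpoint_delay(
    eou_probability: float,
    threshold: float = 0.15,          # illustrative, not LiveKit's real default
    min_endpoint_delay: float = 0.5,  # seconds
    max_endpoint_delay: float = 1.0,  # seconds, roughly the 800-1200ms range
) -> float:
    # If the model is confident the user has finished, respond after the
    # short delay; otherwise hold back for the longer one.
    if eou_probability >= threshold:
        return min_endpoint_delay
    return max_endpoint_delay

# A hesitant, trailing-off utterance scores low and gets the long wait:
print(pick_endpoint_delay(0.08))   # -> 1.0
print(pick_endpoint_delay(0.62))   # -> 0.5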
The fascinating (and problematic) part? LiveKit uses surprisingly low thresholds for triggering the faster min_endpoint_delay. This aggressive approach explains why interruptions happen so frequently - the system is too eager to classify utterances as "complete" and jumps in with the shorter delay.
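The practical lever, if the defaults feel too trigger-happy for your use case, is to raise the endpointing delays when you configure the agent session. The snippet below is a hedged sketch based on the Python agents SDK docs at the time we experimented; the option names (min_endpointing_delay, max_endpointing_delay) and import path are assumptions that may differ across livekit-agents versions, so verify them against your installed release:

from livekit.agents import AgentSession
from livekit.plugins.turn_detector.multilingual import MultilingualModel

# Assumed option names -- check your livekit-agents version before relying on them.
session = AgentSession(
    turn_detection=MultilingualModel(),  # semantic end-of-utterance model
    min_endpointing_delay=0.8,           # wait longer even when the model is "confident"
    max_endpointing_delay=3.0,           # give hesitant, code-switching speakers more room
)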
Here's where it gets interesting (and frustrating). While LiveKit provides model weights on GitHub, they don't share the complete training pipeline or architecture details. This black-box approach makes it difficult for developers to truly understand and optimize the system for their specific use cases.
Initially, we suspected that STT-generated punctuation was confusing the turn detection model. However, our investigation revealed that LiveKit actually strips out punctuation before processing! This means the model relies purely on semantic understanding of context and meaning – which is impressive, but also explains some of the challenges.
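To illustrate what that stripping means in practice, here's the kind of normalization step we're describing – our own reconstruction, not LiveKit's actual preprocessing code:

import re

def normalize_for_eou(text: str) -> str:
    # Lowercase and drop punctuation so the model sees only the words
    # (our approximation of the behaviour we observed, not LiveKit's code).
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

print(normalize_for_eou("I was thinking, maybe we could..."))
# -> "i was thinking maybe we could"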
The model doesn't just look at the current sentence in isolation. It considers conversation context to make turn detection decisions. This explains why it generally works well for standard conversations but struggles with domain-specific interactions like healthcare consultations.
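Conceptually, the model scores end-of-utterance over a short window of recent turns plus the new transcript, something like the sketch below. The window size and formatting here are our own illustrative assumptions, not LiveKit's internals:

MAX_CONTEXT_TURNS = 4  # illustrative; the real window size is internal to the model

def build_eou_input(history: list[dict], new_transcript: str) -> str:
    # Take the last few turns and append the user's in-progress utterance.
    turns = history[-MAX_CONTEXT_TURNS:] + [{"role": "user", "text": new_transcript}]
    return "\n".join(f"{t['role']}: {t['text']}" for t in turns)

history = [
    {"role": "assistant", "text": "which symptoms have you been noticing"},
    {"role": "user", "text": "i have had a headache since"},
]
print(build_eou_input(history, "yesterday morning and also"))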
LiveKit is actively improving these models. The system is clearly a work in progress, with regular updates to improve accuracy.
While we wait for better turn detection models, there is a practical prompt engineering solution that significantly reduces interruptions:
CRITICAL: IF USER SPEECH IS INCOMPLETE BASED ON CONTEXT,
RESPOND WITH ONLY A SINGLE SPACE (NO QUOTES, NO OTHER
CHARACTERS, JUST THE SPACE) - THIS IS THE HIGHEST
PRIORITY RULE
When your LLM decides the speech is incomplete and returns a single space, the agent has nothing to say out loud - it effectively stays silent and gives the user room to finish their thought.
But here's the thing - it's still an LLM making this call, and its failure modes are hard to predict: it can occasionally flag a complete sentence as incomplete, or miss a genuinely unfinished one. It's not a perfect solution, but it significantly improved our user experience.
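On the application side, the only plumbing this trick needs is a guard that keeps a whitespace-only reply from ever reaching TTS. Something like this hypothetical helper (our own, not a LiveKit API) is enough:

def should_speak(llm_response: str) -> bool:
    # A reply that is only whitespace is our "user isn't done yet" signal.
    return bool(llm_response.strip())

for reply in [" ", "Sure, let me pull up your last prescription."]:
    if should_speak(reply):
        print("send to TTS:", reply)
    else:
        print("stay silent and keep listening")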
Working with Indian users adds another layer of complexity to turn detection. Our users frequently switch between Hindi and English mid-sentence, and pause to search for the right word in whichever language fits best.
Current turn detection models, primarily trained on Western conversation patterns, struggle with these nuances. This highlights the need for more culturally and linguistically diverse training data in voice AI systems.
This deep dive into LiveKit's turn detection reveals a broader truth about voice AI: the technology is incredibly sophisticated, yet still has significant room for improvement. The difference between a good voice AI and a great one often lies in these subtle interaction details.
For developers building voice applications, especially for diverse markets like India, understanding these internals isn't just academic – it's essential for creating truly natural conversational experiences.
LiveKit's turn detection system is an impressive piece of engineering, but like all AI systems, it has limitations. By understanding how it works under the hood, we can better optimize our applications and create more natural voice interactions.
The journey continues, and I'm excited to share more findings as we push the boundaries of what's possible with voice AI in healthcare and beyond.
Have you encountered similar challenges with turn detection in your voice AI projects? I'd love to hear about your experiences and solutions. Connect with me to continue the conversation about building better voice experiences for Indian users and beyond.