Thai Semantic End-of-Turn Detection for Real-Time Voice Agents
October 5, 2025
Authors: Thanapol Popit, Natthapath Rungseesiripak, Monthol Charattrakool, Saksorn Ruangtanusak
cs.AI
Abstract
Fluid voice-to-voice interaction requires reliable and low-latency detection
of when a user has finished speaking. Traditional audio-silence end-pointers
add hundreds of milliseconds of delay and fail under hesitations or
language-specific phenomena. We present, to our knowledge, the first systematic
study of Thai text-only end-of-turn (EOT) detection for real-time agents. We
compare zero-shot and few-shot prompting of compact LLMs to supervised
fine-tuning of lightweight transformers. Using transcribed subtitles from the
YODAS corpus and Thai-specific linguistic cues (e.g., sentence-final
particles), we formulate EOT as a binary decision over token boundaries. We
report a clear accuracy-latency tradeoff and provide an implementation plan
ready for public release. This work establishes a Thai baseline and demonstrates
that small, fine-tuned models can deliver near-instant EOT decisions suitable
for on-device agents.
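
As an illustration of the "binary decision over token boundaries" formulation, the sketch below turns one already-tokenized user turn into (prefix, label) pairs, one per boundary. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `boundary_examples`, the labeling rule (1 only at the final boundary of a turn, 0 elsewhere), and the sample turn are all hypothetical, and the input is assumed to be pre-segmented into tokens by some Thai word tokenizer.

```python
from typing import List, Tuple


def boundary_examples(turn_tokens: List[str]) -> List[Tuple[str, int]]:
    """Build one (prefix_text, label) pair per token boundary in a single turn.

    Assumed labeling rule (hypothetical, for illustration only):
    label 1 at the boundary after the last token of the turn, 0 otherwise.
    """
    examples: List[Tuple[str, int]] = []
    n = len(turn_tokens)
    for i in range(1, n + 1):
        # Thai is written without spaces between words, so tokens are joined directly.
        prefix = "".join(turn_tokens[:i])
        label = 1 if i == n else 0
        examples.append((prefix, label))
    return examples


# Hypothetical pre-tokenized turn ending in the polite sentence-final particle "ครับ".
turn = ["ขอบคุณ", "มาก", "ครับ"]
examples = boundary_examples(turn)

for prefix, label in examples:
    print(label, prefix)
```

Pairs of this shape could serve either as inputs to a prompted LLM or as training examples for a fine-tuned classifier. In a streaming agent, the same prefix would be rescored each time a new token arrives from the speech recognizer.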