Thai Semantic End-of-Turn Detection for Real-Time Voice Agents

Abstract

Fluid voice-to-voice interaction requires reliable, low-latency detection of when a user has finished speaking. Traditional audio-silence end-pointers add hundreds of milliseconds of delay and fail under hesitations or language-specific phenomena. We present, to our knowledge, the first systematic study of Thai text-only end-of-turn (EOT) detection for real-time agents. We compare zero-shot and few-shot prompting of compact LLMs with supervised fine-tuning of lightweight transformers. Using transcribed subtitles from the YODAS corpus and Thai-specific linguistic cues (e.g., sentence-final particles), we formulate EOT as a binary decision over token boundaries. We report a clear accuracy-latency tradeoff and provide a public-ready implementation plan. This work establishes a Thai baseline and demonstrates that small, fine-tuned models can deliver near-instant EOT decisions suitable for on-device agents.
