Thai Semantic End-of-Turn Detection for Real-Time Voice Agents

Abstract

Fluid voice-to-voice interaction requires reliable, low-latency detection of when a user has finished speaking. Traditional audio-silence end-pointers add hundreds of milliseconds of delay and break down under hesitations or language-specific phenomena. We present, to our knowledge, the first systematic study of Thai text-only end-of-turn (EOT) detection for real-time agents. We compare zero-shot and few-shot prompting of compact LLMs against supervised fine-tuning of lightweight transformers. Using transcribed subtitles from the YODAS corpus and Thai-specific linguistic cues (e.g., sentence-final particles), we formulate EOT as a binary decision over token boundaries. We report a clear accuracy-latency tradeoff and provide a public-ready implementation plan. This work establishes a Thai baseline and demonstrates that small, fine-tuned models can deliver near-instant EOT decisions suitable for on-device agents.
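
To illustrate the token-boundary formulation, the following minimal sketch turns one complete user turn into binary training examples: every proper prefix is labeled 0 (turn continues) and the full turn is labeled 1 (end of turn). The function name `boundary_examples`, the pre-segmented token list, and this particular labeling scheme are assumptions made for illustration; the abstract does not specify how the actual labels are constructed.

```python
from typing import List, Tuple


def boundary_examples(tokens: List[str]) -> List[Tuple[str, int]]:
    """Build one (prefix, label) pair per token boundary of a complete turn.

    Hypothetical labeling: 1 only at the final boundary, 0 everywhere else.
    """
    examples = []
    for i in range(1, len(tokens) + 1):
        # Thai is written without spaces between words, so tokens are joined directly.
        prefix = "".join(tokens[:i])
        label = 1 if i == len(tokens) else 0
        examples.append((prefix, label))
    return examples


# Assumed pre-segmented turn ending in the polite sentence-final particle "ครับ".
turn = ["ไป", "กิน", "ข้าว", "กัน", "ไหม", "ครับ"]
for prefix, label in boundary_examples(turn):
    print(label, prefix)
```

A classifier trained on such pairs can be queried after each incoming token, which is what allows a decision without waiting for a silence timeout.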
