Thai Semantic End-of-Turn Detection for Real-Time Voice Agents

Abstract

Fluid voice-to-voice interaction requires reliable, low-latency detection of when a user has finished speaking. Traditional audio-silence end-pointers add hundreds of milliseconds of delay and fail under hesitations or language-specific phenomena. We present, to our knowledge, the first systematic study of Thai text-only end-of-turn (EOT) detection for real-time agents. We compare zero-shot and few-shot prompting of compact LLMs with supervised fine-tuning of lightweight transformers. Using transcribed subtitles from the YODAS corpus and Thai-specific linguistic cues (e.g., sentence-final particles), we formulate EOT as a binary decision over token boundaries. We report a clear accuracy-latency tradeoff and provide a public-ready implementation plan. This work establishes a Thai baseline and demonstrates that small, fine-tuned models can deliver near-instant EOT decisions suitable for on-device agents.
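
The formulation of EOT as a binary decision over token boundaries can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes that each subtitle segment is treated as one complete turn and that Thai word segmentation has already been done upstream; the function name `eot_examples` and the sample utterance are hypothetical. Every token boundary yields one example: the text seen so far and a label that is 1 only at the final boundary.

```python
from typing import List, Tuple


def eot_examples(turns: List[List[str]]) -> List[Tuple[str, int]]:
    """Turn pre-segmented utterances into (prefix, label) pairs.

    Assumption (not from the paper): each inner list is one complete
    turn, already split into word tokens. Label 1 marks the boundary
    after the last token (end of turn); every earlier boundary is 0.
    """
    examples: List[Tuple[str, int]] = []
    for tokens in turns:
        for i in range(1, len(tokens) + 1):
            # Thai is written without spaces between words, so the
            # prefix is rebuilt by plain concatenation.
            prefix = "".join(tokens[:i])
            label = 1 if i == len(tokens) else 0
            examples.append((prefix, label))
    return examples


if __name__ == "__main__":
    # Hypothetical utterance ending in the polite sentence-final particle "ครับ".
    turn = ["วันนี้", "อากาศ", "ดี", "มาก", "ครับ"]
    for prefix, label in eot_examples([turn]):
        print(label, prefix)
```

Pairs of this kind could serve either as supervised training data for a lightweight classifier or as queries in a zero-shot or few-shot prompt. In a real setup the negative examples would likely need subsampling, since non-final boundaries far outnumber final ones.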
