Unlocking LLM Capabilities in Full-Duplex Speech Models

Abstract

Speech-based large language models are typically constrained to spoken replies. This limits their user-facing output to what can be verbalized and suppresses text-native capabilities such as code generation, structured analysis, and multi-step reasoning during realtime interaction, a serious restriction for tasks that require persistent, structured, and inspectable intermediate outputs. Existing work improves spoken reasoning or full-duplex turn-taking, but still treats text as a hidden intermediate state or a subordinate modality rather than as a first-class output channel. We propose Listen-Write-Speak (LWS), a text-first tri-channel paradigm in which a single autoregressive LLM continuously listens to user audio, writes visible free-form text as its primary output, and speaks a realtime oral response in parallel, all under a shared causal attention context. This behavior is implemented entirely through a Token Schema, with no architectural modifications, and is learned via a two-stage data pipeline that synthesizes per-second cognitive annotations consistent with the revealed input timeline. Empirically, LWS demonstrates strong full-duplex interaction on Full-Duplex-Bench, reaches 4.72 on VoiceBench AlpacaEval, achieves 92.6% writing-speaking consistency, and consistently outperforms its internal ablations on URO-Bench. These results suggest that visible writing can serve as a first-class output channel for speech interaction without sacrificing realtime responsiveness. The code and dataset are available on the project page: https://royalzhang.com/project/lws-page/.