Liberating LLM Capabilities in Full-Duplex Speech Models
May 4, 2026
Authors: Luoyuan Zhang, Bokai Xu, Junbo Cui, Weiyue Sun, Yingjing Xu, Hanyu Liu, Yuan Yao
cs.AI
Abstract
Speech-based large language models are typically constrained to spoken replies, which limits their user-facing outputs to what can be verbalized and suppresses text-native capabilities such as code generation, structured analysis, and multi-step reasoning in realtime interaction, especially for tasks that require persistent, structured, and inspectable intermediate outputs. Existing work improves spoken reasoning or full-duplex turn-taking, but still treats text as a hidden intermediate state or a subordinate modality rather than a first-class output channel. We propose Listen-Write-Speak (LWS), a text-first tri-channel paradigm in which a single autoregressive LLM continuously listens to user audio, writes visible free-form text as its primary output, and speaks a realtime oral response in parallel under a shared causal attention context. This behavior is implemented entirely through a Token Schema, requiring no architectural modifications, and learned via a two-stage data pipeline that synthesizes per-second cognitive annotations consistent with the revealed input timeline. Empirically, LWS demonstrates strong full-duplex interaction on Full-Duplex-Bench, reaches 4.72 on VoiceBench AlpacaEval, achieves 92.6% writing-speaking consistency, and consistently outperforms its internal ablations on URO-Bench. These results suggest that visible writing can serve as a first-class output channel for speech interaction without sacrificing realtime responsiveness. The code and dataset are available on the project page: https://royalzhang.com/project/lws-page/.
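The abstract does not spell out the Token Schema, so the following is only an illustrative sketch of how three channels might share a single causal sequence. It assumes, hypothetically, that each second of interaction contributes a listen chunk, a write chunk, and a speak chunk, each introduced by a channel marker. The marker strings, the `Second` type, and the `interleave` function are invented for illustration and are not taken from the paper or its released code.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical channel markers; the actual LWS Token Schema may differ.
LISTEN, WRITE, SPEAK = "<listen>", "<write>", "<speak>"


@dataclass
class Second:
    """Tokens attributed to one second of interaction, per channel."""
    heard: List[str]    # user audio tokens revealed during this second
    written: List[str]  # visible text tokens emitted during this second
    spoken: List[str]   # speech tokens emitted during this second


def interleave(timeline: List[Second]) -> List[str]:
    """Flatten per-second chunks into one left-to-right token sequence.

    Because the result is a single sequence, a causal model reading it can
    condition each written or spoken token only on audio revealed so far.
    """
    seq: List[str] = []
    for sec in timeline:
        for tag, chunk in ((LISTEN, sec.heard),
                           (WRITE, sec.written),
                           (SPEAK, sec.spoken)):
            seq.append(tag)    # mark which channel the next tokens belong to
            seq.extend(chunk)  # an empty chunk leaves only the marker
    return seq


# Assumed toy timeline: the model stays silent in second 0, responds in second 1.
timeline = [
    Second(heard=["a1", "a2"], written=[], spoken=[]),
    Second(heard=["a3"], written=["def", "f():"], spoken=["s1"]),
]
flat = interleave(timeline)
print(flat)
```

Under these assumptions, the written and spoken tokens for a given second always follow that second's audio in the flattened sequence, which is one simple way to keep outputs consistent with the revealed input timeline without changing the model architecture.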