Unleashing Large Language Model Capabilities in Full-Duplex Speech Models

Abstract

Speech-based large language models are typically constrained to spoken replies. This limits their user-facing output to what can be verbalized and suppresses text-native capabilities such as code generation, structured analysis, and multi-step reasoning during real-time interaction, particularly for tasks that require persistent, structured, and inspectable intermediate outputs. Existing work improves spoken reasoning or full-duplex turn-taking, but still treats text as a hidden intermediate state or a subordinate modality rather than a first-class output channel. We propose Listen-Write-Speak (LWS), a text-first tri-channel paradigm in which a single autoregressive LLM continuously listens to user audio, writes visible free-form text as its primary output, and speaks a real-time oral response in parallel, all under a shared causal attention context. This behavior is implemented entirely through a Token Schema, requiring no architectural modifications, and is learned via a two-stage data pipeline that synthesizes per-second cognitive annotations consistent with the revealed input timeline. Empirically, LWS demonstrates strong full-duplex interaction on Full-Duplex-Bench, reaches a score of 4.72 on VoiceBench AlpacaEval, achieves 92.6% writing-speaking consistency, and consistently outperforms its internal ablations on URO-Bench. These results suggest that visible writing can serve as a first-class output channel for speech interaction without sacrificing real-time responsiveness. The code and dataset are available on the project page: https://royalzhang.com/project/lws-page/.
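
To make the tri-channel idea concrete, the following is a minimal sketch of how three channels could be flattened into one autoregressive token stream. It is an illustration only: the abstract does not specify the actual Token Schema, so the marker tokens `<listen>`, `<write>`, and `<speak>`, the `Step` container, and the helpers `interleave` and `split_channels` are all hypothetical names, and the per-step layout is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical channel markers; the real Token Schema is not given in the abstract.
LISTEN, WRITE, SPEAK = "<listen>", "<write>", "<speak>"


@dataclass
class Step:
    """Tokens belonging to one time step (assumed here to be one second)."""
    heard: List[str]    # user-audio tokens revealed during this step
    written: List[str]  # visible text tokens emitted during this step
    spoken: List[str]   # speech tokens emitted during this step


def interleave(steps: List[Step]) -> List[str]:
    """Flatten per-step channel chunks into a single left-to-right sequence.

    Because the chunks are ordered by time, a causal model reading this
    sequence never sees user audio from a later step when it writes or speaks.
    """
    seq: List[str] = []
    for step in steps:
        seq += [LISTEN, *step.heard, WRITE, *step.written, SPEAK, *step.spoken]
    return seq


def split_channels(seq: List[str]) -> Dict[str, List[str]]:
    """Recover each channel's tokens from the flattened sequence."""
    channels: Dict[str, List[str]] = {LISTEN: [], WRITE: [], SPEAK: []}
    current = None
    for tok in seq:
        if tok in channels:
            current = tok  # a marker switches the active channel
        elif current is not None:
            channels[current].append(tok)
    return channels


steps = [
    Step(heard=["a0"], written=[], spoken=[]),
    Step(heard=["a1"], written=["def"], spoken=["s0"]),
]
stream = interleave(steps)
channels = split_channels(stream)
```

In this sketch the written text and the spoken reply share one context simply because they sit in the same sequence, which is one way a schema-only design could avoid architectural changes.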