Liberating LLM Capabilities in Full-Duplex Speech Models

May 4, 2026
Authors: Luoyuan Zhang, Bokai Xu, Junbo Cui, Weiyue Sun, Yingjing Xu, Hanyu Liu, Yuan Yao
cs.AI

Abstract

Speech-based large language models are typically constrained to spoken replies. In real-time interaction, this limits their user-facing output to what can be verbalized and suppresses text-native capabilities such as code generation, structured analysis, and multi-step reasoning, which matter for tasks that require persistent, structured, and inspectable intermediate outputs. Existing work improves spoken reasoning or full-duplex turn-taking but still treats text as a hidden intermediate state or a subordinate modality rather than a first-class output channel. We propose Listen-Write-Speak (LWS), a text-first tri-channel paradigm in which a single autoregressive LLM continuously listens to user audio, writes visible free-form text as its primary output, and speaks a real-time oral response in parallel, all under a shared causal attention context. This behavior is implemented entirely through a Token Schema, with no architectural modifications, and is learned via a two-stage data pipeline that synthesizes per-second cognitive annotations consistent with the revealed input timeline. Empirically, LWS demonstrates strong full-duplex interaction on Full-Duplex-Bench, reaches 4.72 on VoiceBench AlpacaEval, achieves 92.6% writing-speaking consistency, and consistently outperforms its internal ablations on URO-Bench. These results suggest that visible writing can serve as a first-class output channel for speech interaction without sacrificing real-time responsiveness. The code and dataset are available on the project page: https://royalzhang.com/project/lws-page/.
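
The abstract does not spell out the Token Schema, so the following is only an illustrative sketch of the general idea: three channels serialized into one causal token sequence that a single autoregressive model could attend over. The channel tags (`<listen>`, `<write>`, `<speak>`), the `Frame` type, the per-step ordering, and the toy token strings are all assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical channel tags; the paper's actual Token Schema is not given in the abstract.
LISTEN, WRITE, SPEAK = "<listen>", "<write>", "<speak>"


@dataclass
class Frame:
    """One time step of the three channels (step length is an assumption)."""
    heard: List[str]    # user audio tokens revealed during this step
    written: List[str]  # visible text tokens emitted during this step
    spoken: List[str]   # speech tokens emitted during this step


def serialize(frames: List[Frame]) -> List[str]:
    """Flatten frames into one token sequence, tagging each channel segment."""
    stream: List[str] = []
    for frame in frames:
        for tag, tokens in ((LISTEN, frame.heard),
                            (WRITE, frame.written),
                            (SPEAK, frame.spoken)):
            stream.append(tag)
            stream.extend(tokens)
    return stream


def split_channels(stream: List[str]) -> Dict[str, List[str]]:
    """Recover per-channel token lists from the flat stream."""
    channels: Dict[str, List[str]] = {LISTEN: [], WRITE: [], SPEAK: []}
    current = None
    for token in stream:
        if token in channels:
            current = token          # a tag switches the active channel
        elif current is not None:
            channels[current].append(token)
    return channels


# Toy example: two time steps with placeholder tokens.
frames = [
    Frame(heard=["a1", "a2"], written=["def"], spoken=["s1"]),
    Frame(heard=["a3"], written=["add", "("], spoken=["s2", "s3"]),
]
stream = serialize(frames)
channels = split_channels(stream)
```

Because every channel lives in the same sequence, each written or spoken token in this sketch can depend on all audio revealed up to that step, which is the property the shared causal attention context is described as providing. How the actual system orders, sizes, or tags its segments may differ.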