Unleashing the Potential of Large Language Models in Full-Duplex Speech Models

Abstract

Speech-based large language models are typically constrained to spoken replies. This limits their user-facing output to what can be verbalized and suppresses text-native capabilities such as code generation, structured analysis, and multi-step reasoning, which real-time interactive tasks need when they require persistent, structured, and inspectable intermediate outputs. Existing work improves spoken reasoning or full-duplex turn-taking but still treats text as a hidden intermediate state or a subordinate modality rather than a first-class output channel. We propose Listen-Write-Speak (LWS), a text-first tri-channel paradigm in which a single autoregressive LLM continuously listens to user audio, writes visible free-form text as its primary output, and speaks a real-time oral response in parallel, all under a shared causal attention context. This behavior is implemented entirely through a Token Schema, with no architectural modifications, and is learned via a two-stage data pipeline that synthesizes per-second cognitive annotations consistent with the revealed input timeline. Empirically, LWS demonstrates strong full-duplex interaction on Full-Duplex-Bench, reaches 4.72 on VoiceBench AlpacaEval, achieves 92.6% writing-speaking consistency, and consistently outperforms its internal ablations on URO-Bench. These results suggest that visible writing can serve as a first-class output channel for speech interaction without sacrificing real-time responsiveness. The code and dataset are available on the project page: https://royalzhang.com/project/lws-page/.
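The abstract does not spell out the Token Schema, so the following is only a minimal sketch of the general idea of serializing three channels into one causal token sequence. The channel markers (`<listen>`, `<write>`, `<speak>`), the `Step` container, and the per-slice interleaving order are all hypothetical and are not taken from the LWS implementation.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical channel markers; the real Token Schema may look very different.
LISTEN, WRITE, SPEAK = "<listen>", "<write>", "<speak>"


@dataclass
class Step:
    """Tokens belonging to one time slice (for example, one second)."""
    heard: List[str]    # user audio tokens revealed during this slice
    written: List[str]  # visible text tokens emitted during this slice
    spoken: List[str]   # assistant speech tokens emitted during this slice


def interleave(steps: List[Step]) -> List[str]:
    """Flatten per-slice channel tokens into a single left-to-right sequence.

    Each slice contributes its listen, write, and speak tokens in that order,
    so a causal model attending over the sequence never sees input from a
    later slice while producing output for an earlier one.
    """
    seq: List[str] = []
    for step in steps:
        for tag, tokens in ((LISTEN, step.heard),
                            (WRITE, step.written),
                            (SPEAK, step.spoken)):
            if tokens:  # skip channels that are silent in this slice
                seq.append(tag)
                seq.extend(tokens)
    return seq


def split_channels(seq: List[str]) -> Dict[str, List[str]]:
    """Recover each channel's token stream from the flat sequence."""
    channels: Dict[str, List[str]] = {LISTEN: [], WRITE: [], SPEAK: []}
    current = None
    for tok in seq:
        if tok in channels:
            current = tok  # a marker switches the active channel
        elif current is not None:
            channels[current].append(tok)
    return channels


if __name__ == "__main__":
    demo = [
        Step(heard=["a1", "a2"], written=[], spoken=[]),
        Step(heard=["a3"], written=["def"], spoken=["s1"]),
    ]
    flat = interleave(demo)
    print(flat)
    print(split_channels(flat))
```

Under these assumptions, the written and spoken streams share one context because they live in the same sequence, which is one plausible way a tri-channel behavior could be obtained without changing the model architecture.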