Unlocking LLM Capabilities in Full-Duplex Speech Models

Abstract

Speech-based large language models are typically constrained to spoken replies. This limits their user-facing output to what can be verbalized and suppresses text-native capabilities in real-time interaction, particularly for tasks such as code generation, structured analysis, and multi-step reasoning that require persistent, structured, and inspectable intermediate outputs. Existing work improves spoken reasoning or full-duplex turn-taking, but still treats text as a hidden intermediate state or a subordinate modality rather than a first-class output channel. We propose Listen-Write-Speak (LWS), a text-first tri-channel paradigm in which a single autoregressive LLM continuously listens to user audio, writes visible free-form text as its primary output, and speaks a real-time oral response in parallel, all under a shared causal attention context. This behavior is implemented entirely through a Token Schema, requiring no architectural modifications, and is learned via a two-stage data pipeline that synthesizes per-second cognitive annotations consistent with the revealed input timeline. Empirically, LWS demonstrates strong full-duplex interaction on Full-Duplex-Bench, reaches 4.72 on VoiceBench AlpacaEval, achieves 92.6% writing-speaking consistency, and consistently outperforms its internal ablations on URO-Bench. These results suggest that visible writing can serve as a first-class output channel for speech interaction without sacrificing real-time responsiveness. The code and dataset are available on the project page: https://royalzhang.com/project/lws-page/.
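
The abstract does not specify the Token Schema, so the following is only an illustrative sketch of the general idea: three channels interleaved into one autoregressive token sequence, so that a single model attends causally over everything heard, written, and spoken so far. The marker tokens (`<listen>`, `<write>`, `<speak>`), the `Step` structure, and the per-step chunking are all assumptions made for illustration and are not taken from the LWS implementation.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical channel markers; the actual LWS Token Schema is not given in the abstract.
LISTEN, WRITE, SPEAK = "<listen>", "<write>", "<speak>"


@dataclass
class Step:
    """Tokens for one time slice (assumed here to be one second) on each channel."""
    heard: List[str]    # user audio tokens received in this slice
    written: List[str]  # visible text tokens emitted in this slice
    spoken: List[str]   # response audio tokens emitted in this slice


def interleave(steps: List[Step]) -> List[str]:
    """Flatten per-step channel chunks into a single causal token sequence."""
    seq: List[str] = []
    for step in steps:
        for marker, tokens in ((LISTEN, step.heard),
                               (WRITE, step.written),
                               (SPEAK, step.spoken)):
            seq.append(marker)  # marker announces which channel follows
            seq.extend(tokens)  # a chunk may be empty (e.g., nothing written)
    return seq


def split_channels(seq: List[str]) -> Dict[str, List[str]]:
    """Recover the three per-channel streams from the flat sequence."""
    channels: Dict[str, List[str]] = {LISTEN: [], WRITE: [], SPEAK: []}
    current = None
    for tok in seq:
        if tok in channels:
            current = tok
        elif current is not None:
            channels[current].append(tok)
    return channels


steps = [
    Step(heard=["a1", "a2"], written=["def"], spoken=["s1"]),
    Step(heard=["a3"], written=[], spoken=["s2", "s3"]),
]
seq = interleave(steps)
streams = split_channels(seq)
```

Under this assumed layout, the written and spoken streams share one context because they sit in the same sequence, which matches the abstract's claim that no architectural change is needed; how LWS actually orders, delimits, or aligns the channels may differ.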