When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning

Abstract

In single-stream autoregressive interfaces, the same tokens both update the model state and constitute an irreversible public commitment. This coupling creates a silence tax: additional deliberation postpones the first task-relevant content, while naive early streaming risks premature commitments that bias subsequent generation. We introduce Side-by-Side (SxS) Interleaved Reasoning, which makes disclosure timing a controllable decision within standard autoregressive generation. SxS interleaves partial disclosures with continued private reasoning in the same context, but releases content only when it is supported by the reasoning so far. To learn such pacing without incentivizing filler, we construct entailment-aligned interleaved trajectories by matching answer prefixes to supporting reasoning prefixes, then train with SFT to acquire the dual-action semantics and with RL to recover reasoning performance under the new format. Across two Qwen3 architectures and scales (the MoE Qwen3-30B-A3B and the dense Qwen3-4B) and on both in-domain (AIME25) and out-of-domain (GPQA-Diamond) benchmarks, SxS improves the Pareto trade-off between accuracy and content latency, as measured by token-level proxies such as inter-update waiting.
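As an illustration of what a token-level latency proxy could look like, the following is a minimal sketch, not the authors' metric definition. It assumes each generated token carries a hypothetical channel label, `"think"` for private reasoning or `"speak"` for public disclosure, and it assumes that "inter-update waiting" means the number of private tokens between two consecutive public segments. The function name `latency_proxies` and both definitions are illustrative assumptions.

```python
from typing import List, Tuple


def latency_proxies(channels: List[str]) -> Tuple[int, List[int]]:
    """Compute two illustrative token-level latency proxies.

    channels: one label per generated token, in generation order.
        "think" = private reasoning token (assumed label)
        "speak" = publicly disclosed token (assumed label)

    Returns (tokens_before_first_content, inter_update_waits), where
    inter_update_waits lists the length of each private run that sits
    between two public segments. Private tokens after the last public
    token are not counted as a wait.
    """
    first_content = len(channels)  # default: nothing was ever disclosed
    waits: List[int] = []
    private_run = 0
    seen_public = False

    for index, channel in enumerate(channels):
        if channel == "speak":
            if not seen_public:
                first_content = index
                seen_public = True
            elif private_run > 0:
                waits.append(private_run)
            private_run = 0
        elif channel == "think":
            private_run += 1
        else:
            raise ValueError(f"unknown channel label: {channel!r}")

    return first_content, waits


# Hypothetical interleaved stream: think, disclose, think, disclose, ...
stream = (
    ["think"] * 3 + ["speak"] * 2
    + ["think"] * 4 + ["speak"]
    + ["think"] * 2 + ["speak"] * 2
)
first_content, waits = latency_proxies(stream)
```

Under these assumptions, a think-then-answer baseline would show a large first value and no inter-update waits, while an interleaved stream spreads the private tokens across several shorter waits.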

