

When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning

May 6, 2026
作者: Jiaqi Wei, Xuehang Guo, Pengfei Yu, Xiang Zhang, Wanli Ouyang, Siqi Sun, Qingyun Wang, Chenyu You
cs.AI

Abstract

In single-stream autoregressive interfaces, the same tokens both update the model state and constitute an irreversible public commitment. This coupling creates a silence tax: additional deliberation postpones the first task-relevant content, while naive early streaming risks premature commitments that bias subsequent generation. We introduce Side-by-Side (SxS) Interleaved Reasoning, which makes disclosure timing a controllable decision within standard autoregressive generation. SxS interleaves partial disclosures with continued private reasoning in the same context, but releases content only when it is supported by the reasoning so far. To learn such pacing without incentivizing filler, we construct entailment-aligned interleaved trajectories by matching answer prefixes to supporting reasoning prefixes, then train with SFT to acquire the dual-action semantics and with RL to recover reasoning performance under the new format. Across two Qwen3 architectures/scales (MoE Qwen3-30B-A3B, dense Qwen3-4B) and both in-domain (AIME25) and out-of-domain (GPQA-Diamond) benchmarks, SxS improves the accuracy vs. content-latency Pareto trade-off under token-level proxies such as inter-update waiting time.
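The pacing idea in the abstract — release an answer segment only once the accumulated reasoning supports it — can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: the `supports` function below stands in for a real entailment check (the paper aligns answer prefixes to supporting reasoning prefixes; here we approximate entailment with simple keyword overlap), and all names are invented for illustration.

```python
# Hypothetical sketch of SxS-style disclosure pacing (not the authors' code).
# An answer segment is "spoken" only when the reasoning prefix so far
# supports it; here entailment is approximated by content-word overlap.

def supports(reasoning_prefix, segment, threshold=0.5):
    """Toy entailment proxy: fraction of the segment's words already
    present in the reasoning prefix. A real system would use an
    entailment model here."""
    seen = {w.lower() for step in reasoning_prefix for w in step.split()}
    words = [w.lower() for w in segment.split()]
    if not words:
        return True
    return sum(w in seen for w in words) / len(words) >= threshold

def interleave(reasoning_steps, answer_segments, threshold=0.5):
    """Greedily align each answer segment to the shortest reasoning
    prefix that supports it, yielding (action, text) events in order:
    'think' events stay private, 'speak' events are disclosed."""
    events, i = [], 0
    for step in reasoning_steps:
        events.append(("think", step))
        thought_so_far = [t for a, t in events if a == "think"]
        while i < len(answer_segments) and supports(
            thought_so_far, answer_segments[i], threshold
        ):
            events.append(("speak", answer_segments[i]))
            i += 1
    # Flush any segments still unsupported when reasoning ends.
    for seg in answer_segments[i:]:
        events.append(("speak", seg))
    return events
```

Under this toy policy, a segment whose content words all appear early in the reasoning is disclosed immediately after the first supporting step, while unsupported segments wait — the accuracy vs. content-latency trade-off the abstract describes.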