When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning
May 6, 2026
Authors: Jiaqi Wei, Xuehang Guo, Pengfei Yu, Xiang Zhang, Wanli Ouyang, Siqi Sun, Qingyun Wang, Chenyu You
cs.AI
Abstract
In single-stream autoregressive interfaces, the same tokens both update the model state and constitute an irreversible public commitment. This coupling creates a silence tax: additional deliberation postpones the first task-relevant content, while naive early streaming risks premature commitments that bias subsequent generation. We introduce Side-by-Side (SxS) Interleaved Reasoning, which makes disclosure timing a controllable decision within standard autoregressive generation. SxS interleaves partial disclosures with continued private reasoning in the same context, but releases content only when it is supported by the reasoning so far. To learn such pacing without incentivizing filler, we construct entailment-aligned interleaved trajectories by matching answer prefixes to supporting reasoning prefixes, then train with SFT to acquire the dual-action semantics and RL to recover reasoning performance under the new format. Across two Qwen3 architectures/scales (MoE Qwen3-30B-A3B, dense Qwen3-4B) and both in-domain (AIME25) and out-of-domain (GPQA-Diamond) benchmarks, SxS improves the accuracy–content-latency Pareto trade-off under token-level proxies such as inter-update waiting time.
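The trajectory-construction step described above can be sketched in a few lines. This is a minimal, hypothetical illustration only: the segment lists and the `entails` oracle are stand-ins for the paper's actual segmentation and entailment checker, which the abstract does not specify. The idea is that each answer (disclosure) segment is released only after the shortest reasoning prefix that supports it.

```python
# Hypothetical sketch of entailment-aligned interleaving: disclose each
# answer segment only once the reasoning emitted so far entails it.
# `reasoning_segments`, `answer_segments`, and `entails` are illustrative
# assumptions, not the paper's actual data structures or checker.

def build_interleaved(reasoning_segments, answer_segments, entails):
    """Interleave private reasoning and public answer segments so that
    every disclosed answer prefix is entailed by the reasoning prefix
    emitted before it."""
    trajectory = []
    r = 0  # number of reasoning segments emitted so far
    for k, ans in enumerate(answer_segments):
        # Continue private reasoning until the next disclosure is supported.
        while r < len(reasoning_segments) and not entails(
            reasoning_segments[:r], answer_segments[: k + 1]
        ):
            trajectory.append(("think", reasoning_segments[r]))
            r += 1
        trajectory.append(("speak", ans))
    # Flush any reasoning that remains after the final disclosure.
    trajectory.extend(("think", s) for s in reasoning_segments[r:])
    return trajectory

# Toy entailment oracle: an answer prefix counts as supported once at
# least as many reasoning segments as answer segments have been emitted.
toy_entails = lambda r_prefix, a_prefix: len(r_prefix) >= len(a_prefix)

traj = build_interleaved(["r1", "r2", "r3"], ["a1", "a2"], toy_entails)
# traj alternates think/speak actions, ending with the leftover reasoning:
# [("think","r1"), ("speak","a1"), ("think","r2"), ("speak","a2"), ("think","r3")]
```

A trajectory in this form also makes the token-level latency proxy concrete: the "inter-update waiting time" is the amount of private reasoning between consecutive "speak" actions.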