Fractured Chain-of-Thought Reasoning
May 19, 2025
Authors: Baohao Liao, Hanze Dong, Yuhui Xu, Doyen Sahoo, Christof Monz, Junnan Li, Caiming Xiong
cs.AI
Abstract
Inference-time scaling techniques have significantly bolstered the reasoning
capabilities of large language models (LLMs) by harnessing additional
computational effort at inference without retraining. Similarly,
Chain-of-Thought (CoT) prompting and its extension, Long CoT, improve accuracy
by generating rich intermediate reasoning trajectories, but these approaches
incur substantial token costs that impede their deployment in latency-sensitive
settings. In this work, we first show that truncated CoT, which stops reasoning
before completion and directly generates the final answer, often matches full
CoT sampling while using dramatically fewer tokens. Building on this insight,
we introduce Fractured Sampling, a unified inference-time strategy that
interpolates between full CoT and solution-only sampling along three orthogonal
axes: (1) the number of reasoning trajectories, (2) the number of final
solutions per trajectory, and (3) the depth at which reasoning traces are
truncated. Through extensive experiments on five diverse reasoning benchmarks
and several model scales, we demonstrate that Fractured Sampling consistently
achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling
gains in Pass@k versus token budget. Our analysis reveals how to allocate
computation across these dimensions to maximize performance, paving the way for
more efficient and scalable LLM reasoning.
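The truncated-CoT idea is simple enough to sketch in code. The following is a minimal illustration, not the authors' implementation: `generate` is a hypothetical stand-in for any LLM sampling call (here a toy stub so the example runs end to end), and truncation is done by characters purely for simplicity, whereas in practice the reasoning trace would be cut at a token or step boundary.

```python
# Minimal sketch of truncated CoT (illustrative only, not the paper's code).

def generate(prompt: str, max_tokens: int = 64) -> str:
    """Toy stand-in for an LLM sampling call; a real system would query a model."""
    return "Step 1: simplify. Step 2: substitute. So the answer is 42."[:max_tokens]

def truncated_cot(question: str, depth: int) -> str:
    """Stop the reasoning trace after `depth` units, then force a final answer
    directly from the partial trace instead of reasoning to completion."""
    trace = generate(f"Question: {question}\nLet's think step by step.",
                     max_tokens=depth)
    return generate(f"Question: {question}\n{trace}\nFinal answer:",
                    max_tokens=16)
```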
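Fractured Sampling then sweeps the three orthogonal axes named in the abstract: n independent reasoning trajectories, several truncation depths per trajectory, and m solution samples per truncated prefix. Below is a hedged sketch reusing the hypothetical `generate` stub above; majority vote is one reasonable aggregation choice, not necessarily the paper's.

```python
from collections import Counter

def fractured_sampling(question: str, n_traces: int, m_solutions: int,
                       depths: list[int]) -> str:
    """Interpolate between full-CoT and solution-only sampling along three axes:
    (1) number of trajectories, (2) solutions per prefix, (3) truncation depth."""
    candidates = []
    for _ in range(n_traces):                    # axis 1: reasoning trajectories
        trace = generate(f"Question: {question}\nLet's think step by step.",
                         max_tokens=max(depths))
        for depth in depths:                     # axis 3: truncation depths
            prefix = trace[:depth]               # reuse one trace at many depths
            for _ in range(m_solutions):         # axis 2: solutions per prefix
                candidates.append(
                    generate(f"Question: {question}\n{prefix}\nFinal answer:",
                             max_tokens=16))
    return Counter(candidates).most_common(1)[0][0]  # aggregate by majority vote

# Example: 2 traces x 3 depths x 2 solutions = 12 candidate answers.
print(fractured_sampling("What is 6 * 7?", n_traces=2, m_solutions=2,
                         depths=[20, 40, 60]))
```

Truncating one shared trace at several depths amortizes the cost of the early reasoning tokens across many answer samples, which is where the favorable accuracy-cost trade-off comes from.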
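For reference, the abstract does not spell out its Pass@k estimator; the standard unbiased estimator from Chen et al. (2021) is the conventional choice. With n sampled solutions per problem, of which c are correct,

$$
\text{pass@}k \;=\; \mathop{\mathbb{E}}_{\text{problems}}\left[\,1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\,\right].
$$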