Fractured Chain-of-Thought Reasoning
May 19, 2025
Authors: Baohao Liao, Hanze Dong, Yuhui Xu, Doyen Sahoo, Christof Monz, Junnan Li, Caiming Xiong
cs.AI
Abstract
Inference-time scaling techniques have significantly bolstered the reasoning
capabilities of large language models (LLMs) by harnessing additional
computational effort at inference without retraining. Similarly,
Chain-of-Thought (CoT) prompting and its extension, Long CoT, improve accuracy
by generating rich intermediate reasoning trajectories, but these approaches
incur substantial token costs that impede their deployment in latency-sensitive
settings. In this work, we first show that truncated CoT, which stops reasoning
before completion and directly generates the final answer, often matches full
CoT sampling while using dramatically fewer tokens. Building on this insight,
we introduce Fractured Sampling, a unified inference-time strategy that
interpolates between full CoT and solution-only sampling along three orthogonal
axes: (1) the number of reasoning trajectories, (2) the number of final
solutions per trajectory, and (3) the depth at which reasoning traces are
truncated. Through extensive experiments on five diverse reasoning benchmarks
and several model scales, we demonstrate that Fractured Sampling consistently
achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling
gains in Pass@k versus token budget. Our analysis reveals how to allocate
computation across these dimensions to maximize performance, paving the way for
more efficient and scalable LLM reasoning.
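
As a sketch of the method, the three orthogonal axes above compose into an n × h × m grid of candidates: each of n reasoning trajectories is cut at h truncation depths, and m final answers are decoded from each truncated prefix. The snippet below is a minimal illustration under assumptions, with a hypothetical `llm_generate` function standing in for a real inference call; the step segmentation and uniform depth schedule are illustrative choices, not the paper's implementation.

```python
from typing import List, Tuple


def llm_generate(prompt: str, seed: int) -> str:
    """Hypothetical LLM call; replace with a real model or API."""
    return f"[completion of {len(prompt)}-char prompt, seed={seed}]"


def fractured_sampling(question: str, n: int = 4, h: int = 3, m: int = 2
                       ) -> List[Tuple[int, int, int, str]]:
    """Enumerate (trajectory, depth, sample, answer) candidates."""
    candidates = []
    for i in range(n):                    # axis 1: independent reasoning trajectories
        trace = llm_generate(
            f"Question: {question}\nLet's think step by step.", seed=i)
        steps = trace.split(". ")         # crude step segmentation (illustrative)
        for d in range(1, h + 1):         # axis 3: truncation depth of the trace
            cut = max(1, len(steps) * d // h)
            prefix = ". ".join(steps[:cut])
            for j in range(m):            # axis 2: final answers per truncated prefix
                answer = llm_generate(
                    f"Question: {question}\n{prefix}\nFinal answer:",
                    seed=1000 * i + 10 * d + j)
                candidates.append((i, d, j, answer))
    return candidates                     # n * h * m candidates in total
```

In this sketch, h = 1 with m = 1 recovers ordinary full-CoT sampling, while decoding answers from a single shallow prefix approximates solution-only sampling; intermediate settings trade trace tokens for answer diversity. Pass@k in the abstract is conventionally computed with the standard unbiased estimator over N sampled candidates per problem, of which C are correct:

```latex
\text{Pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{N-C}{k}}{\binom{N}{k}} \,\right]
```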