LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling
May 25, 2025
Authors: Yang Xiao, Jiashuo Wang, Ruifeng Yuan, Chunpu Xu, Kaishuai Xu, Wenjie Li, Pengfei Liu
cs.AI
Abstract
Large language models (LLMs) have demonstrated remarkable reasoning
capabilities through test-time scaling approaches, particularly when fine-tuned
with chain-of-thought (CoT) data distilled from more powerful large reasoning
models (LRMs). However, these reasoning chains often contain verbose elements
that mirror human problem-solving, categorized as progressive reasoning (the
essential solution development path) and functional elements (verification
processes, alternative solution approaches, and error corrections). While
progressive reasoning is crucial, the functional elements significantly
increase computational demands during test-time inference. We introduce PIR
(Perplexity-based Importance Refinement), a principled framework that
quantitatively evaluates the importance of each reasoning step based on its
impact on answer prediction confidence. PIR systematically identifies and
selectively prunes only low-importance functional steps while preserving
progressive reasoning components, creating optimized training data that
maintains the integrity of the core solution path while reducing verbosity.
Models fine-tuned on PIR-optimized data exhibit superior test-time scaling
properties, generating more concise reasoning chains while achieving improved
accuracy (+0.9% to +6.6%) with significantly reduced token usage (-3% to
-41%) across challenging reasoning benchmarks (AIME, AMC, and GPQA Diamond).
Our approach demonstrates strong generalizability across different model sizes,
data sources, and token budgets, offering a practical solution for deploying
reasoning-capable LLMs in scenarios where efficient test-time scaling, response
time, and computational efficiency are critical constraints.
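
To make the PIR idea concrete, below is a minimal Python sketch of how a perplexity-based step-importance score and the resulting selective pruning could look. It assumes a Hugging Face causal LM; the model name, the relative-perplexity scoring rule, the importance threshold, and the "progressive" vs. "functional" step labels are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a PIR-style importance score: how much does removing
# one reasoning step raise the model's perplexity on the final answer?
# The abstract only states that importance is measured via the step's impact
# on answer prediction confidence; the relative-change rule below is assumed.
import math
from typing import List

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def answer_perplexity(question: str, steps: List[str], answer: str) -> float:
    """Perplexity of the answer tokens conditioned on the question and reasoning steps."""
    context = question + "\n" + "\n".join(steps) + "\n"
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    ans_ids = tokenizer(answer, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([ctx_ids, ans_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : ctx_ids.shape[1]] = -100  # score only the answer tokens
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss  # mean NLL over answer tokens
    return math.exp(loss.item())


def step_importance(question: str, steps: List[str], answer: str, i: int) -> float:
    """Relative rise in answer perplexity when step i is deleted (higher = more important)."""
    ppl_full = answer_perplexity(question, steps, answer)
    pruned = steps[:i] + steps[i + 1:]
    ppl_pruned = answer_perplexity(question, pruned, answer)
    return (ppl_pruned - ppl_full) / ppl_full


def prune_low_importance(question, steps, step_types, answer, threshold=0.0):
    """Keep all progressive-reasoning steps; drop functional steps below the threshold."""
    kept = []
    for i, (step, kind) in enumerate(zip(steps, step_types)):
        if kind == "progressive" or step_importance(question, steps, answer, i) > threshold:
            kept.append(step)
    return kept
```

Consistent with the abstract, steps classified as progressive reasoning are always retained, and only low-importance functional steps (verification, alternative approaches, error corrections) are pruned to build the refined fine-tuning data.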