
LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling

May 25, 2025
Authors: Yang Xiao, Jiashuo Wang, Ruifeng Yuan, Chunpu Xu, Kaishuai Xu, Wenjie Li, Pengfei Liu
cs.AI

Abstract

Large language models (LLMs) have demonstrated remarkable reasoning capabilities through test-time scaling approaches, particularly when fine-tuned with chain-of-thought (CoT) data distilled from more powerful large reasoning models (LRMs). However, these reasoning chains often contain verbose elements that mirror human problem-solving, categorized as progressive reasoning (the essential solution development path) and functional elements (verification processes, alternative solution approaches, and error corrections). While progressive reasoning is crucial, the functional elements significantly increase computational demands during test-time inference. We introduce PIR (Perplexity-based Importance Refinement), a principled framework that quantitatively evaluates the importance of each reasoning step based on its impact on answer prediction confidence. PIR systematically identifies and selectively prunes only low-importance functional steps while preserving progressive reasoning components, creating optimized training data that maintains the integrity of the core solution path while reducing verbosity. Models fine-tuned on PIR-optimized data exhibit superior test-time scaling properties, generating more concise reasoning chains while achieving improved accuracy (+0.9% to +6.6%) with significantly reduced token usage (-3% to -41%) across challenging reasoning benchmarks (AIME, AMC, and GPQA Diamond). Our approach generalizes well across model sizes, data sources, and token budgets, offering a practical solution for deploying reasoning-capable LLMs in scenarios where response time and compute budget constrain test-time scaling.
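The abstract does not spell out PIR's scoring formula, but its core idea (rank each reasoning step by its impact on answer prediction confidence) can be sketched as a perplexity ablation: score a step by how much the answer's perplexity rises when that step is removed from the chain. The sketch below is one illustrative reading, not the paper's implementation; the model name, prompt layout, and removal-based importance score are all assumptions.

```python
# Hypothetical PIR-style step scoring via answer-perplexity ablation.
# Model choice and scoring details are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # assumed; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

@torch.no_grad()
def answer_perplexity(context: str, answer: str) -> float:
    """Perplexity of the answer tokens conditioned on a reasoning context."""
    ctx = tokenizer(context, return_tensors="pt").input_ids
    ans = tokenizer(answer, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx, ans], dim=1).to(model.device)
    labels = input_ids.clone()
    labels[:, : ctx.shape[1]] = -100  # mask context; score only answer tokens
    loss = model(input_ids=input_ids, labels=labels).loss  # mean NLL over answer
    return float(torch.exp(loss))

def step_importance(question: str, steps: list[str], answer: str, i: int) -> float:
    """How much answer perplexity rises when step i is ablated from the chain."""
    full = question + "\n" + "\n".join(steps) + "\n"
    ablated = question + "\n" + "\n".join(
        s for j, s in enumerate(steps) if j != i
    ) + "\n"
    return answer_perplexity(ablated, answer) - answer_perplexity(full, answer)
```

Under this reading, steps whose removal barely raises answer perplexity (often functional elements such as re-verification or alternative attempts) become pruning candidates, while progressive reasoning steps score high and are preserved in the refined training data.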
