LIMOPro: 効率的かつ効果的なテスト時スケーリングのための推論改良

要旨

大規模言語モデル（LLM）は、テスト時のスケーリング手法を通じて顕著な推論能力を示しており、特に強力な大規模推論モデル（LRM）から蒸留された連鎖的思考（CoT）データでファインチューニングされた場合にその能力が発揮されます。しかし、これらの推論連鎖には、人間の問題解決を反映した冗長な要素が含まれることが多く、それらは進行的推論（本質的な解決策の開発経路）と機能的な要素（検証プロセス、代替解決策のアプローチ、エラー修正）に分類されます。進行的推論は重要ですが、機能的な要素はテスト時の推論において計算負荷を大幅に増加させます。本論文では、PIR（Perplexity-based Importance Refinement）を提案します。これは、各推論ステップの重要性を、回答予測の信頼度への影響に基づいて定量的に評価する原則的なフレームワークです。PIRは、低重要度の機能的なステップを体系的に識別し、選択的に刈り込みながら、進行的推論の要素を保持し、核心的な解決経路の整合性を維持しながら冗長性を削減した最適化されたトレーニングデータを生成します。PIR最適化データでファインチューニングされたモデルは、テスト時のスケーリング特性が優れており、より簡潔な推論連鎖を生成しながら、精度を向上させ（+0.9\% から +6.6\%）、トークン使用量を大幅に削減（-3\% から -41\%）することが、挑戦的な推論ベンチマーク（AIME、AMC、GPQA Diamond）で確認されました。本アプローチは、異なるモデルサイズ、データソース、トークン予算において強い汎用性を示し、効率的なテスト時のスケーリング、応答時間、計算効率が重要な制約となるシナリオでの推論能力を持つLLMの実用的なソリューションを提供します。

English

Large language models (LLMs) have demonstrated remarkable reasoning capabilities through test-time scaling approaches, particularly when fine-tuned with chain-of-thought (CoT) data distilled from more powerful large reasoning models (LRMs). However, these reasoning chains often contain verbose elements that mirror human problem-solving, categorized as progressive reasoning (the essential solution development path) and functional elements (verification processes, alternative solution approaches, and error corrections). While progressive reasoning is crucial, the functional elements significantly increase computational demands during test-time inference. We introduce PIR (Perplexity-based Importance Refinement), a principled framework that quantitatively evaluates the importance of each reasoning step based on its impact on answer prediction confidence. PIR systematically identifies and selectively prunes only low-importance functional steps while preserving progressive reasoning components, creating optimized training data that maintains the integrity of the core solution path while reducing verbosity. Models fine-tuned on PIR-optimized data exhibit superior test-time scaling properties, generating more concise reasoning chains while achieving improved accuracy (+0.9\% to +6.6\%) with significantly reduced token usage (-3\% to -41\%) across challenging reasoning benchmarks (AIME, AMC, and GPQA Diamond). Our approach demonstrates strong generalizability across different model sizes, data sources, and token budgets, offering a practical solution for deploying reasoning-capable LLMs in scenarios where efficient test-time scaling, response time, and computational efficiency are valuable constraints.

LIMOPro: 効率的かつ効果的なテスト時スケーリングのための推論改良

LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling

要旨

Support