AdaCtrl: Towards Adaptive and Controllable Reasoning via Difficulty-Aware Budgeting

Abstract

Modern large reasoning models demonstrate impressive problem-solving capabilities by employing sophisticated reasoning strategies. However, they often struggle to balance efficiency and effectiveness, frequently generating unnecessarily lengthy reasoning chains for simple problems. In this work, we propose AdaCtrl, a novel framework that supports both difficulty-aware adaptive allocation of the reasoning budget and explicit user control over reasoning depth. AdaCtrl dynamically adjusts its reasoning length based on self-assessed problem difficulty, while also allowing users to manually control the budget to prioritize either efficiency or effectiveness. This is achieved through a two-stage training pipeline: an initial cold-start fine-tuning phase that instills the ability to self-assess problem difficulty and adjust the reasoning budget accordingly, followed by a difficulty-aware reinforcement learning (RL) stage that refines the model's adaptive reasoning strategies and calibrates its difficulty assessments based on its evolving capabilities during online training. To enable intuitive user interaction, we design explicit length-triggered tags that function as a natural interface for budget control. Empirical results show that AdaCtrl adapts its reasoning length to the estimated difficulty. Compared to a standard training baseline that also incorporates fine-tuning and RL, it yields performance improvements while simultaneously reducing response length by 10.06% and 12.14% on the more challenging AIME2024 and AIME2025 datasets, which require elaborate reasoning, and by 62.05% and 91.04% on the MATH500 and GSM8K datasets, where more concise responses are sufficient. Furthermore, AdaCtrl enables precise user control over the reasoning budget, allowing responses to be tailored to specific needs.
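
As an illustration of how length-triggered tags could serve as an interface for budget control, the following is a minimal sketch and not the implementation described in this work. It assumes the tag is the first thing in the response, so that leaving it unset lets the model choose (adaptive mode) and pre-filling it forces a budget (user control). The tag strings `[Easy]` and `[Hard]` and the helper names `build_generation_prefix` and `parse_difficulty` are hypothetical placeholders; the abstract does not specify them.

```python
from typing import Optional

# Hypothetical tag strings, chosen for illustration only.
TAGS = {"easy": "[Easy]", "hard": "[Hard]"}


def build_generation_prefix(budget: Optional[str] = None) -> str:
    """Return the text used to start the model's response.

    With budget=None the prefix is empty, so the model would emit its own
    tag (adaptive mode). Otherwise the chosen tag is forced (user control).
    """
    if budget is None:
        return ""
    if budget not in TAGS:
        raise ValueError(f"unknown budget: {budget!r}")
    return TAGS[budget]


def parse_difficulty(response: str) -> Optional[str]:
    """Read the self-assessed difficulty from the leading tag, if any."""
    text = response.lstrip()
    for level, tag in TAGS.items():
        if text.startswith(tag):
            return level
    return None
```

Under these assumptions, calling `build_generation_prefix("hard")` would request elaborate reasoning regardless of the model's own estimate, while `parse_difficulty` recovers the model's self-assessment from an adaptively generated response.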