Think Right: Learning to Mitigate Under-Over Thinking via Adaptive, Attentive Compression

Abstract

Recent thinking models solve complex reasoning tasks by scaling test-time compute, but this scaling must be allocated in line with task difficulty. On the one hand, short reasoning (underthinking) leads to errors on harder problems that require extended reasoning steps. On the other hand, excessively long reasoning (overthinking) can be token-inefficient, generating unnecessary steps even after reaching a correct intermediate solution. We refer to this as under-adaptivity, where the model fails to modulate its response length appropriately given problems of varying difficulty. To address under-adaptivity and strike a balance between under- and overthinking, we propose TRAAC (Think Right with Adaptive, Attentive Compression), an online post-training RL method that leverages the model's self-attention over a long reasoning trajectory to identify important steps and prune redundant ones. TRAAC also estimates difficulty and incorporates it into training rewards, thereby learning to allocate reasoning budget commensurate with example difficulty. Our approach improves accuracy, reduces reasoning steps, and enables adaptive thinking compared to base models and other RL baselines. Across a variety of tasks (AIME, AMC, GPQA-D, BBEH), TRAAC (Qwen3-4B) achieves an average absolute accuracy gain of 8.4% with a relative reduction in reasoning length of 36.8% compared to the base model, and a 7.9% accuracy gain paired with a 29.4% length drop compared to the best RL baseline. TRAAC also shows strong generalization: although our models are trained on math datasets, they show accuracy and efficiency gains on out-of-distribution non-math datasets like GPQA-D, BBEH, and OptimalThinkingBench. Our analysis further verifies that TRAAC provides fine-grained adjustments to thinking budget based on difficulty and that a combination of task-difficulty calibration and attention-based compression yields gains across diverse tasks.
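To make the two ingredients named in the abstract more concrete, the following is a minimal toy sketch of attention-guided step pruning with a difficulty-dependent budget. It is not the TRAAC implementation: the function names, the pass-rate difficulty estimate, the linear keep-fraction schedule with its `min_keep` and `max_keep` bounds, and the per-step attention scores are all assumptions introduced here for illustration. The abstract does not specify any of these details, and it describes difficulty as entering the training reward, which this sketch does not model.

```python
from typing import List, Sequence


def estimate_difficulty(rollout_correct: Sequence[bool]) -> float:
    """Toy difficulty estimate (assumption): fraction of sampled rollouts that were wrong."""
    if not rollout_correct:
        raise ValueError("need at least one rollout")
    return 1.0 - sum(rollout_correct) / len(rollout_correct)


def keep_fraction(difficulty: float, min_keep: float = 0.4, max_keep: float = 0.9) -> float:
    """Hypothetical linear schedule: harder problems keep more of their reasoning steps."""
    return min_keep + (max_keep - min_keep) * difficulty


def prune_steps(steps: Sequence[str], attention: Sequence[float], difficulty: float) -> List[str]:
    """Keep the highest-attention steps, preserving their original order."""
    if len(steps) != len(attention):
        raise ValueError("one attention score per step is required")
    if not steps:
        return []
    # Always keep at least one step, even at the most aggressive budget.
    n_keep = max(1, round(keep_fraction(difficulty) * len(steps)))
    # Rank step indices by attention score, highest first.
    ranked = sorted(range(len(steps)), key=lambda i: attention[i], reverse=True)
    # Restore the original ordering of the surviving steps.
    kept = sorted(ranked[:n_keep])
    return [steps[i] for i in kept]


if __name__ == "__main__":
    steps = ["restate", "try x=2", "recheck x=2", "conclude"]
    attention = [0.30, 0.45, 0.05, 0.20]  # made-up per-step scores
    easy = estimate_difficulty([True, True, True, True])
    hard = estimate_difficulty([False, False, True, False])
    print(prune_steps(steps, attention, easy))
    print(prune_steps(steps, attention, hard))
```

With these made-up scores, the easy problem keeps two of the four steps and the hard problem keeps three, so the low-attention "recheck" step is dropped in both cases while the harder problem retains a longer trace.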
