Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models

Abstract

Large reasoning models (LRMs) already possess a latent capacity for long chain-of-thought reasoning. Prior work has shown that outcome-based reinforcement learning (RL) can incidentally elicit advanced reasoning behaviors such as self-correction, backtracking, and verification, phenomena often referred to as the model's "aha moment". However, the timing and consistency of these emergent behaviors remain unpredictable and uncontrollable, limiting the scalability and reliability of LRMs' reasoning capabilities. To address these limitations, we move beyond reliance on prompts and coincidental "aha moments". Instead, we explicitly align models with three meta-abilities (deduction, induction, and abduction) using automatically generated, self-verifiable tasks. Our three-stage pipeline (individual alignment, parameter-space merging, and domain-specific reinforcement learning) boosts performance by over 10% relative to instruction-tuned baselines. Furthermore, domain-specific RL from the aligned checkpoint yields an additional 2% average gain in the performance ceiling across math, coding, and science benchmarks, demonstrating that explicit meta-ability alignment offers a scalable and dependable foundation for reasoning. Code is available at: https://github.com/zhiyuanhubj/Meta-Ability-Alignment
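
As a rough illustration of the parameter-space merging stage named in the abstract, the sketch below combines three separately aligned checkpoints by a weighted average of their parameters. The abstract does not state the merging rule, so the linear average, the function name `merge_checkpoints`, the weights, and the toy parameter values are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List, Sequence

# A toy "checkpoint": parameter name -> flat list of values.
# Real checkpoints hold tensors; plain lists keep the sketch dependency-free.
StateDict = Dict[str, List[float]]


def merge_checkpoints(checkpoints: Sequence[StateDict],
                      weights: Sequence[float]) -> StateDict:
    """Weighted average of parameters (assumed rule, hypothetical helper)."""
    if not checkpoints or len(checkpoints) != len(weights):
        raise ValueError("need one weight per checkpoint")
    keys = checkpoints[0].keys()
    if any(ckpt.keys() != keys for ckpt in checkpoints):
        raise ValueError("checkpoints must share the same parameter names")

    merged: StateDict = {}
    for name in keys:
        size = len(checkpoints[0][name])
        # Combine each parameter element-wise across the checkpoints.
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(size)
        ]
    return merged


# Hypothetical checkpoints, one per meta-ability.
deduction = {"w": [1.0, 2.0]}
induction = {"w": [3.0, 4.0]}
abduction = {"w": [5.0, 6.0]}

merged = merge_checkpoints([deduction, induction, abduction],
                           [0.5, 0.25, 0.25])
print(merged)
```

In this sketch the weights control how much each specialist contributes to the merged model. How such weights would be chosen in practice is not described in the abstract.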
