RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems

Abstract

Reasoning requires going beyond pattern matching or memorization of solutions to identify and implement "algorithmic procedures" that can be used to deduce answers to hard problems. Doing so requires recognizing the most relevant primitives, intermediate results, or shared procedures, and building upon them. While RL post-training on long chains of thought ultimately aims to uncover this kind of algorithmic behavior, most reasoning traces learned by large models fail to consistently capture or reuse procedures, instead drifting into verbose and degenerate exploration. To enable more effective reasoning, we introduce reasoning abstractions: concise natural language descriptions of procedural and factual knowledge that guide the model toward learning successful reasoning. We train models to propose multiple abstractions for a given problem, then apply RL that incentivizes building a solution using the information these abstractions provide. This results in a two-player RL training paradigm, abbreviated as RLAD, that jointly trains an abstraction generator and a solution generator. This setup effectively enables structured exploration, decouples the learning signals of abstraction proposal and solution generation, and improves generalization to harder problems. We also show that, at large test budgets, allocating more test-time compute to generating abstractions improves performance more than generating additional solutions, illustrating the role of abstractions in guiding meaningful exploration.
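The propose-then-solve data flow described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `propose` and `solve` callables stand in for the abstraction generator and the solution generator, `sample_answers` and `majority_vote` are hypothetical helpers, and majority voting is one possible way to aggregate answers, not necessarily the authors' method. The RL training step itself is not shown.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical stand-ins for the two policies (assumptions, not the paper's API):
#   propose(problem) -> a short natural language abstraction
#   solve(problem, abstraction) -> a final answer conditioned on that abstraction
Propose = Callable[[str], str]
Solve = Callable[[str, str], str]


def sample_answers(
    problem: str,
    propose: Propose,
    solve: Solve,
    num_abstractions: int,
    solutions_per_abstraction: int,
) -> List[str]:
    """Sample abstractions, then sample solutions conditioned on each one."""
    answers: List[str] = []
    for _ in range(num_abstractions):
        abstraction = propose(problem)
        for _ in range(solutions_per_abstraction):
            answers.append(solve(problem, abstraction))
    return answers


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer (assumed aggregation rule)."""
    if not answers:
        raise ValueError("no answers to aggregate")
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    # Toy generators so the sketch runs without any model.
    def toy_propose(problem: str) -> str:
        return "Pair terms from both ends of the sum."

    def toy_solve(problem: str, abstraction: str) -> str:
        return "5050"

    sampled = sample_answers("Sum the integers 1..100.", toy_propose, toy_solve, 4, 2)
    print(majority_vote(sampled))
```

In this sketch the total test-time budget is `num_abstractions * solutions_per_abstraction` samples. The abstract's claim about compute allocation corresponds to increasing `num_abstractions` rather than `solutions_per_abstraction` once that budget is large.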
