LADDER: Self-Improving LLMs Through Recursive Problem Decomposition

Abstract

We introduce LADDER (Learning through Autonomous Difficulty-Driven Example Recursion), a framework that enables large language models to autonomously improve their problem-solving capabilities by recursively generating and solving progressively simpler variants of complex problems. Unlike prior approaches that require curated datasets or human feedback, LADDER relies on the model's own capabilities to generate easier question variants. We demonstrate LADDER's effectiveness on mathematical integration, improving Llama 3.2 3B's accuracy from 1% to 82% on undergraduate-level problems and enabling Qwen2.5 7B Deepseek-R1 Distilled to achieve 73% on the MIT Integration Bee qualifying examination. We also introduce TTRL (Test-Time Reinforcement Learning), which performs reinforcement learning on variants of test problems at inference time. TTRL enables Qwen2.5 7B Deepseek-R1 Distilled to achieve a state-of-the-art score of 90% on the MIT Integration Bee qualifying examination, surpassing OpenAI o1's performance. These results show that self-directed strategic learning can achieve significant capability improvements without relying on architectural scaling or human supervision.
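
The recursive generation of simpler variants described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Node`, `build_variant_tree`, `easiest_first`, and `toy_make_easier` are hypothetical, the `make_easier` callback stands in for a model call that the abstract does not specify, and ordering the variants from simplest to hardest is an assumption made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One problem in the variant tree, with its recursion depth."""
    problem: str
    depth: int
    children: List["Node"] = field(default_factory=list)


def build_variant_tree(
    problem: str,
    make_easier: Callable[[str, int], List[str]],
    max_depth: int,
    branching: int,
) -> Node:
    """Recursively expand `problem` into progressively simpler variants."""

    def expand(p: str, depth: int) -> Node:
        node = Node(p, depth)
        if depth < max_depth:
            # Ask the (hypothetical) generator for `branching` easier variants.
            for variant in make_easier(p, branching):
                node.children.append(expand(variant, depth + 1))
        return node

    return expand(problem, 0)


def easiest_first(root: Node) -> List[Node]:
    """Flatten the tree, deepest (assumed simplest) variants first."""
    nodes: List[Node] = []
    stack = [root]
    while stack:
        n = stack.pop()
        nodes.append(n)
        stack.extend(n.children)
    # Assumption: deeper nodes are easier, so sort by descending depth.
    return sorted(nodes, key=lambda n: -n.depth)


def toy_make_easier(problem: str, k: int) -> List[str]:
    # Hypothetical stand-in for a model call: it only tags the problem text.
    return [f"simpler{i}({problem})" for i in range(k)]


tree = build_variant_tree("int x^2 dx", toy_make_easier, max_depth=2, branching=2)
curriculum = easiest_first(tree)
```

With `max_depth=2` and `branching=2`, the tree holds 1 + 2 + 4 = 7 problems, and the original problem is the last entry in `curriculum`. In a real system the toy generator would be replaced by a model prompt, and each variant would need a verifiable answer before being used for training.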
