LADDER: Self-Improving LLMs Through Recursive Problem Decomposition
March 2, 2025
Authors: Toby Simonds, Akira Yoshiyama
cs.AI
Abstract
We introduce LADDER (Learning through Autonomous Difficulty-Driven Example
Recursion), a framework which enables Large Language Models to autonomously
improve their problem-solving capabilities through self-guided learning by
recursively generating and solving progressively simpler variants of complex
problems. Unlike prior approaches that require curated datasets or human
feedback, LADDER leverages a model's own capabilities to generate easier
question variants. We demonstrate LADDER's effectiveness in the subject of
mathematical integration, improving Llama 3.2 3B's accuracy from 1% to 82% on
undergraduate-level problems and enabling Qwen2.5 7B Deepseek-R1 Distilled to
achieve 73% on the MIT Integration Bee qualifying examination. We also
introduce TTRL (Test-Time Reinforcement Learning), where we perform
reinforcement learning on variants of test problems at inference time. TTRL
enables Qwen2.5 7B Deepseek-R1 Distilled to achieve a state-of-the-art score of
90% on the MIT Integration Bee qualifying examination, surpassing OpenAI o1's
performance. These results show how self-directed strategic learning can
achieve significant capability improvements without relying on architectural
scaling or human supervision.
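
The sketch below is a minimal, illustrative reading of the two mechanisms named in the abstract: recursively asking the model for progressively simpler variants of a hard problem (the LADDER curriculum), and a verifiable reward for the integration domain obtained by numerically checking a candidate antiderivative. All function names here (`ask_model_for_simpler_variants`, `build_ladder`, `antiderivative_reward`) are hypothetical placeholders, not the authors' implementation; TTRL, as described, would apply the same kind of reinforcement-learning update to variants generated from each test problem at inference time.

```python
# Illustrative sketch only, not the paper's code.
# (1) Recursively expand a hard problem into a ladder of simpler variants.
# (2) Score a proposed antiderivative with a numeric, label-free reward.

import math
import random


def ask_model_for_simpler_variants(problem: str, n: int = 2) -> list[str]:
    """Placeholder for an LLM call that proposes easier variants of `problem`."""
    return [f"{problem} (simplified #{i})" for i in range(n)]


def build_ladder(problem: str, depth: int) -> list[str]:
    """Recursively collect the problem plus progressively simpler variants."""
    if depth == 0:
        return [problem]
    ladder = [problem]
    for easier in ask_model_for_simpler_variants(problem):
        ladder.extend(build_ladder(easier, depth - 1))
    return ladder


def antiderivative_reward(integrand, candidate_antiderivative,
                          a: float = 0.0, b: float = 1.0) -> float:
    """Reward 1.0 if d/dx(candidate) matches the integrand at random points."""
    h = 1e-5
    for _ in range(20):
        x = random.uniform(a, b)
        deriv = (candidate_antiderivative(x + h)
                 - candidate_antiderivative(x - h)) / (2 * h)
        if not math.isclose(deriv, integrand(x), rel_tol=1e-3, abs_tol=1e-3):
            return 0.0
    return 1.0


if __name__ == "__main__":
    # A two-level ladder of (placeholder) simpler variants of one problem.
    print(build_ladder("integrate x*exp(x^2) dx", depth=2))
    # exp(x^2)/2 is a correct antiderivative of x*exp(x^2), so reward is 1.0.
    print(antiderivative_reward(lambda x: x * math.exp(x**2),
                                lambda x: math.exp(x**2) / 2))
```

In a training loop, rewards of this kind would feed a policy-gradient update on each variant, easiest first, so the model climbs the ladder toward the original problem; the abstract does not specify the RL algorithm, so that detail is left abstract here.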