RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems
October 2, 2025
Authors: Yuxiao Qu, Anikait Singh, Yoonho Lee, Amrith Setlur, Ruslan Salakhutdinov, Chelsea Finn, Aviral Kumar
cs.AI
Abstract
Reasoning requires going beyond pattern matching or memorization of solutions
to identify and implement "algorithmic procedures" that can be used to deduce
answers to hard problems. Doing so requires recognizing the most relevant
primitives, intermediate results, or shared procedures, and building upon them.
While reinforcement learning (RL) post-training on long chains of thought ultimately aims to uncover
this kind of algorithmic behavior, most reasoning traces learned by large
models fail to consistently capture or reuse procedures, instead drifting into
verbose and degenerate exploration. To enable more effective reasoning, we
introduce reasoning abstractions: concise natural language descriptions of
procedural and factual knowledge that guide the model toward learning
successful reasoning. We train models to propose multiple abstractions for a
given problem, followed by RL that incentivizes building a solution using the
information provided by these abstractions. This
results in a two-player RL training paradigm, abbreviated as RLAD, that jointly
trains an abstraction generator and a solution generator. This setup
effectively enables structured exploration, decouples learning signals of
abstraction proposal and solution generation, and improves generalization to
harder problems. We also show that allocating more test-time compute to
generating abstractions is more beneficial for performance than generating more
solutions at large test budgets, illustrating the role of abstractions in
guiding meaningful exploration.
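
The two-player paradigm lends itself to a short sketch. Below is a minimal, hypothetical Python illustration of one RLAD training step, written under assumptions not stated in the abstract: the StubPolicy class, the verify reward function, and the mean-baseline advantage scheme are placeholders invented for this sketch, standing in for two LLM policies updated with a policy-gradient method. The abstract specifies only that an abstraction generator and a solution generator are trained jointly with RL under decoupled learning signals.

```python
import random
from dataclasses import dataclass, field

# Hypothetical stand-in for each of the two policies. In the real system
# these would be LLMs updated by a policy-gradient method; here they are
# stubs so that the control flow of one RLAD step is runnable end to end.
@dataclass
class StubPolicy:
    name: str
    history: list = field(default_factory=list)

    def sample(self, prompt: str, n: int) -> list[str]:
        # An LLM would decode n candidates here; we fabricate strings.
        return [f"{self.name}-output-{i} for [{prompt[:20]}...]" for i in range(n)]

    def reinforce(self, prompt: str, output: str, advantage: float) -> None:
        # Placeholder for a policy-gradient update (e.g., a PPO/GRPO step).
        self.history.append((prompt, output, advantage))


def verify(problem: str, solution: str) -> float:
    """Hypothetical verifier: 1.0 if the final answer checks out, else 0.0."""
    return float(random.random() < 0.5)


def rlad_step(abstraction_gen: StubPolicy, solution_gen: StubPolicy,
              problem: str, n_abstractions: int = 4, n_solutions: int = 4) -> None:
    # 1) The abstraction generator proposes several candidate abstractions.
    abstractions = abstraction_gen.sample(problem, n_abstractions)

    per_abstraction_reward = []
    for abstraction in abstractions:
        # 2) The solution generator conditions on problem + abstraction.
        conditioned = f"{problem}\n\nAbstraction: {abstraction}"
        solutions = solution_gen.sample(conditioned, n_solutions)
        rewards = [verify(problem, s) for s in solutions]

        # 3) Decoupled signal #1: the solution generator is rewarded for
        #    solving the problem given the abstraction it was handed.
        mean_r = sum(rewards) / len(rewards)
        for s, r in zip(solutions, rewards):
            solution_gen.reinforce(conditioned, s, r - mean_r)
        per_abstraction_reward.append((abstraction, mean_r))

    # 4) Decoupled signal #2: the abstraction generator is rewarded by how
    #    much each abstraction lifts solver success over the other proposals.
    overall = sum(r for _, r in per_abstraction_reward) / len(per_abstraction_reward)
    for abstraction, r in per_abstraction_reward:
        abstraction_gen.reinforce(problem, abstraction, r - overall)


if __name__ == "__main__":
    pi_abs = StubPolicy("abstraction")
    pi_sol = StubPolicy("solution")
    rlad_step(pi_abs, pi_sol, "Prove that the sum of two odd integers is even.")
```

Scoring each abstraction by the average solver success it induces, relative to the other proposals, is one plausible way to realize the decoupled learning signals the abstract describes; the paper's actual reward design may differ.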