RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems
October 2, 2025
Authors: Yuxiao Qu, Anikait Singh, Yoonho Lee, Amrith Setlur, Ruslan Salakhutdinov, Chelsea Finn, Aviral Kumar
cs.AI
Abstract
Reasoning requires going beyond pattern matching or memorization of solutions
to identify and implement "algorithmic procedures" that can be used to deduce
answers to hard problems. Doing so requires recognizing the most relevant
primitives, intermediate results, or shared procedures, and building upon them.
While reinforcement learning (RL) post-training on long chains of thought ultimately aims to uncover
this kind of algorithmic behavior, most reasoning traces learned by large
models fail to consistently capture or reuse procedures, instead drifting into
verbose and degenerate exploration. To enable more effective reasoning, we
introduce reasoning abstractions: concise natural language descriptions of
procedural and factual knowledge that guide the model toward learning
successful reasoning. We train models to propose multiple abstractions for a
given problem, followed by RL that incentivizes building a solution using the
information provided by these abstractions. This
results in a two-player RL training paradigm, abbreviated as RLAD, that jointly
trains an abstraction generator and a solution generator. This setup
effectively enables structured exploration, decouples learning signals of
abstraction proposal and solution generation, and improves generalization to
harder problems. We also show that allocating more test-time compute to
generating abstractions is more beneficial for performance than generating more
solutions at large test budgets, illustrating the role of abstractions in
guiding meaningful exploration.
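
The two-player paradigm lends itself to a short sketch. Below is a minimal, hypothetical Python illustration of one RLAD training step, written under assumptions not stated in the abstract: the StubPolicy class, the verify reward function, and the mean-baseline advantage scheme are placeholders invented for this sketch, standing in for two LLM policies updated with a policy-gradient method. The abstract specifies only that an abstraction generator and a solution generator are trained jointly with RL under decoupled learning signals.

```python
import random
from dataclasses import dataclass, field

# Hypothetical stand-in for each of the two policies. In the real system
# these would be LLMs updated by a policy-gradient method; here they are
# stubs so that the control flow of one RLAD step is runnable end to end.
@dataclass
class StubPolicy:
    name: str
    history: list = field(default_factory=list)

    def sample(self, prompt: str, n: int) -> list[str]:
        # An LLM would decode n candidates here; we fabricate strings.
        return [f"{self.name}-output-{i} for [{prompt[:20]}...]" for i in range(n)]

    def reinforce(self, prompt: str, output: str, advantage: float) -> None:
        # Placeholder for a policy-gradient update (e.g., a PPO/GRPO step).
        self.history.append((prompt, output, advantage))


def verify(problem: str, solution: str) -> float:
    """Hypothetical verifier: 1.0 if the final answer checks out, else 0.0."""
    return float(random.random() < 0.5)


def rlad_step(abstraction_gen: StubPolicy, solution_gen: StubPolicy,
              problem: str, n_abstractions: int = 4, n_solutions: int = 4) -> None:
    # 1) The abstraction generator proposes several candidate abstractions.
    abstractions = abstraction_gen.sample(problem, n_abstractions)

    per_abstraction_reward = []
    for abstraction in abstractions:
        # 2) The solution generator conditions on problem + abstraction.
        conditioned = f"{problem}\n\nAbstraction: {abstraction}"
        solutions = solution_gen.sample(conditioned, n_solutions)
        rewards = [verify(problem, s) for s in solutions]

        # 3) Decoupled signal #1: the solution generator is rewarded for
        #    solving the problem given the abstraction it was handed.
        mean_r = sum(rewards) / len(rewards)
        for s, r in zip(solutions, rewards):
            solution_gen.reinforce(conditioned, s, r - mean_r)
        per_abstraction_reward.append((abstraction, mean_r))

    # 4) Decoupled signal #2: the abstraction generator is rewarded by how
    #    much each abstraction lifts solver success over the other proposals.
    overall = sum(r for _, r in per_abstraction_reward) / len(per_abstraction_reward)
    for abstraction, r in per_abstraction_reward:
        abstraction_gen.reinforce(problem, abstraction, r - overall)


if __name__ == "__main__":
    pi_abs = StubPolicy("abstraction")
    pi_sol = StubPolicy("solution")
    rlad_step(pi_abs, pi_sol, "Prove that the sum of two odd integers is even.")
```

Scoring each abstraction by the average solver success it induces, relative to the other proposals, is one plausible way to realize the decoupled learning signals the abstract describes; the paper's actual reward design may differ.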