RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems

Abstract

Reasoning requires going beyond pattern matching or memorization of solutions to identify and implement "algorithmic procedures" that can be used to deduce answers to hard problems. Doing so requires recognizing the most relevant primitives, intermediate results, or shared procedures, and building upon them. While RL post-training on long chains of thought ultimately aims to uncover this kind of algorithmic behavior, most reasoning traces learned by large models fail to consistently capture or reuse procedures, instead drifting into verbose and degenerate exploration. To enable more effective reasoning, we introduce reasoning abstractions: concise natural language descriptions of procedural and factual knowledge that guide the model toward learning successful reasoning. We train models to propose multiple abstractions for a given problem, followed by RL that incentivizes building a solution using the information these abstractions provide. This results in a two-player RL training paradigm, abbreviated as RLAD, that jointly trains an abstraction generator and a solution generator. This setup effectively enables structured exploration, decouples the learning signals of abstraction proposal and solution generation, and improves generalization to harder problems. We also show that, at large test budgets, allocating more test-time compute to generating abstractions is more beneficial for performance than generating more solutions, illustrating the role of abstractions in guiding meaningful exploration.
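As a rough illustration of the two-stage inference described above, the following sketch first proposes several abstractions and then samples solutions conditioned on each one. Everything here is an assumption made for illustration: the function names, the callable signatures, the toy stand-in generators, and the majority-vote aggregation are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical interfaces (illustrative assumptions, not the paper's API).
AbstractionGenerator = Callable[[str, int], List[str]]  # (problem, m) -> m abstractions
SolutionGenerator = Callable[[str, str], str]           # (problem, abstraction) -> answer


def solve_with_abstractions(
    problem: str,
    propose: AbstractionGenerator,
    solve: SolutionGenerator,
    num_abstractions: int,
    solutions_per_abstraction: int,
) -> str:
    """Use num_abstractions * solutions_per_abstraction solution samples in total."""
    answers: List[str] = []
    for abstraction in propose(problem, num_abstractions):
        for _ in range(solutions_per_abstraction):
            answers.append(solve(problem, abstraction))
    if not answers:
        raise ValueError("the sampling budget must be positive")
    # Majority vote is an assumed aggregation rule, chosen only for this sketch.
    return Counter(answers).most_common(1)[0][0]


# Toy stand-ins for the two trained models, so the sketch runs end to end.
def toy_propose(problem: str, m: int) -> List[str]:
    hints = ["use parity", "use induction", "try small cases"]
    return [hints[i % len(hints)] for i in range(m)]


def toy_solve(problem: str, abstraction: str) -> str:
    # Pretend that one of the three hints leads the solver astray.
    return "41" if abstraction == "try small cases" else "42"


if __name__ == "__main__":
    print(solve_with_abstractions("toy problem", toy_propose, toy_solve, 3, 2))
```

Under a fixed total budget, raising `num_abstractions` while lowering `solutions_per_abstraction` corresponds to shifting test-time compute toward abstraction generation, which is the trade-off the abstract refers to.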
