Reasoning Core: A Scalable RL Environment for LLM Symbolic Reasoning

Abstract

We introduce Reasoning Core, a new scalable environment for Reinforcement Learning with Verifiable Rewards (RLVR), designed to advance foundational symbolic reasoning in Large Language Models (LLMs). Unlike existing benchmarks that focus on games or isolated puzzles, Reasoning Core procedurally generates problems across core formal domains, including PDDL planning, first-order logic, context-free grammar parsing, causal reasoning, and system equation solving. The environment is built on key design principles of high-generality problem distributions, verification via external tools, and continuous difficulty control, which together provide a virtually infinite supply of novel training instances. Initial zero-shot evaluations with frontier LLMs confirm the difficulty of Reasoning Core's tasks, positioning it as a promising resource to improve the reasoning capabilities of future models.
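
The three design principles named above (procedural generation, verifiable rewards, and a continuous difficulty knob) can be illustrated with a minimal sketch for the equation-solving domain. This is not the Reasoning Core implementation: `Problem`, `generate`, `reward`, and the way `difficulty` maps to system size are all hypothetical names and choices made for illustration, and the checker here is plain substitution rather than an external tool.

```python
import random
from dataclasses import dataclass


@dataclass
class Problem:
    prompt: str              # text shown to the model
    coeffs: list[list[int]]  # coefficient matrix, one row per equation
    rhs: list[int]           # right-hand sides
    witness: list[int]       # one known solution, kept for testing only


def generate(difficulty: float, seed: int) -> Problem:
    """Sample a square integer linear system (hypothetical generator).

    `difficulty` in [0, 1] is an assumed continuous knob that scales
    both the number of unknowns and the coefficient range.
    """
    rng = random.Random(seed)
    n = 2 + int(difficulty * 4)      # 2 to 6 unknowns
    bound = 3 + int(difficulty * 7)  # coefficients lie in [-bound, bound]
    witness = [rng.randint(-bound, bound) for _ in range(n)]
    coeffs = [[rng.randint(-bound, bound) for _ in range(n)] for _ in range(n)]
    # Build the right-hand side from the witness so the system is solvable.
    rhs = [sum(a * x for a, x in zip(row, witness)) for row in coeffs]
    lines = [
        " + ".join(f"{a}*x{i}" for i, a in enumerate(row)) + f" = {b}"
        for row, b in zip(coeffs, rhs)
    ]
    return Problem("Solve for integers:\n" + "\n".join(lines), coeffs, rhs, witness)


def reward(problem: Problem, answer: list[int]) -> float:
    """Verifiable reward: 1.0 if `answer` satisfies every equation, else 0.0."""
    if len(answer) != len(problem.rhs):
        return 0.0
    ok = all(
        sum(a * x for a, x in zip(row, answer)) == b
        for row, b in zip(problem.coeffs, problem.rhs)
    )
    return 1.0 if ok else 0.0
```

Because the reward is computed by substitution, any assignment that satisfies the system is accepted, even when the sampled matrix happens to be singular and the solution is not unique. Fixing the seed makes each instance reproducible, while varying it gives an effectively unbounded stream of new instances at a chosen difficulty.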
