First Return, Entropy-Eliciting Explore

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) improves the reasoning abilities of Large Language Models (LLMs), but it struggles with unstable exploration. We propose FR3E (First Return, Entropy-Eliciting Explore), a structured exploration framework that identifies high-uncertainty decision points in reasoning trajectories and performs targeted rollouts to construct semantically grounded intermediate feedback. Our method provides targeted guidance without relying on dense supervision. Empirical results on mathematical reasoning benchmarks (AIME24) show that FR3E promotes more stable training, produces longer and more coherent responses, and increases the proportion of fully correct trajectories. These results highlight the framework's effectiveness in improving LLM reasoning through more robust and structured exploration.
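
The abstract does not say how high-uncertainty decision points are scored or how intermediate feedback is computed. As a rough illustration only, the sketch below assumes that uncertainty is the Shannon entropy of the per-token next-token distribution, that the top-k entropy positions are taken as decision points, and that feedback for each point is the mean verifiable reward over rollouts restarted there. The function names, the `rollout_reward` callable, and the parameters `k` and `n_rollouts` are hypothetical and not taken from the paper.

```python
import math


def token_entropies(prob_rows):
    """Shannon entropy (in nats) of each per-token next-token distribution."""
    return [-sum(p * math.log(p) for p in row if p > 0.0) for row in prob_rows]


def high_entropy_positions(prob_rows, k):
    """Indices of the k highest-entropy positions, in trajectory order.

    Assumption: these stand in for the "high-uncertainty decision points".
    """
    ent = token_entropies(prob_rows)
    top = sorted(range(len(ent)), key=lambda i: ent[i], reverse=True)[:k]
    return sorted(top)


def estimate_prefix_values(positions, rollout_reward, n_rollouts=4):
    """Mean verifiable reward of rollouts restarted from each position.

    rollout_reward(pos, i) is a hypothetical callable: it should sample the
    i-th continuation from the prefix ending at pos and return 1 (or True)
    if a verifier accepts the final answer, else 0 (or False).
    """
    return {
        pos: sum(rollout_reward(pos, i) for i in range(n_rollouts)) / n_rollouts
        for pos in positions
    }


if __name__ == "__main__":
    # Toy distributions for a four-token trajectory.
    probs = [
        [1.0, 0.0],                # fully confident step
        [0.5, 0.5],                # uncertain step
        [0.9, 0.1],                # mostly confident step
        [0.25, 0.25, 0.25, 0.25],  # most uncertain step
    ]
    points = high_entropy_positions(probs, k=2)
    # Stub verifier: pretend only rollouts from position 3 succeed.
    values = estimate_prefix_values(points, lambda pos, i: pos == 3)
    print(points, values)
```

In a real training loop, `rollout_reward` would call the policy model and an answer checker. The stub here only makes the example self-contained.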