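Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

Abstract

Test-time inference has emerged as a powerful paradigm for enabling language models to "think" longer and more carefully about complex challenges, much like skilled human experts. While reinforcement learning (RL) can drive self-improvement in language models on verifiable tasks, some models exhibit substantial gains while others quickly plateau. For instance, we find that Qwen-2.5-3B far exceeds Llama-3.2-3B under identical RL training for the game of Countdown. This discrepancy raises a critical question: what intrinsic properties enable effective self-improvement? We introduce a framework to investigate this question by analyzing four key cognitive behaviors (verification, backtracking, subgoal setting, and backward chaining) that both expert human problem solvers and successful language models employ. Our study reveals that Qwen naturally exhibits these reasoning behaviors, whereas Llama initially lacks them. In systematic experiments with controlled behavioral datasets, we find that priming Llama with examples containing these reasoning behaviors enables substantial improvements during RL, matching or exceeding Qwen's performance. Importantly, the presence of reasoning behaviors, rather than the correctness of answers, proves to be the critical factor: models primed with incorrect solutions containing proper reasoning patterns achieve performance comparable to those trained on correct solutions. Finally, continued pretraining on OpenWebMath data, filtered to amplify reasoning behaviors, enables the Llama model to match Qwen's self-improvement trajectory. Our findings establish a fundamental relationship between initial reasoning behaviors and the capacity for improvement, explaining why some language models effectively utilize additional computation while others plateau.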
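To make the notion of a "verifiable task" concrete, the following is a minimal sketch of how a Countdown answer could be checked automatically. The abstract does not specify the exact rules or reward used in the experiments, so everything here is an assumption for illustration: the function names `countdown_reward` and `_evaluate` are hypothetical, the rule that each given number must be used exactly once is assumed, and the reward is assumed to be binary.

```python
import ast
from collections import Counter
from fractions import Fraction

# Supported binary operators (assumption: the four basic arithmetic operations).
_OPS = {
    ast.Add: lambda a, b: a + b,
    ast.Sub: lambda a, b: a - b,
    ast.Mult: lambda a, b: a * b,
    ast.Div: lambda a, b: a / b,
}


def _evaluate(node, used):
    """Evaluate an arithmetic AST exactly, recording every integer literal in `used`."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    # Anything else (names, calls, unary operators, ...) is rejected.
    raise ValueError("unsupported expression")


def countdown_reward(expression, numbers, target):
    """Return 1.0 if `expression` hits `target` using each number exactly once, else 0.0.

    Hypothetical reward function; the rules are assumed, not taken from the paper.
    """
    try:
        tree = ast.parse(expression, mode="eval")
        used = []
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0
    if Counter(used) != Counter(numbers):
        return 0.0
    return 1.0 if value == target else 0.0
```

Under these assumed rules, `countdown_reward("(25 - 5) * 3", [25, 5, 3], 60)` scores a correct answer, while an expression that misses the target, reuses a number, or divides by zero scores nothing. Because correctness can be checked mechanically like this, an RL loop can reward a model without human labels; the reasoning behaviors the abstract describes (verification, backtracking, and so on) would appear in the model's intermediate text, which this checker does not inspect.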

