Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models

Abstract

Large reasoning models (LRMs) already possess a latent capacity for long chain-of-thought reasoning. Prior work has shown that outcome-based reinforcement learning (RL) can incidentally elicit advanced reasoning behaviors such as self-correction, backtracking, and verification, phenomena often referred to as the model's "aha moment". However, the timing and consistency of these emergent behaviors remain unpredictable and uncontrollable, which limits the scalability and reliability of LRMs' reasoning capabilities. To address these limitations, we move beyond reliance on prompts and coincidental "aha moments". Instead, we explicitly align models with three meta-abilities (deduction, induction, and abduction) using automatically generated, self-verifiable tasks. Our three-stage pipeline (individual alignment, parameter-space merging, and domain-specific reinforcement learning) boosts performance by over 10% relative to instruction-tuned baselines. Furthermore, domain-specific RL from the aligned checkpoint yields an additional 2% average gain in the performance ceiling across math, coding, and science benchmarks, demonstrating that explicit meta-ability alignment offers a scalable and dependable foundation for reasoning. Code is available at: https://github.com/zhiyuanhubj/Meta-Ability-Alignment
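To make the "parameter-space merging" stage more concrete, the sketch below shows one common way such a merge can be done: a weighted sum of the parameters of separately aligned checkpoints. This is an illustration under stated assumptions, not the authors' implementation. The abstract does not specify the merging rule, so the linear combination, the function name `merge_checkpoints`, the toy parameter dictionaries, and the mixing weights are all hypothetical.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def merge_checkpoints(checkpoints: Dict[str, Params],
                      weights: Dict[str, float]) -> Params:
    """Return the weighted sum of several checkpoints, parameter by parameter.

    Assumption: linear merging. The source does not state which rule is used.
    """
    names = list(checkpoints)
    if set(names) != set(weights):
        raise ValueError("each checkpoint needs exactly one weight")

    reference = checkpoints[names[0]]
    merged: Params = {}
    for key, ref_values in reference.items():
        merged[key] = [0.0] * len(ref_values)
        for name in names:
            values = checkpoints[name][key]
            if len(values) != len(ref_values):
                raise ValueError(f"shape mismatch for {key!r} in {name!r}")
            for i, value in enumerate(values):
                merged[key][i] += weights[name] * value
    return merged


# Hypothetical toy checkpoints, one per meta-ability.
specialists = {
    "deduction": {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]},
    "induction": {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]},
    "abduction": {"layer.weight": [5.0, 6.0], "layer.bias": [2.0]},
}
# Hypothetical mixing weights, chosen only for illustration.
mix = {"deduction": 0.5, "induction": 0.25, "abduction": 0.25}

merged = merge_checkpoints(specialists, mix)
print(merged)
```

In a real setting the lists would be tensors from each specialist's state dict, and the mixing weights would be tuned on held-out data. The merged checkpoint would then serve as the starting point for the domain-specific RL stage.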
