BOW: Bottlenecked Next Word Exploration

Abstract

Large language models (LLMs) are typically trained via next-word prediction (NWP), which yields strong surface-level fluency but often fails to support robust reasoning. We propose BOttlenecked next Word exploration (BOW), a novel reinforcement learning (RL) framework that rethinks NWP by introducing a reasoning bottleneck: a policy model first generates a reasoning path instead of predicting the next token directly, and a frozen judge model then predicts the next-token distribution based solely on this reasoning path. We train the policy model using GRPO, with rewards that quantify how effectively the reasoning path facilitates next-word recovery. Evaluated on various benchmarks, BOW improves both the general and next-word reasoning capabilities of the base model compared with other continual pretraining baselines. Our findings show that BOW can serve as an effective and scalable alternative to vanilla NWP.
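
To make the bottleneck concrete, the following is a minimal sketch of how one training step's reward signal might be computed. It is not the authors' implementation. It assumes, purely for illustration, that the reward for a reasoning path is the probability the frozen judge assigns to the gold next word given only that path, and that advantages are obtained by standardizing rewards within a group of sampled paths, as in GRPO. The names `bottleneck_reward` and `grpo_advantages`, the toy judge distributions, and the use of the population standard deviation are all hypothetical choices.

```python
import math
from typing import Dict, List


def bottleneck_reward(judge_dist: Dict[str, float], gold_word: str) -> float:
    """Hypothetical reward: probability the judge assigns to the gold next word.

    `judge_dist` stands in for the frozen judge's next-word distribution,
    conditioned only on one reasoning path (an assumption for this sketch).
    """
    return judge_dist.get(gold_word, 0.0)


def grpo_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Group-relative advantages: standardize rewards within one sampled group.

    Uses the population standard deviation; this is an illustrative choice.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy judge outputs for three reasoning paths sampled for the same context.
# In a real system these would come from a frozen language model.
judge_outputs = [
    {"Paris": 0.7, "London": 0.2, "Rome": 0.1},
    {"Paris": 0.4, "London": 0.4, "Rome": 0.2},
    {"Paris": 0.1, "London": 0.6, "Rome": 0.3},
]
rewards = [bottleneck_reward(d, "Paris") for d in judge_outputs]
advantages = grpo_advantages(rewards)
```

In this toy group, the path that leads the judge to favor the gold word receives a positive advantage and the path that leads it away receives a negative one, so the policy is pushed toward reasoning paths that help next-word recovery.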
