Self-Reflective Generation at Test Time

Abstract

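Large language models (LLMs) increasingly solve complex reasoning tasks through long chains of thought, but their forward-only autoregressive generation is fragile: an early token error can cascade, which makes the need for a self-reflection mechanism clear. Existing self-reflection, however, either revises a complete draft or learns self-correction through expensive training, so it is fundamentally reactive and inefficient. To address this, we propose SRGen (Self-Reflective Generation at Test Time), a lightweight framework that performs self-reflective generation at test time. During token generation, SRGen uses a dynamic entropy threshold to identify tokens with high uncertainty. For each identified token, SRGen trains a token-specific corrective vector that draws on the already generated context to correct the token probability distribution. By retrospectively analyzing the partial output, this self-reflection enables more trustworthy decisions and significantly reduces the probability of error at high-uncertainty points. Evaluated on challenging mathematical reasoning benchmarks and a variety of LLMs, SRGen consistently strengthens model reasoning: improvements in single-pass quality also carry over to stronger self-consistency voting. In particular, on AIME2024 with DeepSeek-R1-Distill-Qwen-7B, SRGen achieves absolute improvements of +12.0% on Pass@1 and +13.3% on Cons@5. Our findings further position SRGen as a plug-and-play method that integrates reflection into the generation process for reliable LLM reasoning. It delivers consistent gains with bounded overhead and composes broadly with other training-time (e.g., RLHF) and test-time (e.g., SLOT) techniques.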

English

Large language models (LLMs) increasingly solve complex reasoning tasks via long chains of thought, but their forward-only autoregressive generation process is fragile: early token errors can cascade, which creates a clear need for self-reflection mechanisms. However, existing self-reflection either revises full drafts or learns self-correction via expensive training, both of which are fundamentally reactive and inefficient. To address this, we propose Self-Reflective Generation at Test Time (SRGen), a lightweight test-time framework that reflects before generating at uncertain points. During token generation, SRGen uses dynamic entropy thresholding to identify high-uncertainty tokens. For each identified token, it trains a token-specific corrective vector that exploits the already generated context to correct the token probability distribution. By retrospectively analyzing the partial output, this self-reflection enables more trustworthy decisions, thereby significantly reducing the probability of errors at highly uncertain points. Evaluated on challenging mathematical reasoning benchmarks and a diverse set of LLMs, SRGen consistently strengthens model reasoning: improvements in single-pass quality also translate into stronger self-consistency voting. Notably, on AIME2024 with DeepSeek-R1-Distill-Qwen-7B, SRGen yields absolute improvements of +12.0% on Pass@1 and +13.3% on Cons@5. Moreover, our findings position SRGen as a plug-and-play method that integrates reflection into the generation process for reliable LLM reasoning, achieving consistent gains with bounded overhead and broad composability with other training-time (e.g., RLHF) and test-time (e.g., SLOT) techniques.
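
The uncertainty-detection step can be illustrated with a small sketch. The abstract does not give the form of the dynamic threshold, so the code below assumes one plausible rule: a step is flagged when its next-token entropy exceeds the mean plus k standard deviations of the entropies in a recent sliding window. The function names, the window size, and the value of k are hypothetical and chosen only for illustration; the corrective-vector optimization that SRGen runs at flagged steps is not shown.

```python
import math
from collections import deque
from statistics import mean, pstdev


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def flag_uncertain_steps(distributions, window=4, k=1.5, min_history=2):
    """Return indices of steps whose entropy exceeds a running threshold.

    Assumed rule (not taken from the paper): threshold = mean + k * std
    of the entropies observed in the last `window` steps.
    """
    history = deque(maxlen=window)
    flagged = []
    for step, probs in enumerate(distributions):
        h = token_entropy(probs)
        # Only compare once enough history exists to estimate a spread.
        if len(history) >= min_history:
            threshold = mean(history) + k * pstdev(history)
            if h > threshold:
                flagged.append(step)
        history.append(h)
    return flagged


# Toy next-token distributions over a 4-token vocabulary.
steps = [
    [0.97, 0.01, 0.01, 0.01],
    [0.94, 0.02, 0.02, 0.02],
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],  # near-uniform: the model is unsure here
    [0.94, 0.02, 0.02, 0.02],
]
print(flag_uncertain_steps(steps))  # -> [3]
```

Because the threshold tracks recent entropies rather than using a fixed cutoff, the same rule adapts to passages where the model is uniformly confident or uniformly hesitant. In a full pipeline, each flagged index would be the point at which a reflection step is triggered before the token is emitted.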