CLUE: Non-parametric Verification from Experience via Hidden-State Clustering

Abstract

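Assessing the output quality of large language models (LLMs) has become an important challenge. Existing methods either rely on text-level information (e.g., reward models, majority voting), which can overfit to superficial cues, or on calibrated confidence derived from token probabilities, which can fail on poorly calibrated models. Both kinds of signal, however, are only partial projections of a richer source of information: the model's internal hidden states. Early layers, being close to the token embeddings, preserve the semantic and lexical features that underpin text-based judgments, whereas later layers align increasingly with the output logits and carry confidence-related information. This paper explores hidden states directly as a unified foundation for verification. We show that the correctness of a solution is encoded as a geometrically separable signature in the trajectory of hidden activations. To validate this, we present CLUE (Clustering and Experience-based Verification), a deliberately minimalist, non-parametric verifier. With no trainable parameters, CLUE summarizes each reasoning trace as a hidden-state delta and classifies correctness by nearest-centroid distance to "success" and "failure" clusters formed from past experience. The simplicity of the method highlights the strength of the underlying signal. Empirically, CLUE consistently outperforms LLM-as-a-judge baselines and matches or exceeds modern confidence-based methods at reranking candidates on AIME 24/25 and GPQA. Notably, on AIME 24 with a 1.5B model, CLUE raises accuracy from 56.7% (majority@64) to 70.0% (top-maj@16).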

English

Assessing the quality of Large Language Model (LLM) outputs presents a critical challenge. Previous methods either rely on text-level information (e.g., reward models, majority voting), which can overfit to superficial cues, or on calibrated confidence from token probabilities, which can fail on less-calibrated models. Yet both of these signals are, in fact, partial projections of a richer source of information: the model's internal hidden states. Early layers, closer to token embeddings, preserve semantic and lexical features that underpin text-based judgments, while later layers increasingly align with output logits, embedding confidence-related information. This paper explores hidden states directly as a unified foundation for verification. We show that the correctness of a solution is encoded as a geometrically separable signature within the trajectory of hidden activations. To validate this, we present CLUE (Clustering and Experience-based Verification), a deliberately minimalist, non-parametric verifier. With no trainable parameters, CLUE simply summarizes each reasoning trace by a hidden-state delta and classifies correctness via nearest-centroid distance to "success" and "failure" clusters formed from past experience. The simplicity of this method highlights the strength of the underlying signal. Empirically, CLUE consistently outperforms LLM-as-a-judge baselines and matches or exceeds modern confidence-based methods in reranking candidates, improving both top-1 and majority-vote accuracy across AIME 24/25 and GPQA. As a highlight, on AIME 24 with a 1.5B model, CLUE boosts accuracy from 56.7% (majority@64) to 70.0% (top-maj@16).
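
The nearest-centroid idea described in the abstract can be illustrated with a minimal sketch. Several details here are assumptions made for illustration and are not stated in the abstract: the delta is taken as the element-wise difference between a final and an initial hidden-state vector, distances are Euclidean, and reranking keeps the k candidates closest to the "success" centroid before a majority vote. The function names (`delta`, `centroid`, `classify`, `top_maj`) and the toy two-dimensional vectors are hypothetical.

```python
import math
from collections import Counter

def delta(start, end):
    # Assumed definition: element-wise difference between the final
    # and the initial hidden-state vector of a reasoning trace.
    return [e - s for s, e in zip(start, end)]

def centroid(vectors):
    # Mean vector of a cluster built from past experience.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def classify(d, success_c, failure_c):
    # Nearest-centroid rule: True if the delta lies at least as close
    # to the "success" centroid as to the "failure" centroid.
    return math.dist(d, success_c) <= math.dist(d, failure_c)

def top_maj(candidates, success_c, k):
    # Assumed reranking: keep the k candidates whose deltas are closest
    # to the "success" centroid, then majority-vote over their answers.
    ranked = sorted(candidates, key=lambda c: math.dist(c[1], success_c))
    votes = Counter(answer for answer, _ in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy "experience": deltas from traces already labeled correct or incorrect.
success_c = centroid([[1.0, 0.0], [1.0, 2.0]])
failure_c = centroid([[-1.0, 0.0], [-1.0, -2.0]])

# Classify two new traces from their start and end hidden states.
good = classify(delta([0.0, 0.0], [2.0, 1.0]), success_c, failure_c)
bad = classify(delta([0.0, 0.0], [-0.5, -1.0]), success_c, failure_c)

# Rerank five sampled answers: a plain majority vote would pick "7",
# while voting among the two closest-to-success candidates picks "42".
candidates = [
    ("42", [1.0, 1.0]), ("42", [1.5, 0.5]),
    ("7", [-1.0, -1.0]), ("7", [-2.0, 0.0]), ("7", [-1.0, 0.0]),
]
plain_majority = Counter(a for a, _ in candidates).most_common(1)[0][0]
reranked = top_maj(candidates, success_c, k=2)
```

Nothing in this sketch is fitted by gradient descent: the only stored state is the two centroids, which matches the "no trainable parameters" framing, although the actual summarization of hidden states in CLUE may differ from the simple difference used here.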

CLUE: Non-parametric Verification from Experience via Hidden-State Clustering
