Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry

Abstract

Large language models (LLMs) are widely used as reference-free evaluators via prompting, but this "LLM-as-a-Judge" paradigm is costly, opaque, and sensitive to prompt design. In this work, we investigate whether smaller models can serve as efficient evaluators by leveraging their internal representations instead of surface generation. We uncover a consistent empirical pattern: small LMs, despite their weak generative ability, encode rich evaluative signals in their hidden states. This motivates the Semantic Capacity Asymmetry Hypothesis: evaluation requires significantly less semantic capacity than generation and can be grounded in intermediate representations. Evaluation therefore does not need to rely on large-scale generative models and can instead leverage latent features from smaller ones. Our findings motivate a paradigm shift from LLM-as-a-Judge to Representation-as-a-Judge, a decoding-free evaluation strategy that probes internal model structure rather than relying on prompted output. We instantiate this paradigm through INSPECTOR, a probing-based framework that predicts aspect-level evaluation scores from small-model representations. Experiments on reasoning benchmarks (GSM8K, MATH, GPQA) show that INSPECTOR substantially outperforms prompting-based small LMs and closely approximates full LLM judges, while offering a more efficient, reliable, and interpretable alternative for scalable evaluation.
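As a rough illustration of the probing idea described in the abstract, the sketch below fits a logistic-regression probe on fixed feature vectors and reads a score directly from them, with no decoding step. It is not the INSPECTOR implementation. The feature vectors are synthetic stand-ins for pooled hidden states of a small LM, the binary label is an assumed single evaluation aspect, and every function name (`make_synthetic_states`, `train_linear_probe`, `probe_score`, `accuracy`) is hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_states(n, dim, seed):
    """Hypothetical stand-in for pooled hidden states of a small LM.

    Each vector is Gaussian noise shifted along a fixed direction whose sign
    depends on the label (1 = judged acceptable on one aspect, 0 = not).
    """
    rng = random.Random(seed)
    direction = [1.0 if j % 2 == 0 else -1.0 for j in range(dim)]
    states, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        sign = 1.0 if y == 1 else -1.0
        states.append([rng.gauss(0.0, 1.0) + sign * d for d in direction])
        labels.append(y)
    return states, labels


def probe_score(weights, bias, state):
    """Probability that the probe assigns to the positive label."""
    z = sum(w * x for w, x in zip(weights, state)) + bias
    return sigmoid(z)


def train_linear_probe(states, labels, epochs=200, lr=0.1):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim = len(states[0])
    n = len(states)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for state, y in zip(states, labels):
            err = probe_score(weights, bias, state) - y
            for j in range(dim):
                grad_w[j] += err * state[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def accuracy(weights, bias, states, labels):
    """Fraction of examples where the thresholded probe matches the label."""
    hits = sum(
        int((probe_score(weights, bias, s) >= 0.5) == (y == 1))
        for s, y in zip(states, labels)
    )
    return hits / len(states)


if __name__ == "__main__":
    train_x, train_y = make_synthetic_states(200, 8, seed=1)
    test_x, test_y = make_synthetic_states(200, 8, seed=2)
    w, b = train_linear_probe(train_x, train_y)
    print("held-out probe accuracy:", accuracy(w, b, test_x, test_y))
```

In a real setting the synthetic vectors would be replaced by hidden states extracted from a small model, and the labels by aspect-level scores from a reference judge. Both substitutions are assumptions about how such a probe could be set up, not details taken from the abstract.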
