Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry
January 30, 2026
Authors: Zhuochun Li, Yong Zhang, Ming Li, Yuelyu Ji, Yiming Zeng, Ning Cheng, Yun Zhu, Yanmeng Wang, Shaojun Wang, Jing Xiao, Daqing He
cs.AI
Abstract
Large language models (LLMs) are widely used as reference-free evaluators via prompting, but this "LLM-as-a-Judge" paradigm is costly, opaque, and sensitive to prompt design. In this work, we investigate whether smaller models can serve as efficient evaluators by leveraging internal representations instead of surface generation. We uncover a consistent empirical pattern: small LMs, despite their weak generative ability, encode rich evaluative signals in their hidden states. This motivates us to propose the Semantic Capacity Asymmetry Hypothesis: evaluation requires significantly less semantic capacity than generation and can be grounded in intermediate representations, suggesting that evaluation need not rely on large-scale generative models but can instead leverage latent features from smaller ones. Our findings motivate a paradigm shift from LLM-as-a-Judge to Representation-as-a-Judge, a decoding-free evaluation strategy that probes internal model structure rather than relying on prompted output. We instantiate this paradigm through INSPECTOR, a probing-based framework that predicts aspect-level evaluation scores from small model representations. Experiments on reasoning benchmarks (GSM8K, MATH, GPQA) show that INSPECTOR substantially outperforms prompting-based small LMs and closely approximates full LLM judges, while offering a more efficient, reliable, and interpretable alternative for scalable evaluation.