

Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry

January 30, 2026
作者: Zhuochun Li, Yong Zhang, Ming Li, Yuelyu Ji, Yiming Zeng, Ning Cheng, Yun Zhu, Yanmeng Wang, Shaojun Wang, Jing Xiao, Daqing He
cs.AI

Abstract

Large language models (LLMs) are widely used as reference-free evaluators via prompting, but this "LLM-as-a-Judge" paradigm is costly, opaque, and sensitive to prompt design. In this work, we investigate whether smaller models can serve as efficient evaluators by leveraging internal representations instead of surface generation. We uncover a consistent empirical pattern: small LMs, despite their weak generative ability, encode rich evaluative signals in their hidden states. This motivates us to propose the Semantic Capacity Asymmetry Hypothesis: evaluation requires significantly less semantic capacity than generation and can be grounded in intermediate representations, suggesting that evaluation need not rely on large-scale generative models but can instead leverage latent features from smaller ones. Our findings motivate a paradigm shift from LLM-as-a-Judge to Representation-as-a-Judge, a decoding-free evaluation strategy that probes internal model structure rather than relying on prompted output. We instantiate this paradigm through INSPECTOR, a probing-based framework that predicts aspect-level evaluation scores from small model representations. Experiments on reasoning benchmarks (GSM8K, MATH, GPQA) show that INSPECTOR substantially outperforms prompting-based small LMs and closely approximates full LLM judges, while offering a more efficient, reliable, and interpretable alternative for scalable evaluation.
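To make the decoding-free, probing-based idea concrete, below is a minimal sketch of training a linear probe on hidden states of a small LM to regress an evaluation score, rather than prompting the model to generate a judgment. The model name, probed layer, mean pooling, and linear probe architecture are all illustrative assumptions for this sketch; the abstract does not specify INSPECTOR's actual design.

# Minimal sketch of a "Representation-as-a-Judge" probe (assumptions noted above).
# No decoding happens: we read a mid-layer hidden state and train a small probe on it.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B"  # hypothetical small LM; any small decoder works
LAYER = 12                        # hypothetical middle layer to probe

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
encoder.eval()  # the small LM stays frozen; only the probe is trained

@torch.no_grad()
def extract_features(question: str, answer: str) -> torch.Tensor:
    """Mean-pool the chosen layer's hidden states over the token sequence."""
    inputs = tokenizer(question + "\n" + answer, return_tensors="pt", truncation=True)
    hidden = encoder(**inputs).hidden_states[LAYER]  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)             # (dim,)

# Linear probe mapping the pooled representation to a scalar quality score.
probe = torch.nn.Linear(encoder.config.hidden_size, 1)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

def train_step(question: str, answer: str, gold_score: float) -> float:
    """One supervised step: regress the probe's prediction toward a gold score."""
    features = extract_features(question, answer)
    pred = probe(features).squeeze()
    loss = loss_fn(pred, torch.tensor(gold_score))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

At inference, scoring a candidate answer is a single frozen forward pass plus one matrix-vector product, which is the source of the efficiency claim relative to prompting a large judge model to generate a verdict token by token.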