ChatPaper.ai

Who's Your Judge? On the Detectability of LLM-Generated Judgments

September 29, 2025
作者: Dawei Li, Zhen Tan, Chengshuai Zhao, Bohan Jiang, Baixiang Huang, Pingchuan Ma, Abdullah Alnaibari, Kai Shu, Huan Liu
cs.AI

Abstract

Large Language Model (LLM)-based judgments leverage powerful LLMs to evaluate candidate content efficiently and assign judgment scores. However, the inherent biases and vulnerabilities of LLM-generated judgments raise concerns, underscoring the urgent need to distinguish them in sensitive scenarios such as academic peer review. In this work, we propose and formalize the task of judgment detection and systematically investigate the detectability of LLM-generated judgments. Unlike LLM-generated text detection, judgment detection relies solely on judgment scores and candidates, reflecting real-world scenarios where textual feedback is often unavailable during detection. Our preliminary analysis shows that existing LLM-generated text detection methods perform poorly because they cannot capture the interaction between judgment scores and candidate content -- an aspect crucial for effective judgment detection. Motivated by this, we introduce J-Detector, a lightweight and transparent neural detector augmented with explicitly extracted linguistic and LLM-enhanced features that link LLM judges' biases to candidates' properties for accurate detection. Experiments across diverse datasets demonstrate the effectiveness of J-Detector and show how its interpretability enables quantifying biases in LLM judges. Finally, we analyze key factors affecting the detectability of LLM-generated judgments and validate the practical utility of judgment detection in real-world scenarios.
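To make the task setup concrete, the sketch below frames judgment detection as binary classification over (score, candidate) pairs, using score-content interaction terms as features -- the cue the abstract says text-only detectors miss. This is a minimal illustration only: the feature set (length, type-token ratio, and their products with the score) and the plain logistic-regression classifier are hypothetical stand-ins, not the actual linguistic and LLM-enhanced features or architecture of J-Detector.

```python
import math

def extract_features(score, candidate):
    """Hypothetical features linking a judgment score to candidate properties.
    Illustrative placeholders, not J-Detector's actual feature set."""
    words = candidate.split()
    n = max(len(words), 1)
    ttr = len(set(words)) / n              # type-token ratio (lexical diversity)
    avg_len = sum(len(w) for w in words) / n
    return [1.0,                            # bias term
            score,                          # the judgment score itself
            len(words) / 100.0,             # candidate length
            ttr,
            avg_len / 10.0,
            score * ttr,                    # score-content interaction terms:
            score * len(words) / 100.0]     # the signal text-only detectors miss

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, lr=0.5, epochs=300):
    """Logistic regression via SGD; data = [(score, candidate, label)],
    label 1 = LLM-generated judgment, 0 = human judgment."""
    w = [0.0] * len(extract_features(0.0, ""))
    for _ in range(epochs):
        for score, cand, y in data:
            x = extract_features(score, cand)
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]
    return w

def detect(w, score, candidate):
    """Probability that (score, candidate) came from an LLM judge."""
    x = extract_features(score, candidate)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

# Toy data imitating an LLM judge that inflates scores for verbose candidates.
toy = [
    (9.0, "a long and thorough candidate answer with many many words here", 1),
    (8.5, "another fairly verbose candidate response with plenty of words", 1),
    (6.0, "short answer", 0),
    (7.0, "a concise but well argued human scored reply", 0),
]
w = train(toy)
```

After training, `detect(w, score, candidate)` returns a probability in [0, 1]; the learned weights on the interaction terms are what a transparent detector of this kind would inspect to quantify a judge's bias.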