Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis

June 4, 2025
Authors: Kejian Zhu, Shangqing Tu, Zhuoran Jin, Lei Hou, Juanzi Li, Jun Zhao
cs.AI

Abstract

The development of large language models (LLMs) depends on trustworthy evaluation. However, most current evaluations rely on public benchmarks, which are prone to data contamination issues that significantly compromise fairness. Previous research has focused on constructing dynamic benchmarks to address contamination, but continuously building new benchmarks is costly and must be repeated cyclically. In this work, we instead tackle contamination by analyzing the mechanisms of contaminated models themselves. Through our experiments, we discover that the overestimation exhibited by contaminated models likely stems from parameters acquiring shortcut solutions during training. We further propose a novel method for identifying shortcut neurons through comparative and causal analysis. Building on this, we introduce an evaluation method called shortcut neuron patching that suppresses the identified shortcut neurons. Experiments validate the effectiveness of our approach in mitigating contamination. Additionally, our evaluation results correlate strongly with MixEval, a recently released trustworthy benchmark, achieving a Spearman rank coefficient (ρ) exceeding 0.95. This high correlation indicates that our method closely reveals the true capabilities of models and is trustworthy. We conduct further experiments to demonstrate the generalizability of our method across various benchmarks and hyperparameter settings. Code: https://github.com/GaryStack/Trustworthy-Evaluation
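The abstract does not spell out how "patching" suppresses a neuron, so the sketch below shows one common way such activation-level intervention can be done with PyTorch forward hooks on a LLaMA-style Hugging Face model: activations of the flagged neurons are overwritten during evaluation, then the hooks are removed afterward. Everything here — the `patch_shortcut_neurons` helper, the (layer → neuron indices) input format, zeroing as the replacement value, and the placeholder model id — is an illustrative assumption, not the implementation from the paper's repository.

```python
# Minimal sketch of activation-level "shortcut neuron patching".
# Assumes a LLaMA-style model from Hugging Face transformers, where each
# decoder layer exposes its MLP at model.model.layers[i].mlp. The helper
# name, the {layer: [neuron indices]} format, and zeroing as the
# suppression strategy are all hypothetical choices for illustration.

from transformers import AutoModelForCausalLM

def patch_shortcut_neurons(model, shortcut_neurons, replacement=0.0):
    """Register forward hooks that overwrite the MLP output activations
    of suspected shortcut neurons with a fixed replacement value."""
    handles = []
    for layer_idx, neuron_ids in shortcut_neurons.items():
        mlp = model.model.layers[layer_idx].mlp  # LLaMA-style layout (assumption)

        def hook(module, inputs, output, neuron_ids=neuron_ids):
            output = output.clone()                # avoid mutating autograd buffers
            output[..., neuron_ids] = replacement  # suppress the flagged neurons
            return output                          # returned tensor replaces the output

        handles.append(mlp.register_forward_hook(hook))
    return handles

# Placeholder model id and neuron indices, purely for demonstration.
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
handles = patch_shortcut_neurons(model, {3: [17, 512], 5: [42]})
# ... run benchmark inference with the patched model here ...
for h in handles:
    h.remove()  # restore the original, unpatched behavior
```

For the correlation check the abstract reports, the patched models' benchmark scores would be compared against their MixEval scores; `scipy.stats.spearmanr(patched_scores, mixeval_scores)` returns exactly the ρ statistic cited above.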