Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis

June 4, 2025
Authors: Kejian Zhu, Shangqing Tu, Zhuoran Jin, Lei Hou, Juanzi Li, Jun Zhao
cs.AI

Abstract

The development of large language models (LLMs) depends on trustworthy evaluation. However, most current evaluations rely on public benchmarks, which are prone to data contamination issues that significantly compromise fairness. Previous research has focused on constructing dynamic benchmarks to address contamination, but continuously building new benchmarks is costly and cyclical. In this work, we instead tackle contamination by analyzing the mechanisms of contaminated models themselves. Through our experiments, we discover that the overestimation of contaminated models likely stems from parameters acquiring shortcut solutions during training. We further propose a novel method for identifying shortcut neurons through comparative and causal analysis. Building on this, we introduce an evaluation method called shortcut neuron patching, which suppresses shortcut neurons. Experiments validate the effectiveness of our approach in mitigating contamination. Additionally, our evaluation results exhibit a strong linear correlation with MixEval, a recently released trustworthy benchmark, achieving a Spearman coefficient (rho) exceeding 0.95. This high correlation indicates that our method closely reveals the true capabilities of models and is trustworthy. Further experiments demonstrate the generalizability of our method across various benchmarks and hyperparameter settings. Code: https://github.com/GaryStack/Trustworthy-Evaluation
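
As a rough illustration of the patching step described in the abstract, the minimal PyTorch sketch below zeroes a given set of neuron activations via forward hooks at evaluation time. The model name (`gpt2`), the layer/neuron indices, and the hook placement are placeholder assumptions for illustration only; the paper derives the actual shortcut neuron set through its comparative and causal analysis (see the linked repository).

```python
# Illustrative sketch only: suppressing a chosen set of "shortcut neurons"
# during evaluation by zeroing their activations with PyTorch forward hooks.
# The neuron indices below are placeholders, NOT the paper's identified set.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in; any causal LM with a similar layout works
# Hypothetical output of the identification step: {layer_index: [neuron ids]}
SHORTCUT_NEURONS = {3: [17, 250], 7: [5]}

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model.eval()

def make_patch_hook(neuron_ids):
    """Zero the selected hidden units in this module's output (a crude
    stand-in for the paper's shortcut neuron patching)."""
    def hook(module, inputs, output):
        patched = output.clone()
        patched[..., neuron_ids] = 0.0  # suppress the shortcut neurons
        return patched  # returning a tensor replaces the module's output
    return hook

# Attach hooks to the MLP activation modules of the chosen layers (GPT-2 layout).
handles = []
for layer_idx, neuron_ids in SHORTCUT_NEURONS.items():
    mlp_act = model.transformer.h[layer_idx].mlp.act
    handles.append(mlp_act.register_forward_hook(make_patch_hook(neuron_ids)))

# Evaluate as usual; the patched neurons no longer contribute.
prompt = "Question: What is 2 + 2? Answer:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(out[0], skip_special_tokens=True))

for h in handles:
    h.remove()  # restore the unpatched model
```

In the paper's setting, the suppressed set would come from contrasting contaminated and clean models rather than being hard-coded, and the trustworthiness of the patched scores is validated against MixEval via Spearman correlation (computable with, e.g., `scipy.stats.spearmanr`).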