Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon
February 11, 2025
Authors: Nurit Cohen-Inger, Yehonatan Elisha, Bracha Shapira, Lior Rokach, Seffi Cohen
cs.AI
Abstract
Large language models (LLMs) often appear to excel on public benchmarks, but
these high scores may mask an overreliance on dataset-specific surface cues
rather than true language understanding. We introduce the Chameleon Benchmark
Overfit Detector (C-BOD), a meta-evaluation framework that systematically
distorts benchmark prompts via a parametric transformation and detects
overfitting of LLMs. By rephrasing inputs while preserving their semantic
content and labels, C-BOD exposes whether a model's performance is driven by
memorized patterns. Evaluated on the MMLU benchmark using 26 leading LLMs, our
method reveals an average performance degradation of 2.15% under modest
perturbations, with 20 out of 26 models exhibiting statistically significant
differences. Notably, models with higher baseline accuracy exhibit larger
performance differences under perturbation, and larger LLMs tend to be more
sensitive to rephrasings, indicating that both may overrely on fixed
prompt patterns. In contrast, the Llama family and models with lower baseline
accuracy show insignificant degradation, suggesting reduced dependency on
superficial cues. Moreover, C-BOD's dataset- and model-agnostic design allows
easy integration into training pipelines to promote more robust language
understanding. Our findings challenge the community to look beyond leaderboard
scores and prioritize resilience and generalization in LLM evaluation.
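The abstract describes C-BOD's core loop: rephrase each benchmark prompt while preserving its semantic content and label, re-score the model on both forms, and test whether the paired accuracy gap is statistically significant. Below is a minimal sketch of that loop, assuming hypothetical `model` and `rephrase` callables; an exact McNemar test on the paired outcomes stands in for the paper's significance test, and none of this is the authors' implementation.

```python
# A minimal sketch of the C-BOD idea described above, not the authors'
# implementation. Assumed (hypothetical) interfaces: `model(prompt)` returns
# the LLM's answer, and `rephrase(prompt)` is any meaning- and label-
# preserving paraphraser standing in for the paper's parametric
# transformation.
from scipy.stats import binomtest


def cbod_style_check(model, dataset, rephrase, alpha=0.05):
    """dataset: iterable of (prompt, label) pairs.

    Scores each item on its original and rephrased prompt, then runs an
    exact McNemar test on the paired outcomes to ask whether the
    accuracy gap could plausibly be chance.
    """
    orig_correct, pert_correct = [], []
    for prompt, label in dataset:
        orig_correct.append(model(prompt) == label)
        pert_correct.append(model(rephrase(prompt)) == label)

    # Discordant pairs drive McNemar's test: items answered correctly
    # in exactly one of the two prompt forms.
    b = sum(o and not p for o, p in zip(orig_correct, pert_correct))
    c = sum(p and not o for o, p in zip(orig_correct, pert_correct))

    drop = (sum(orig_correct) - sum(pert_correct)) / len(orig_correct)
    # Under H0 (no reliance on surface form), flips are equally likely
    # in both directions, so b ~ Binomial(b + c, 0.5).
    p_value = binomtest(b, b + c, 0.5).pvalue if (b + c) else 1.0
    return drop, p_value, p_value < alpha


# Example with stand-in objects (hypothetical names):
#   data = [("2 + 2 = ? A) 3 B) 4", "B"), ...]
#   drop, p, overfit = cbod_style_check(my_llm, data, my_paraphraser)
#   print(f"accuracy drop: {drop:.2%}, significant: {overfit} (p={p:.3g})")
```

Because the check only needs paired correctness outcomes, it is dataset- and model-agnostic, which is what allows a C-BOD-style detector to slot into existing evaluation or training pipelines as the abstract suggests.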