Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon
February 11, 2025
Authors: Nurit Cohen-Inger, Yehonatan Elisha, Bracha Shapira, Lior Rokach, Seffi Cohen
cs.AI
Abstract
Large language models (LLMs) often appear to excel on public benchmarks, but
these high scores may mask an overreliance on dataset-specific surface cues
rather than true language understanding. We introduce the Chameleon Benchmark
Overfit Detector (C-BOD), a meta-evaluation framework that systematically
distorts benchmark prompts via a parametric transformation and detects
overfitting of LLMs. By rephrasing inputs while preserving their semantic
content and labels, C-BOD exposes whether a model's performance is driven by
memorized patterns. Evaluated on the MMLU benchmark using 26 leading LLMs, our
method reveals an average performance degradation of 2.15% under modest
perturbations, with 20 out of 26 models exhibiting statistically significant
differences. Notably, models with higher baseline accuracy exhibit larger
performance differences under perturbation, and larger LLMs tend to be more
sensitive to rephrasings, indicating that both may overrely on fixed
prompt patterns. In contrast, the Llama family and models with lower baseline
accuracy show insignificant degradation, suggesting reduced dependency on
superficial cues. Moreover, C-BOD's dataset- and model-agnostic design allows
easy integration into training pipelines to promote more robust language
understanding. Our findings challenge the community to look beyond leaderboard
scores and prioritize resilience and generalization in LLM evaluation.
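The abstract describes C-BOD's core loop: rephrase each benchmark prompt while preserving its semantic content and label, re-score the model on both forms, and test whether the paired accuracy gap is statistically significant. Below is a minimal sketch of that loop, assuming hypothetical `model` and `rephrase` callables; an exact McNemar test on the paired outcomes stands in for the paper's significance test, and none of this is the authors' implementation.

```python
# A minimal sketch of the C-BOD idea described above, not the authors'
# implementation. Assumed (hypothetical) interfaces: `model(prompt)` returns
# the LLM's answer, and `rephrase(prompt)` is any meaning- and label-
# preserving paraphraser standing in for the paper's parametric
# transformation.
from scipy.stats import binomtest


def cbod_style_check(model, dataset, rephrase, alpha=0.05):
    """dataset: iterable of (prompt, label) pairs.

    Scores each item on its original and rephrased prompt, then runs an
    exact McNemar test on the paired outcomes to ask whether the
    accuracy gap could plausibly be chance.
    """
    orig_correct, pert_correct = [], []
    for prompt, label in dataset:
        orig_correct.append(model(prompt) == label)
        pert_correct.append(model(rephrase(prompt)) == label)

    # Discordant pairs drive McNemar's test: items answered correctly
    # in exactly one of the two prompt forms.
    b = sum(o and not p for o, p in zip(orig_correct, pert_correct))
    c = sum(p and not o for o, p in zip(orig_correct, pert_correct))

    drop = (sum(orig_correct) - sum(pert_correct)) / len(orig_correct)
    # Under H0 (no reliance on surface form), flips are equally likely
    # in both directions, so b ~ Binomial(b + c, 0.5).
    p_value = binomtest(b, b + c, 0.5).pvalue if (b + c) else 1.0
    return drop, p_value, p_value < alpha


# Example with stand-in objects (hypothetical names):
#   data = [("2 + 2 = ? A) 3 B) 4", "B"), ...]
#   drop, p, overfit = cbod_style_check(my_llm, data, my_paraphraser)
#   print(f"accuracy drop: {drop:.2%}, significant: {overfit} (p={p:.3g})")
```

Because the check only needs paired correctness outcomes, it is dataset- and model-agnostic, which is what allows a C-BOD-style detector to slot into existing evaluation or training pipelines as the abstract suggests.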