Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts
November 6, 2025
Authors: Ellis Brown, Jihan Yang, Shusheng Yang, Rob Fergus, Saining Xie
cs.AI
Abstract
Robust benchmarks are crucial for evaluating Multimodal Large Language Models (MLLMs). Yet we find that models can ace many multimodal benchmarks without strong visual understanding, instead exploiting biases, linguistic priors, and superficial patterns. This is especially problematic for vision-centric benchmarks that are meant to require visual inputs. We adopt a diagnostic principle for benchmark design: if a benchmark can be gamed, it will be. Designers should therefore try to "game" their own benchmarks first, using diagnostic and debiasing procedures to systematically identify and mitigate non-visual biases. Effective diagnosis requires directly "training on the test set": probing the released test set for its intrinsic, exploitable patterns.

We operationalize this standard with two components. First, we diagnose benchmark susceptibility using a "Test-set Stress-Test" (TsT) methodology. Our primary diagnostic tool fine-tunes a powerful Large Language Model via k-fold cross-validation exclusively on the non-visual, textual inputs of the test set, revealing shortcut performance and assigning each sample a bias score s(x) (see the sketch below). We complement this with a lightweight Random Forest-based diagnostic operating on hand-crafted features for fast, interpretable auditing. Second, we debias benchmarks by filtering high-bias samples using an "Iterative Bias Pruning" (IBP) procedure (also sketched below). Applying this framework to four benchmarks (VSI-Bench, CV-Bench, MMMU, and VideoMME), we uncover pervasive non-visual biases. As a case study, we apply the full framework to create VSI-Bench-Debiased, which shows reduced non-visual solvability and a wider vision-blind performance gap than the original benchmark.
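
To make the TsT diagnostic concrete, here is a minimal sketch of the k-fold loop. The `finetune_fn` callable is a hypothetical stand-in for the actual LLM fine-tuning pipeline, and s(x) is taken to be the held-out text-only correctness of each sample; both are assumptions for illustration, not details given in the abstract.

```python
# Minimal TsT sketch: k-fold cross-validation on the test set's
# non-visual inputs only. `finetune_fn` is a hypothetical stand-in
# for an LLM fine-tuning pipeline; it takes training samples and
# returns a predictor `predict_fn(sample) -> answer`.
from sklearn.model_selection import KFold

def tst_bias_scores(samples, finetune_fn, n_splits=5, seed=0):
    """Assign each test-set sample a bias score s(x) via k-fold CV.

    `samples` is a list of dicts holding only textual fields, e.g.
    {"question": ..., "options": [...], "answer": ...}; no visual inputs.
    """
    scores = [0.0] * len(samples)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, heldout_idx in kfold.split(samples):
        # Fine-tune a text-only model on k-1 folds of the *test set*.
        predict_fn = finetune_fn([samples[i] for i in train_idx])
        # Assumption: s(x) = 1 if the held-out sample is answerable
        # from text alone, i.e. the model found a non-visual shortcut.
        for i in heldout_idx:
            scores[i] = float(predict_fn(samples[i]) == samples[i]["answer"])
    return scores  # high s(x) => exploitable non-visual bias
```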
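
The lightweight Random Forest audit might look like the sketch below. The hand-crafted features shown (question length, option count, longest-option index) are illustrative guesses, not the paper's actual feature set.

```python
# Lightweight Random Forest diagnostic over hand-crafted text features.
# Out-of-fold predictions give a fast, interpretable shortcut baseline.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def featurize(sample):
    # Hypothetical non-visual features; any cheap text statistics work.
    opts = sample["options"]
    return [
        len(sample["question"].split()),           # question length
        len(opts),                                 # number of options
        float(np.argmax([len(o) for o in opts])),  # index of longest option
    ]

def rf_bias_scores(samples, n_folds=5, seed=0):
    X = np.array([featurize(s) for s in samples])
    # Assumes each sample's answer string appears among its options.
    y = np.array([s["options"].index(s["answer"]) for s in samples])
    clf = RandomForestClassifier(n_estimators=200, random_state=seed)
    preds = cross_val_predict(clf, X, y, cv=n_folds)
    return (preds == y).astype(float)  # per-sample shortcut solvability
```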
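
Finally, a sketch of how Iterative Bias Pruning could alternate scoring and filtering. The chance-level stopping rule and the 10% per-round pruning fraction are assumptions chosen for illustration; only the iterate-score-prune structure comes from the abstract.

```python
# IBP sketch: repeatedly score the remaining samples, drop the most
# bias-exposed ones, and re-diagnose, since pruning shifts the data
# distribution and stale scores would be misleading.
def iterative_bias_pruning(samples, score_fn, chance=0.25,
                           margin=0.05, prune_frac=0.1, max_rounds=10):
    kept = list(samples)
    for _ in range(max_rounds):
        scores = score_fn(kept)  # e.g. tst_bias_scores or rf_bias_scores
        shortcut_acc = sum(scores) / len(scores)
        if shortcut_acc <= chance + margin:
            break  # non-visual solvability is down to ~chance level
        n_prune = max(1, int(prune_frac * len(kept)))
        ranked = sorted(range(len(kept)), key=lambda i: scores[i],
                        reverse=True)
        drop = set(ranked[:n_prune])
        kept = [s for i, s in enumerate(kept) if i not in drop]
    return kept  # debiased benchmark subset
```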