
Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts

November 6, 2025
Authors: Ellis Brown, Jihan Yang, Shusheng Yang, Rob Fergus, Saining Xie
cs.AI

Abstract

Robust benchmarks are crucial for evaluating Multimodal Large Language Models (MLLMs). Yet we find that models can ace many multimodal benchmarks without strong visual understanding, instead exploiting biases, linguistic priors, and superficial patterns. This is especially problematic for vision-centric benchmarks that are meant to require visual inputs. We adopt a diagnostic principle for benchmark design: if a benchmark can be gamed, it will be. Designers should therefore try to "game" their own benchmarks first, using diagnostic and debiasing procedures to systematically identify and mitigate non-visual biases. Effective diagnosis requires directly "training on the test set" -- probing the released test set for its intrinsic, exploitable patterns. We operationalize this standard with two components. First, we diagnose benchmark susceptibility using a "Test-set Stress-Test" (TsT) methodology. Our primary diagnostic tool involves fine-tuning a powerful Large Language Model via k-fold cross-validation on exclusively the non-visual, textual inputs of the test set to reveal shortcut performance and assign each sample a bias score s(x). We complement this with a lightweight Random Forest-based diagnostic operating on hand-crafted features for fast, interpretable auditing. Second, we debias benchmarks by filtering high-bias samples using an "Iterative Bias Pruning" (IBP) procedure. Applying this framework to four benchmarks -- VSI-Bench, CV-Bench, MMMU, and VideoMME -- we uncover pervasive non-visual biases. As a case study, we apply our full framework to create VSI-Bench-Debiased, demonstrating reduced non-visual solvability and a wider vision-blind performance gap than the original.
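The two-stage pipeline the abstract describes can be sketched in miniature. The sketch below substitutes the paper's lightweight Random Forest diagnostic for LLM fine-tuning (which would not fit a snippet), uses synthetic data in which half the samples carry a deliberate non-visual shortcut, and assumes illustrative names and a bias-score threshold not specified in the abstract.

```python
# Hedged sketch: Test-set Stress-Test (TsT) bias scoring via k-fold
# cross-validation, followed by Iterative Bias Pruning (IBP).
# The Random Forest stands in for the fine-tuned LLM; the 0.8
# threshold and the synthetic features are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)

# Synthetic "test set": hand-crafted textual features X, answer labels y.
# The first half of the samples leak the answer through feature 0,
# mimicking a non-visual shortcut.
n = 400
X = rng.normal(size=(n, 8))
y = rng.integers(0, 2, size=n)
leaky = np.arange(n) < n // 2
X[leaky, 0] = y[leaky] + rng.normal(scale=0.1, size=leaky.sum())

def tst_bias_scores(X, y, k=5):
    """TsT: k-fold CV on text-only features.
    s(x) = out-of-fold probability assigned to the true answer."""
    s = np.zeros(len(y))
    folds = StratifiedKFold(k, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(X, y):
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X[train_idx], y[train_idx])
        proba = clf.predict_proba(X[test_idx])
        s[test_idx] = proba[np.arange(len(test_idx)), y[test_idx]]
    return s

def iterative_bias_pruning(X, y, threshold=0.8, max_rounds=5):
    """IBP: re-score the surviving samples each round and drop
    those whose bias score s(x) exceeds the threshold."""
    keep = np.ones(len(y), dtype=bool)
    for _ in range(max_rounds):
        s = tst_bias_scores(X[keep], y[keep])
        high = s > threshold
        if not high.any():
            break
        idx = np.flatnonzero(keep)
        keep[idx[high]] = False
    return keep

keep = iterative_bias_pruning(X, y)
print(f"kept {keep.sum()}/{n} samples")
```

Re-scoring inside the loop matters: once the most blatant shortcut samples are removed, a fresh fit can no longer lean on them, so the scores of the remaining samples shift and a second pass may expose subtler shortcuts.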
PDF · December 2, 2025