Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts

Abstract

Robust benchmarks are crucial for evaluating Multimodal Large Language Models (MLLMs). Yet we find that models can ace many multimodal benchmarks without strong visual understanding, instead exploiting biases, linguistic priors, and superficial patterns. This is especially problematic for vision-centric benchmarks, which are meant to require visual inputs. We adopt a diagnostic principle for benchmark design: if a benchmark can be gamed, it will be. Designers should therefore try to "game" their own benchmarks first, using diagnostic and debiasing procedures to systematically identify and mitigate non-visual biases. Effective diagnosis requires directly "training on the test set", that is, probing the released test set for its intrinsic, exploitable patterns. We operationalize this standard with two components. First, we diagnose benchmark susceptibility using a "Test-set Stress-Test" (TsT) methodology. Our primary diagnostic tool fine-tunes a powerful Large Language Model via k-fold cross-validation exclusively on the non-visual, textual inputs of the test set, to reveal shortcut performance and assign each sample a bias score s(x). We complement this with a lightweight Random Forest-based diagnostic that operates on hand-crafted features for fast, interpretable auditing. Second, we debias benchmarks by filtering out high-bias samples with an "Iterative Bias Pruning" (IBP) procedure. Applying this framework to four benchmarks (VSI-Bench, CV-Bench, MMMU, and VideoMME), we uncover pervasive non-visual biases. As a case study, we apply our full framework to create VSI-Bench-Debiased, which shows reduced non-visual solvability and a wider vision-blind performance gap than the original.
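The two components can be illustrated with a minimal sketch. It is not the authors' implementation: a simple answer-frequency model stands in for the fine-tuned LLM and the Random Forest diagnostic, and the names `tst_scores` and `iterative_bias_pruning`, the `(text_feature, answer)` sample format, the definition of s(x) as the out-of-fold frequency of the true answer, and the `threshold`, `batch`, and `max_rounds` parameters are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def tst_scores(samples, k=5):
    """Assign each sample an out-of-fold bias score s(x) in [0, 1].

    `samples` is a list of (text_feature, answer) pairs with no visual input.
    Stand-in blind model (an assumption, not the paper's LLM or Random
    Forest): the score is how often the sample's true answer occurs among
    training-fold samples that share its text feature.
    """
    scores = [0.0] * len(samples)
    for fold in range(k):
        # "Train on the test set": fit on every sample outside this fold.
        counts = defaultdict(Counter)
        for i, (feature, answer) in enumerate(samples):
            if i % k != fold:
                counts[feature][answer] += 1
        # Score the held-out samples using text-only statistics.
        for i, (feature, answer) in enumerate(samples):
            if i % k == fold:
                seen = counts[feature]
                total = sum(seen.values())
                scores[i] = seen[answer] / total if total else 0.0
    return scores


def iterative_bias_pruning(samples, threshold=0.5, batch=2, max_rounds=10, k=5):
    """Repeatedly re-score and drop the highest-bias samples.

    Hypothetical stopping rule: stop when no sample scores above
    `threshold`, when fewer than `k` samples remain, or after `max_rounds`.
    """
    kept = list(samples)
    for _ in range(max_rounds):
        if len(kept) < k:
            break
        scores = tst_scores(kept, k)
        ranked = sorted(range(len(kept)), key=lambda i: scores[i], reverse=True)
        to_drop = {i for i in ranked[:batch] if scores[i] > threshold}
        if not to_drop:
            break
        # Re-scoring next round matters: removing samples shifts the statistics.
        kept = [s for i, s in enumerate(kept) if i not in to_drop]
    return kept


if __name__ == "__main__":
    # Toy data: one question template whose answer is always "2" (a text-only
    # shortcut), plus questions whose answers cannot be guessed from text.
    biased = [("how many chairs", "2")] * 10
    unbiased = [(f"unique question {i}", "a") for i in range(10)]
    pruned = iterative_bias_pruning(biased + unbiased)
    print(len(biased + unbiased), "->", len(pruned))
```

Re-scoring after every pruning round is the point of the iterative design in this sketch: once the most guessable samples are removed, the remaining answer statistics change, so scores computed once up front would go stale.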
