Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts

Abstract

Robust benchmarks are crucial for evaluating Multimodal Large Language Models (MLLMs). Yet we find that models can ace many multimodal benchmarks without strong visual understanding, instead exploiting biases, linguistic priors, and superficial patterns. This is especially problematic for vision-centric benchmarks that are meant to require visual inputs. We adopt a diagnostic principle for benchmark design: if a benchmark can be gamed, it will be. Designers should therefore try to "game" their own benchmarks first, using diagnostic and debiasing procedures to systematically identify and mitigate non-visual biases. Effective diagnosis requires directly "training on the test set", that is, probing the released test set for its intrinsic, exploitable patterns. We operationalize this standard with two components. First, we diagnose benchmark susceptibility using a "Test-set Stress-Test" (TsT) methodology. Our primary diagnostic tool fine-tunes a powerful Large Language Model via k-fold cross-validation exclusively on the non-visual, textual inputs of the test set, which reveals shortcut performance and assigns each sample a bias score s(x). We complement this with a lightweight Random Forest-based diagnostic operating on hand-crafted features for fast, interpretable auditing. Second, we debias benchmarks by filtering high-bias samples using an "Iterative Bias Pruning" (IBP) procedure. Applying this framework to four benchmarks (VSI-Bench, CV-Bench, MMMU, and VideoMME), we uncover pervasive non-visual biases. As a case study, we apply our full framework to create VSI-Bench-Debiased, demonstrating reduced non-visual solvability and a wider vision-blind performance gap than the original.
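The two components described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the fine-tuned LLM and the Random Forest are swapped for a per-template answer frequency table so that the example runs with the Python standard library alone. The names `tst_bias_scores` and `iterative_bias_pruning`, the `(template, answer)` sample format, and the parameters `tau` and `batch` are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def tst_bias_scores(samples, k=5):
    """Held-out bias score s(x) in [0, 1] for every sample.

    Each sample is a (template, answer) pair, i.e. text-only information.
    The "blind model" is a per-template answer frequency table, a toy
    stand-in (assumption) for the fine-tuned LLM or Random Forest.
    """
    scores = [0.0] * len(samples)
    for fold in range(k):
        # Fit on all samples outside the current fold.
        table = defaultdict(Counter)
        for i, (template, answer) in enumerate(samples):
            if i % k != fold:
                table[template][answer] += 1
        # Score the held-out samples: probability given to the true answer.
        for i, (template, answer) in enumerate(samples):
            if i % k == fold:
                counts = table[template]
                total = sum(counts.values())
                scores[i] = counts[answer] / total if total else 0.0
    return scores


def iterative_bias_pruning(samples, tau=0.6, batch=2, k=5):
    """Repeatedly re-score the remaining set and drop the most biased samples.

    Stops when no sample scores above tau, or when too few samples remain
    for k-fold scoring. tau and batch are illustrative, not from the paper.
    """
    kept = list(samples)
    while len(kept) > k:
        scores = tst_bias_scores(kept, k)
        ranked = sorted(
            (i for i, s in enumerate(scores) if s > tau),
            key=lambda i: -scores[i],
        )
        if not ranked:
            break
        drop = set(ranked[:batch])
        kept = [x for i, x in enumerate(kept) if i not in drop]
    return kept


# Toy benchmark: "count" questions are skewed toward the answer "2",
# while "dir" questions are balanced between "left" and "right".
toy = (
    [("count", "2")] * 8
    + [("count", "3")] * 2
    + [("dir", "left"), ("dir", "right")] * 5
)
scores = tst_bias_scores(toy)
debiased = iterative_bias_pruning(toy)
print(f"mean s(x) before pruning: {sum(scores) / len(scores):.3f}")
print(f"samples kept: {len(debiased)} of {len(toy)}")
```

Re-scoring after every pruning round matters because removing samples changes the answer distribution that the blind model can exploit. A score computed once on the full set may therefore misrank what remains.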
