Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models

December 24, 2025
Authors: Li-Zhong Szu-Tu, Ting-Lin Wu, Chia-Jui Chang, He Syu, Yu-Lun Liu
cs.AI

Abstract

We expose a significant popularity bias in state-of-the-art vision-language models (VLMs), which achieve up to 34% higher accuracy on famous buildings compared to ordinary ones, indicating a reliance on memorization over generalizable understanding. To systematically investigate this, we introduce the largest open benchmark for this task: the YearGuessr dataset, a collection of 55,546 building images with multi-modal attributes from 157 countries, annotated with continuous ordinal labels of their construction year (1001-2024), GPS data, and page-view counts as a proxy for popularity. Using this dataset, we frame the construction year prediction task as ordinal regression and introduce popularity-aware interval accuracy metrics to quantify this bias. Our resulting benchmark of 30+ models, including our YearCLIP model, confirms that VLMs excel on popular, memorized items but struggle significantly with unrecognized subjects, exposing a critical flaw in their reasoning capabilities. Project page: https://sytwu.github.io/BeyondMemo/
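The abstract names popularity-aware interval accuracy metrics but does not spell out their exact form. Below is a minimal Python sketch of one plausible reading: a prediction counts as correct if it falls within a fixed tolerance of the true construction year, and accuracy is reported per popularity bin derived from page-view quantiles. The function names, the ±10-year tolerance, and the tercile binning are illustrative assumptions, not the paper's definition; a wide accuracy gap between the most- and least-viewed bins is the kind of bias the benchmark is meant to expose.

```python
import numpy as np

def interval_accuracy(pred_years, true_years, tolerance=10):
    """Fraction of predictions within +/- `tolerance` years of the true construction year."""
    pred_years = np.asarray(pred_years, dtype=float)
    true_years = np.asarray(true_years, dtype=float)
    return float(np.mean(np.abs(pred_years - true_years) <= tolerance))

def popularity_aware_interval_accuracy(pred_years, true_years, page_views,
                                       tolerance=10, n_bins=3):
    """Interval accuracy reported separately per popularity bin.

    Bins are formed from page-view quantiles (terciles by default). The tolerance
    and binning scheme are illustrative choices, not the paper's exact metric.
    """
    pred_years = np.asarray(pred_years, dtype=float)
    true_years = np.asarray(true_years, dtype=float)
    page_views = np.asarray(page_views, dtype=float)

    # Quantile edges over page views, e.g. terciles for n_bins=3.
    edges = np.quantile(page_views, np.linspace(0.0, 1.0, n_bins + 1))
    results = {}
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # Half-open bins, except the last, so every sample lands in exactly one bin.
        mask = (page_views >= lo) & ((page_views < hi) if i < n_bins - 1 else (page_views <= hi))
        if mask.any():
            results[f"popularity_bin_{i}"] = interval_accuracy(
                pred_years[mask], true_years[mask], tolerance)
        else:
            results[f"popularity_bin_{i}"] = float("nan")
    return results
```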