Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models
December 24, 2025
Authors: Li-Zhong Szu-Tu, Ting-Lin Wu, Chia-Jui Chang, He Syu, Yu-Lun Liu
cs.AI
Abstract
We expose a significant popularity bias in state-of-the-art vision-language models (VLMs), which achieve up to 34% higher accuracy on famous buildings compared to ordinary ones, indicating a reliance on memorization over generalizable understanding. To systematically investigate this, we introduce the largest open benchmark for this task: the YearGuessr dataset, a collection of 55,546 building images with multi-modal attributes from 157 countries, annotated with continuous ordinal labels of their construction year (1001-2024), GPS data, and page-view counts as a proxy for popularity. Using this dataset, we frame the construction year prediction task as ordinal regression and introduce popularity-aware interval accuracy metrics to quantify this bias. Our resulting benchmark of 30+ models, including our YearCLIP model, confirms that VLMs excel on popular, memorized items but struggle significantly with unrecognized subjects, exposing a critical flaw in their reasoning capabilities. Project page: https://sytwu.github.io/BeyondMemo/
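The abstract does not spell out how the popularity-aware interval accuracy metric is computed, so the sketch below is only one plausible reading, not the paper's official definition: predictions count as correct when the predicted construction year falls within a tolerance window of the ground truth, and accuracy is reported separately per popularity bucket derived from page-view quantiles. The function names, bucket edges, and the ±25-year tolerance are all assumptions for illustration.

```python
# Illustrative sketch only (not the YearGuessr benchmark's actual code).
# Assumptions: accuracy = |pred - true| <= tol years; popularity buckets are
# page-view quantile bins (low / mid / high by default edges below).
import numpy as np

def interval_accuracy(pred_years, true_years, tol=25):
    """Fraction of predictions within `tol` years of the true construction year."""
    pred_years = np.asarray(pred_years, dtype=float)
    true_years = np.asarray(true_years, dtype=float)
    return float(np.mean(np.abs(pred_years - true_years) <= tol))

def popularity_aware_accuracy(pred_years, true_years, page_views,
                              quantiles=(0.5, 0.9), tol=25):
    """Interval accuracy split by page-view quantile buckets (assumed proxy for popularity)."""
    page_views = np.asarray(page_views, dtype=float)
    edges = np.quantile(page_views, quantiles)
    buckets = np.digitize(page_views, edges)  # 0 = least popular, len(quantiles) = most popular
    results = {}
    for b in range(len(quantiles) + 1):
        mask = buckets == b
        if mask.any():
            results[f"bucket_{b}"] = interval_accuracy(
                np.asarray(pred_years)[mask],
                np.asarray(true_years)[mask],
                tol,
            )
    return results
```

Under this reading, a large gap between the highest- and lowest-popularity buckets (e.g., the up-to-34% accuracy difference the abstract reports for famous versus ordinary buildings) is what the metric is meant to surface.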