Beyond Perception Errors: Semantic Fixation in Large Vision-Language Models
April 13, 2026
Author: Md Tanvirul Alam
cs.AI
Abstract
Large vision-language models (VLMs) often rely on familiar semantic priors, but existing evaluations do not cleanly separate perception failures from rule-mapping failures. We study this behavior as semantic fixation: preserving a default interpretation even when the prompt specifies an alternative, equally valid mapping. To isolate this effect, we introduce VLM-Fix, a controlled benchmark over four abstract strategy games that evaluates identical terminal board states under paired standard and inverse rule formulations. Across 14 open and closed VLMs, accuracy consistently favors standard rules, revealing a robust semantic-fixation gap. Prompt interventions support this mechanism: neutral alias prompts substantially narrow the inverse-rule gap, while semantically loaded aliases reopen it. Post-training is strongly rule-aligned: training on one rule improves same-rule transfer but hurts opposite-rule transfer, while joint-rule training improves broader transfer. To test external validity beyond synthetic games, we evaluate analogous defamiliarization interventions on VLMBias and observe the same qualitative pattern. Finally, late-layer activation steering partially recovers degraded performance, indicating that semantic-fixation errors are at least partly editable in late representations. Project page, code, and dataset available at https://maveryn.github.io/vlm-fix/.
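The core measurement the abstract describes is a paired comparison: the same terminal board state is scored once under the standard rule and once under the inverse rule, and the accuracy difference defines the semantic-fixation gap. A minimal sketch of that computation, using an illustrative record schema (the field names `rule` and `correct` are assumptions, not the benchmark's actual format):

```python
def fixation_gap(records):
    """Compute the semantic-fixation gap from paired evaluations.

    records: list of dicts with 'rule' ('standard' or 'inverse') and
    'correct' (bool); each board state appears once under each rule.
    """
    acc = {}
    for rule in ("standard", "inverse"):
        hits = [r["correct"] for r in records if r["rule"] == rule]
        acc[rule] = sum(hits) / len(hits)
    # A positive gap means the model is more accurate under the
    # familiar (standard) rule formulation.
    return acc["standard"] - acc["inverse"]


# Toy example: perfect under standard rules, 50% under inverse rules.
records = [
    {"rule": "standard", "correct": True},
    {"rule": "standard", "correct": True},
    {"rule": "inverse", "correct": True},
    {"rule": "inverse", "correct": False},
]
print(fixation_gap(records))  # → 0.5
```

Because the board states are identical across the two conditions, any persistent positive gap isolates rule-mapping failure from perception failure, which is the point of the paired design.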