
Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs

December 9, 2025
Authors: Angela van Sprang, Laurens Samson, Ana Lucic, Erman Acar, Sennay Ghebreab, Yuki M. Asano
cs.AI

Abstract

We introduce two new benchmarks, REST and REST+ (Render-Equivalence Stress Tests), to enable systematic evaluation of cross-modal inconsistency in multimodal large language models (MLLMs). MLLMs are trained to represent vision and language in the same embedding space, yet they cannot perform the same tasks equally well in both modalities. Our benchmarks contain samples with the same semantic information in three modalities (image, text, mixed), and we show that state-of-the-art MLLMs cannot reason consistently over these different modalities. We evaluate 15 MLLMs and find that the degree of modality inconsistency varies substantially, even after accounting for problems with text recognition (OCR). Neither rendering text as an image nor rendering an image as text resolves the inconsistency. Even when OCR is correct, we find that visual characteristics (text colour and resolution, but not font) and the number of vision tokens affect model performance. Finally, we find that our consistency score correlates with the modality gap between text and images, pointing to a mechanistic interpretation of cross-modal inconsistency in MLLMs.
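The abstract does not spell out how the consistency score or the modality gap are computed. As a rough, self-contained sketch: `modality_gap` below follows the common centroid-distance definition of the modality gap (the distance between the means of L2-normalised image and text embeddings), and `consistency_score` is one plausible reading of cross-modal consistency (the fraction of samples answered identically in all three renderings). Both function names, the random placeholder embeddings, and the toy answers are illustrative assumptions, not the paper's implementation.

```python
# Sketch of two measurements the abstract alludes to; all inputs are
# placeholders. In practice the embeddings would come from an MLLM's
# vision and text encoders for paired image/text renderings.
import numpy as np

def modality_gap(image_embs: np.ndarray, text_embs: np.ndarray) -> float:
    """Centroid distance between L2-normalised image and text embeddings."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

def consistency_score(answers_by_modality: dict[str, list[str]]) -> float:
    """Fraction of samples answered identically in every modality
    (image, text, mixed) -- a plausible reading, not the paper's exact metric."""
    per_sample = list(zip(*answers_by_modality.values()))
    agree = sum(len(set(sample)) == 1 for sample in per_sample)
    return agree / len(per_sample)

# Toy usage with random placeholder embeddings and hand-written answers.
rng = np.random.default_rng(0)
gap = modality_gap(rng.normal(size=(100, 512)), rng.normal(size=(100, 512)))
score = consistency_score({"image": ["A", "B"],
                           "text":  ["A", "B"],
                           "mixed": ["A", "C"]})
print(f"modality gap: {gap:.3f}, consistency: {score:.2f}")
```

Under this reading, a model that answers the same question differently depending on whether it arrives as pixels or as tokens scores low on consistency; the paper's finding is that this score tracks how far apart the two modalities sit in embedding space.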