
Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs

December 9, 2025
Authors: Angela van Sprang, Laurens Samson, Ana Lucic, Erman Acar, Sennay Ghebreab, Yuki M. Asano
cs.AI

Abstract

We introduce two new benchmarks, REST and REST+ (Render-Equivalence Stress Tests), to enable systematic evaluation of cross-modal inconsistency in multimodal large language models (MLLMs). MLLMs are trained to represent vision and language in the same embedding space, yet they cannot perform the same tasks in both modalities. Our benchmarks contain samples with the same semantic information in three modalities (image, text, mixed), and we show that state-of-the-art MLLMs cannot reason consistently over these different modalities. We evaluate 15 MLLMs and find that the degree of modality inconsistency varies substantially, even when accounting for problems with text recognition (OCR). Neither rendering text as an image nor rendering an image as text resolves the inconsistency. Even when OCR is correct, visual characteristics (text colour and resolution, but not font) and the number of vision tokens affect model performance. Finally, we find that our consistency score correlates with the modality gap between text and images, suggesting a mechanistic interpretation of cross-modal inconsistency in MLLMs.
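To make the render-equivalence idea concrete, below is a minimal sketch (not the authors' released code) of how such a probe might be built: the same question is posed once as plain text and once rasterised into an image, the two answers are compared for agreement, and the modality gap is measured over the model's embeddings. The function names, the exact-match agreement criterion, and the centroid-distance definition of the gap are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a render-equivalence probe in the spirit of REST.
# Assumptions (not from the paper): exact-match agreement as the
# consistency criterion, and centroid distance between L2-normalised
# embeddings as the modality gap.

import numpy as np
from PIL import Image, ImageDraw

def render_text_as_image(text: str, width: int = 768, height: int = 256) -> Image.Image:
    """Rasterise a question so the model must read it through the vision pathway."""
    img = Image.new("RGB", (width, height), color="white")
    ImageDraw.Draw(img).multiline_text((16, 16), text, fill="black")  # default PIL font
    return img

def consistency_score(answers_text: list[str], answers_image: list[str]) -> float:
    """Fraction of items where the text-modality and image-modality answers agree."""
    agree = sum(a.strip().lower() == b.strip().lower()
                for a, b in zip(answers_text, answers_image))
    return agree / max(len(answers_text), 1)

def modality_gap(text_embs: np.ndarray, image_embs: np.ndarray) -> float:
    """Distance between the centroids of L2-normalised text and image embeddings."""
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    return float(np.linalg.norm(t.mean(axis=0) - v.mean(axis=0)))
```

Under this framing, the paper's finding corresponds to consistency_score staying well below 1 even when OCR on the rendered image is perfect, and co-varying with modality_gap across models.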