

Enhancing Multi-Image Understanding through Delimiter Token Scaling

February 2, 2026
Authors: Minyoung Lee, Yeji Park, Dongjun Hwang, Yejin Kim, Seong Joon Oh, Junsuk Choe
cs.AI

Abstract

Large Vision-Language Models (LVLMs) achieve strong performance on single-image tasks, but their performance declines when multiple images are provided as input. One major reason is cross-image information leakage, where the model struggles to distinguish information across different images. Existing LVLMs already employ delimiter tokens to mark the start and end of each image, yet our analysis reveals that these tokens fail to effectively block cross-image information leakage. To enhance their effectiveness, we propose a method that scales the hidden states of delimiter tokens. This enhances the model's ability to preserve image-specific information by reinforcing intra-image interaction and limiting undesired cross-image interactions. Consequently, the model can better distinguish between images and reason over them more accurately. Experiments show performance gains on multi-image benchmarks such as Mantis, MuirBench, MIRB, and QBench2. We further evaluate our method on text-only tasks that require clearly separating multiple inputs; it improves performance on multi-document and multi-table understanding benchmarks, including TQABench, MultiNews, and WCEP-10. Notably, our method requires no additional training or inference cost.
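
To make the mechanism described above concrete, here is a minimal PyTorch sketch of scaling hidden states at delimiter-token positions. The function name, the scale value, and the assumption that delimiter positions arrive as a boolean mask are all illustrative; the abstract does not specify which layers or which delimiter tokens the paper actually scales.

```python
# A minimal sketch of delimiter-token hidden-state scaling, assuming the
# delimiter positions are available as a boolean mask. The scale value
# and function interface are illustrative, not taken from the paper.
import torch


def scale_delimiter_hidden_states(
    hidden_states: torch.Tensor,   # (batch, seq_len, hidden_dim)
    delimiter_mask: torch.Tensor,  # (batch, seq_len), True at image delimiters
    scale: float = 2.0,            # hypothetical scaling factor
) -> torch.Tensor:
    """Amplify the hidden states of image start/end delimiter tokens.

    Enlarging the delimiter activations lets them draw more attention
    mass, which, per the abstract, reinforces intra-image interaction
    and limits undesired cross-image interaction. The operation is a
    single elementwise multiply, so it adds no parameters, no training,
    and no extra inference cost.
    """
    out = hidden_states.clone()
    out[delimiter_mask] = out[delimiter_mask] * scale
    return out
```

In practice, the mask could be built by comparing the input token IDs against the model's image-delimiter IDs (e.g. `delimiter_mask = torch.isin(input_ids, delimiter_token_ids)`) and the scaling applied to the hidden states passed between decoder layers, for instance via a forward hook. Both the hook placement and the choice of delimiter IDs here are hypothetical.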