Enhancing Multi-Image Understanding through Delimiter Token Scaling

Abstract

Large Vision-Language Models (LVLMs) achieve strong performance on single-image tasks, but their performance declines when multiple images are provided as input. One major cause is cross-image information leakage, in which the model struggles to keep information from different images separate. Existing LVLMs already use delimiter tokens to mark the start and end of each image, yet our analysis shows that these tokens fail to block cross-image information leakage effectively. To make them more effective, we propose a method that scales the hidden states of delimiter tokens. Scaling reinforces intra-image interaction and limits undesired cross-image interaction, which improves the model's ability to preserve image-specific information. As a result, the model distinguishes between images more clearly and reasons over them more accurately. Experiments show performance gains on multi-image benchmarks such as Mantis, MuirBench, MIRB, and QBench2. We further evaluate our method on text-only tasks that require clear distinction between inputs, and it improves performance on multi-document and multi-table understanding benchmarks, including TQABench, MultiNews, and WCEP-10. Notably, our method requires no additional training or inference cost.
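To make the core operation concrete, the following is a minimal sketch of what scaling the hidden states of delimiter tokens could look like on a toy sequence. It is not the authors' implementation: the abstract does not state which layers are modified, what scale factor is used, or where in the forward pass the scaling is applied. The function name `scale_delimiter_states`, the token ids `100` and `101`, and the factor `2.0` are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def scale_delimiter_states(
    hidden_states: Sequence[Sequence[float]],
    token_ids: Sequence[int],
    delimiter_ids: Sequence[int],
    scale: float,
) -> List[List[float]]:
    """Return a copy of hidden_states with delimiter-token rows multiplied by scale."""
    if len(hidden_states) != len(token_ids):
        raise ValueError("hidden_states and token_ids must have the same length")
    delimiters = set(delimiter_ids)
    scaled = []
    for state, token_id in zip(hidden_states, token_ids):
        # Only positions holding a delimiter token are rescaled; all others pass through.
        factor = scale if token_id in delimiters else 1.0
        scaled.append([factor * value for value in state])
    return scaled


# Hypothetical toy sequence: ids 100 and 101 stand in for image start/end delimiters.
IMG_START, IMG_END = 100, 101
token_ids = [IMG_START, 7, 8, IMG_END, IMG_START, 9, IMG_END, 42]
hidden_states = [[1.0, -1.0] for _ in token_ids]

# The scale value 2.0 is an arbitrary assumption, not a value from the paper.
scaled_states = scale_delimiter_states(
    hidden_states, token_ids, [IMG_START, IMG_END], scale=2.0
)
```

Because the operation is a single elementwise multiplication at a handful of positions, a sketch like this is consistent with the abstract's claim that the method needs no additional training. How the scaled states then affect attention between images depends on the model architecture and is not modeled here.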

