LEOPARD: A Vision Language Model for Text-Rich Multi-Image Tasks

Abstract

Text-rich images, in which text is the central visual element guiding overall understanding, are prevalent in real-world applications such as presentation slides, scanned documents, and webpage snapshots. Tasks involving multiple text-rich images are especially challenging, as they require not only understanding the content of individual images but also reasoning about the inter-relationships and logical flow across multiple visual inputs. Despite the importance of these scenarios, current multimodal large language models (MLLMs) struggle to handle such tasks because of two key challenges: (1) the scarcity of high-quality instruction-tuning datasets for text-rich multi-image scenarios, and (2) the difficulty of balancing image resolution against visual feature sequence length. To address these challenges, we propose LEOPARD, an MLLM designed specifically for vision-language tasks involving multiple text-rich images. First, we curated about one million high-quality multimodal instruction-tuning instances tailored to text-rich, multi-image scenarios. Second, we developed an adaptive high-resolution multi-image encoding module that dynamically optimizes the allocation of visual sequence length based on the original aspect ratios and resolutions of the input images. Experiments across a wide range of benchmarks demonstrate our model's superior capabilities in text-rich, multi-image evaluations and its competitive performance in general-domain evaluations.
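
To make the trade-off between image resolution and visual sequence length concrete, the following minimal sketch splits each image into a grid of fixed-size tiles that follows its aspect ratio, then greedily shrinks the largest grids until the total tile count fits a shared budget. It is an illustration only and not the encoding module proposed in this work: the function name `allocate_tiles`, the tile size of 336 pixels, the budget of 50 tiles, and the greedy shrinking rule are all assumptions made for the example.

```python
import math


def allocate_tiles(image_sizes, tile=336, max_total_tiles=50):
    """Return one (cols, rows) tile grid per image under a shared tile budget.

    image_sizes: list of (width, height) in pixels.
    tile: side length of one square tile in pixels (hypothetical value).
    max_total_tiles: budget on the summed tile count (hypothetical value).
    """
    # Start from the grid that covers each image at its native resolution.
    grids = [
        [max(1, math.ceil(w / tile)), max(1, math.ceil(h / tile))]
        for w, h in image_sizes
    ]

    def total():
        return sum(c * r for c, r in grids)

    # Greedily shrink the image that currently uses the most tiles.
    while total() > max_total_tiles:
        i = max(range(len(grids)), key=lambda k: grids[k][0] * grids[k][1])
        cols, rows = grids[i]
        if cols * rows == 1:
            # Every image is already down to a single tile; the budget
            # cannot be met without dropping images, so stop here.
            break
        # Shrink the longer side so the grid stays close to the aspect ratio.
        if cols >= rows:
            grids[i][0] -= 1
        else:
            grids[i][1] -= 1

    return [tuple(g) for g in grids]
```

For example, `allocate_tiles([(672, 336), (336, 672)])` keeps both images at native resolution because four tiles fit the budget, whereas two very large images are each reduced until their combined tile count is at most `max_total_tiles`. Because each tile contributes a fixed number of visual tokens, the tile budget acts as a stand-in for the visual sequence length.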
