ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models

Abstract

Efficient processing of high-resolution images is crucial for real-world vision-language applications. However, existing Large Vision-Language Models (LVLMs) incur substantial computational overhead due to the large number of vision tokens. With the advent of "thinking with images" models, reasoning now extends beyond text to the visual domain. This capability motivates our two-stage "coarse-to-fine" reasoning pipeline: first, a downsampled image is analyzed to identify task-relevant regions; then, only these regions are cropped at full resolution and processed in a subsequent reasoning stage. This approach reduces computational cost while preserving fine-grained visual details where necessary. A major challenge lies in inferring which regions are truly relevant to a given query. Recent related methods often fail in the first stage, after the input image has been downsampled, because they rely on perception-driven reasoning, which requires clear visual information to reason effectively. To address this issue, we propose ERGO (Efficient Reasoning & Guided Observation), which performs reasoning-driven perception, leveraging multimodal context to determine where to focus. Our model can account for perceptual uncertainty, expanding the cropped region to cover visually ambiguous areas when answering questions. To this end, we develop simple yet effective reward components in a reinforcement learning framework for coarse-to-fine perception. Across multiple datasets, our approach delivers higher accuracy than the original model and competing methods, with greater efficiency. For instance, ERGO surpasses Qwen2.5-VL-7B on the V* benchmark by 4.7 points while using only 23% of the vision tokens, achieving a 3x inference speedup. The code and models are available at https://github.com/nota-github/ERGO.
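The geometry of the two-stage pipeline can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Box`, `scale_box_to_full_res`, `expand_box`, and `approx_vision_tokens` are hypothetical, the fixed fractional margin is only a stand-in for the region expansion that ERGO learns through reinforcement learning, and the figure of one vision token per 28x28 pixel block is an assumption about the backbone's tokenization. The stage-one box is hard-coded where a model prediction would normally go.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (hypothetical helper type)."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top


def scale_box_to_full_res(box: Box, coarse_size, full_size) -> Box:
    """Map a box predicted on the downsampled image to full-resolution pixels."""
    sx = full_size[0] / coarse_size[0]
    sy = full_size[1] / coarse_size[1]
    return Box(
        int(box.left * sx),
        int(box.top * sy),
        math.ceil(box.right * sx),
        math.ceil(box.bottom * sy),
    )


def expand_box(box: Box, margin: float, full_size) -> Box:
    """Grow the box by a fractional margin per side, clamped to the image.

    Assumption: a fixed margin stands in for the learned expansion over
    visually ambiguous areas described in the abstract.
    """
    dx = int(box.width * margin)
    dy = int(box.height * margin)
    return Box(
        max(0, box.left - dx),
        max(0, box.top - dy),
        min(full_size[0], box.right + dx),
        min(full_size[1], box.bottom + dy),
    )


def approx_vision_tokens(width: int, height: int, pixels_per_token: int = 28) -> int:
    """Rough token count, assuming one token per 28x28 pixel block."""
    return math.ceil(width / pixels_per_token) * math.ceil(height / pixels_per_token)


# Illustrative numbers only; none of these come from the paper.
full_size = (4000, 3000)             # original image (width, height)
coarse_size = (1000, 750)            # downsampled image fed to stage one
coarse_box = Box(400, 300, 500, 400) # stand-in for the stage-one prediction

crop = expand_box(scale_box_to_full_res(coarse_box, coarse_size, full_size), 0.25, full_size)

two_stage_tokens = (
    approx_vision_tokens(*coarse_size) + approx_vision_tokens(crop.width, crop.height)
)
full_tokens = approx_vision_tokens(*full_size)
print(crop, two_stage_tokens, full_tokens)
```

In this toy setting the model sees the coarse image plus one full-resolution crop instead of the entire full-resolution image, which is where the token savings come from. The actual savings depend on the image, the query, and the regions the model selects.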
