Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs

Abstract

Vision-language models (VLMs) typically process images at native high resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational costs, while low-resolution inputs are efficient but can miss critical visual information such as small text. We present AwaRes, a spatial-on-demand framework that resolves this accuracy-efficiency trade-off by operating on a low-resolution global view and using tool-calling to retrieve only the high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- and high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO with a composite reward that combines semantic answer correctness with explicit crop-cost penalties. Project page: https://nimrodshabtay.github.io/AwaRes
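
To make the two mechanisms in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a linear per-crop penalty for the composite reward and assumes that a sample is labeled as needing crops only when the high-resolution answer is correct and the low-resolution answer is not. The names `Rollout`, `composite_reward`, `needs_crop`, and the `crop_penalty` value are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    """One multi-turn trajectory (hypothetical container, not from the paper)."""
    answer_correct: float  # semantic correctness score in [0, 1], e.g. from a judge
    num_crops: int         # number of high-resolution crops requested via tool calls


def composite_reward(rollout: Rollout, crop_penalty: float = 0.1) -> float:
    """Correctness minus a cost per requested crop.

    The linear form and the default penalty are assumptions for illustration;
    the abstract only states that correctness and crop-cost penalties are combined.
    """
    if not 0.0 <= rollout.answer_correct <= 1.0:
        raise ValueError("answer_correct must lie in [0, 1]")
    if rollout.num_crops < 0:
        raise ValueError("num_crops must be non-negative")
    return rollout.answer_correct - crop_penalty * rollout.num_crops


def needs_crop(low_res_correct: bool, high_res_correct: bool) -> bool:
    """Assumed labeling rule: crops are needed only when high resolution
    fixes an answer that the low-resolution view gets wrong."""
    return high_res_correct and not low_res_correct


if __name__ == "__main__":
    cheap = Rollout(answer_correct=1.0, num_crops=0)
    costly = Rollout(answer_correct=1.0, num_crops=2)
    # With equal correctness, the rollout that requests fewer crops scores higher.
    print(composite_reward(cheap), composite_reward(costly))
    print(needs_crop(low_res_correct=False, high_res_correct=True))
```

Under this assumed reward, two equally correct rollouts are ranked by how few crops they request, which is the behavior the crop-cost penalty is meant to encourage.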
