

ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models

September 26, 2025
Authors: Jewon Lee, Wooksu Shin, Seungmin Yang, Ki-Ung Song, DongUk Lim, Jaeyeon Kim, Tae-Ho Kim, Bo-Kyeong Kim
cs.AI

Abstract

Efficient processing of high-resolution images is crucial for real-world vision-language applications. However, existing Large Vision-Language Models (LVLMs) incur substantial computational overhead due to the large number of vision tokens. With the advent of "thinking with images" models, reasoning now extends beyond text to the visual domain. This capability motivates our two-stage "coarse-to-fine" reasoning pipeline: first, a downsampled image is analyzed to identify task-relevant regions; then, only these regions are cropped at full resolution and processed in a subsequent reasoning stage. This approach reduces computational cost while preserving fine-grained visual details where necessary. A major challenge lies in inferring which regions are truly relevant to a given query. Recent related methods often fail in the first stage after input-image downsampling, due to perception-driven reasoning, where clear visual information is required for effective reasoning. To address this issue, we propose ERGO (Efficient Reasoning & Guided Observation) that performs reasoning-driven perception-leveraging multimodal context to determine where to focus. Our model can account for perceptual uncertainty, expanding the cropped region to cover visually ambiguous areas for answering questions. To this end, we develop simple yet effective reward components in a reinforcement learning framework for coarse-to-fine perception. Across multiple datasets, our approach delivers higher accuracy than the original model and competitive methods, with greater efficiency. For instance, ERGO surpasses Qwen2.5-VL-7B on the V* benchmark by 4.7 points while using only 23% of the vision tokens, achieving a 3x inference speedup. The code and models can be found at: https://github.com/nota-github/ERGO.
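The two-stage coarse-to-fine pipeline described above can be pictured roughly as follows. This is a minimal sketch assuming a generic `vlm(image, prompt) -> str` callable (for example, a wrapper around Qwen2.5-VL) and an illustrative bounding-box prompt and expansion margin; it is not the released ERGO interface.

```python
# Sketch of a two-stage coarse-to-fine inference loop: locate a region on a
# downsampled image, then answer from a full-resolution crop of that region.
import re
from typing import Callable
from PIL import Image


def coarse_to_fine_answer(
    vlm: Callable[[Image.Image, str], str],  # hypothetical VLM wrapper
    image: Image.Image,
    question: str,
    downsample: int = 4,
    margin: float = 0.1,
) -> str:
    w, h = image.size

    # Stage 1: reason over a downsampled view (few vision tokens) to find
    # the task-relevant region, reported in original-image coordinates.
    coarse = image.resize((max(1, w // downsample), max(1, h // downsample)))
    locate_prompt = (
        f"{question}\nReply with the bounding box x1,y1,x2,y2 (pixels in the "
        f"original {w}x{h} image) of the region needed to answer."
    )
    nums = re.findall(r"-?\d+", vlm(coarse, locate_prompt))
    x1, y1, x2, y2 = map(int, nums[:4])

    # Expand the crop to cover perceptual uncertainty around ambiguous areas
    # (the margin value here is illustrative, not taken from the paper).
    bw, bh = x2 - x1, y2 - y1
    box = (
        max(0, int(x1 - margin * bw)),
        max(0, int(y1 - margin * bh)),
        min(w, int(x2 + margin * bw)),
        min(h, int(y2 + margin * bh)),
    )

    # Stage 2: answer the question using only the full-resolution crop.
    return vlm(image.crop(box), question)
```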