

ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models

September 26, 2025
Authors: Jewon Lee, Wooksu Shin, Seungmin Yang, Ki-Ung Song, DongUk Lim, Jaeyeon Kim, Tae-Ho Kim, Bo-Kyeong Kim
cs.AI

Abstract

Efficient processing of high-resolution images is crucial for real-world vision-language applications. However, existing Large Vision-Language Models (LVLMs) incur substantial computational overhead due to the large number of vision tokens. With the advent of "thinking with images" models, reasoning now extends beyond text to the visual domain. This capability motivates our two-stage "coarse-to-fine" reasoning pipeline: first, a downsampled image is analyzed to identify task-relevant regions; then, only these regions are cropped at full resolution and processed in a subsequent reasoning stage. This approach reduces computational cost while preserving fine-grained visual details where necessary. A major challenge lies in inferring which regions are truly relevant to a given query. Recent related methods often fail in the first stage after input-image downsampling because they rely on perception-driven reasoning, which requires clear visual information to reason effectively. To address this issue, we propose ERGO (Efficient Reasoning & Guided Observation), which performs reasoning-driven perception, leveraging multimodal context to determine where to focus. Our model can account for perceptual uncertainty, expanding the cropped region to cover visually ambiguous areas for answering questions. To this end, we develop simple yet effective reward components in a reinforcement learning framework for coarse-to-fine perception. Across multiple datasets, our approach delivers higher accuracy than the original model and competitive methods, with greater efficiency. For instance, ERGO surpasses Qwen2.5-VL-7B on the V* benchmark by 4.7 points while using only 23% of the vision tokens, achieving a 3x inference speedup. The code and models can be found at: https://github.com/nota-github/ERGO.
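The abstract outlines a two-stage coarse-to-fine inference flow: locate task-relevant regions on a downsampled view, then answer from full-resolution crops of slightly expanded regions. The minimal sketch below illustrates that flow under stated assumptions; `locate_regions`, `answer`, `coarse_long_side`, and `expand_ratio` are hypothetical placeholders standing in for the model calls and hyperparameters, not the released ERGO interface.

```python
from PIL import Image

def locate_regions(image: Image.Image, question: str) -> list[tuple[int, int, int, int]]:
    """Stage 1 (placeholder): reason over the low-resolution view and return
    candidate boxes (left, upper, right, lower) in the coarse image's coordinates."""
    raise NotImplementedError("plug in an LVLM that emits region proposals")

def answer(crops: list[Image.Image], question: str) -> str:
    """Stage 2 (placeholder): answer the question from full-resolution crops."""
    raise NotImplementedError("plug in an LVLM for the final answer")

def coarse_to_fine(image_path: str, question: str,
                   coarse_long_side: int = 448, expand_ratio: float = 0.1) -> str:
    image = Image.open(image_path)
    w, h = image.size

    # Stage 1: downsample so the long side is `coarse_long_side`, cutting the
    # number of vision tokens, then ask the model where to look.
    scale = coarse_long_side / max(w, h)
    coarse = image.resize((int(w * scale), int(h * scale)))
    boxes = locate_regions(coarse, question)

    # Map each box back to full resolution and expand it slightly so that
    # visually ambiguous borders are still covered by the crop.
    crops = []
    for left, upper, right, lower in boxes:
        pad_w = (right - left) * expand_ratio
        pad_h = (lower - upper) * expand_ratio
        crop_box = (
            max(0, int((left - pad_w) / scale)),
            max(0, int((upper - pad_h) / scale)),
            min(w, int((right + pad_w) / scale)),
            min(h, int((lower + pad_h) / scale)),
        )
        crops.append(image.crop(crop_box))

    # Stage 2: only the cropped, full-resolution regions are processed.
    return answer(crops, question)
```

The expansion step mirrors the uncertainty-aware cropping described in the abstract: enlarging each box hedges against imprecise localization at the cost of a few extra vision tokens.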