Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs
March 14, 2026
Authors: Nimrod Shabtay, Moshe Kimhi, Artem Spector, Sivan Haray, Ehud Rivlin, Chaim Baskin, Raja Giryes, Eli Schwartz
cs.AI
Abstract
Vision-language models (VLMs) typically process images at native high resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational costs, while low-resolution inputs improve efficiency but may miss critical visual information, such as small text. We present AwaRes, a spatial-on-demand framework that resolves this accuracy-efficiency trade-off by operating on a low-resolution global view and using tool-calling to retrieve only the high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- vs. high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO with a composite reward that combines semantic answer correctness with explicit crop-cost penalties. Project page: https://nimrodshabtay.github.io/AwaRes
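The composite reward described above can be illustrated with a minimal sketch. The function name, weights, and the linear form of the penalty are assumptions for illustration; the abstract only states that the reward combines semantic answer correctness with an explicit crop-cost penalty.

```python
def composite_reward(answer_score: float, num_crops: int,
                     crop_cost: float = 0.1) -> float:
    """Hypothetical composite reward: semantic answer correctness
    (e.g., a judge score in [0, 1]) minus a per-crop cost penalty.
    The 0.1 weight is an illustrative assumption, not from the paper.
    """
    return answer_score - crop_cost * num_crops
```

Under such a reward, a policy that answers correctly from the low-resolution global view alone scores higher than one that requests unnecessary high-resolution crops, which is the intended efficiency pressure.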