Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs
March 14, 2026
Authors: Nimrod Shabtay, Moshe Kimhi, Artem Spector, Sivan Haray, Ehud Rivlin, Chaim Baskin, Raja Giryes, Eli Schwartz
cs.AI
Abstract
Vision-language models (VLMs) typically process images at native high resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational costs, while low-resolution inputs improve efficiency but may miss critical visual information, such as small text. We present AwaRes, a spatial-on-demand framework that resolves this accuracy-efficiency trade-off by operating on a low-resolution global view and using tool-calling to retrieve only the high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- vs. high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO with a composite reward that combines semantic answer correctness with explicit crop-cost penalties. Project page: https://nimrodshabtay.github.io/AwaRes
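The composite reward described above can be illustrated with a minimal sketch. The function name, weights, and the linear form of the penalty are assumptions for illustration; the abstract only states that the reward combines semantic answer correctness with an explicit crop-cost penalty.

```python
def composite_reward(answer_score: float, num_crops: int,
                     crop_cost: float = 0.1) -> float:
    """Hypothetical composite reward: semantic answer correctness
    (e.g., a judge score in [0, 1]) minus a per-crop cost penalty.
    The 0.1 weight is an illustrative assumption, not from the paper.
    """
    return answer_score - crop_cost * num_crops
```

Under such a reward, a policy that answers correctly from the low-resolution global view alone scores higher than one that requests unnecessary high-resolution crops, which is the intended efficiency pressure.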