

Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs

March 14, 2026
Authors: Nimrod Shabtay, Moshe Kimhi, Artem Spector, Sivan Haray, Ehud Rivlin, Chaim Baskin, Raja Giryes, Eli Schwartz
cs.AI

Abstract

Vision-language models (VLMs) typically process images at native high resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational costs, while low-resolution inputs improve efficiency but may miss critical visual information, such as small text. We present AwaRes, a spatial-on-demand framework that resolves this accuracy-efficiency trade-off by operating on a low-resolution global view and using tool-calling to retrieve only the high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- vs. high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO with a composite reward that combines semantic answer correctness with explicit crop-cost penalties. Project page: https://nimrodshabtay.github.io/AwaRes
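The composite reward described above combines semantic answer correctness with an explicit crop-cost penalty. A minimal sketch of such a reward is shown below; the function name, the binary correctness score, and the linear per-crop penalty with weight 0.1 are illustrative assumptions, not the paper's exact formulation.

```python
def composite_reward(answer_correct: bool, num_crops: int,
                     crop_penalty: float = 0.1) -> float:
    """Reward a multi-turn trajectory: semantic correctness
    minus an explicit cost for each high-resolution crop retrieved.
    The binary correctness term and linear penalty are assumptions."""
    correctness = 1.0 if answer_correct else 0.0
    return correctness - crop_penalty * num_crops

# A trajectory that answers correctly after retrieving two crops
# scores lower than one that answers correctly with no crops,
# pushing the policy to fetch high-resolution segments only on demand.
print(composite_reward(True, 2))   # 0.8 under these assumed weights
print(composite_reward(True, 0))   # 1.0
```

Under this kind of shaping, an incorrect answer earns no correctness reward regardless of how many crops were fetched, so the policy is discouraged from both over-cropping and under-cropping.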