

Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs

March 14, 2026
Authors: Nimrod Shabtay, Moshe Kimhi, Artem Spector, Sivan Haray, Ehud Rivlin, Chaim Baskin, Raja Giryes, Eli Schwartz
cs.AI

Abstract

Vision-language models (VLMs) typically process images at native high resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational costs, while low-resolution inputs improve efficiency but may miss critical visual information, such as small text. We present AwaRes, a spatial-on-demand framework that resolves this accuracy-efficiency trade-off by operating on a low-resolution global view and using tool-calling to retrieve only the high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- vs. high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO with a composite reward that combines semantic answer correctness with explicit crop-cost penalties. Project page: https://nimrodshabtay.github.io/AwaRes
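The composite reward described above combines semantic answer correctness with an explicit crop-cost penalty. A minimal sketch of such a reward is shown below; the function name, the binary correctness score, and the linear per-crop penalty with weight 0.1 are illustrative assumptions, not the paper's exact formulation.

```python
def composite_reward(answer_correct: bool, num_crops: int,
                     crop_penalty: float = 0.1) -> float:
    """Reward a multi-turn trajectory: semantic correctness
    minus an explicit cost for each high-resolution crop retrieved.
    The binary correctness term and linear penalty are assumptions."""
    correctness = 1.0 if answer_correct else 0.0
    return correctness - crop_penalty * num_crops

# A trajectory that answers correctly after retrieving two crops
# scores lower than one that answers correctly with no crops,
# pushing the policy to fetch high-resolution segments only on demand.
print(composite_reward(True, 2))   # 0.8 under these assumed weights
print(composite_reward(True, 0))   # 1.0
```

Under this kind of shaping, an incorrect answer earns no correctness reward regardless of how many crops were fetched, so the policy is discouraged from both over-cropping and under-cropping.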