Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs

Abstract

Vision-language models (VLMs) typically process images at native high resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational costs, while low-resolution inputs are efficient but can miss critical visual information, such as small text. We present AwaRes, a spatial-on-demand framework that resolves this accuracy-efficiency trade-off by operating on a low-resolution global view and using tool calling to retrieve only the high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- vs. high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO with a composite reward that combines semantic answer correctness with explicit crop-cost penalties. Project page: https://nimrodshabtay.github.io/AwaRes
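
As a rough illustration of two ideas in the abstract, mapping localized evidence to a discrete crop set and scoring a trajectory with a composite reward, the following Python sketch may help. Everything specific in it is an assumption made for illustration and is not stated in the abstract: the four-quadrant `CROP_SET`, the area-based crop cost, the `cost_weight` value, and all function names are hypothetical.

```python
# Hypothetical discrete crop set: four quadrants in normalized
# (x0, y0, x1, y1) image coordinates. The actual crop set is not
# specified in the abstract.
CROP_SET = {
    "top_left": (0.0, 0.0, 0.5, 0.5),
    "top_right": (0.5, 0.0, 1.0, 0.5),
    "bottom_left": (0.0, 0.5, 0.5, 1.0),
    "bottom_right": (0.5, 0.5, 1.0, 1.0),
}


def overlaps(a, b):
    """Return True if boxes a and b share a region of positive area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def crops_for_evidence(evidence_box):
    """Map a grounding box to the names of every crop it overlaps."""
    return [name for name, box in CROP_SET.items() if overlaps(box, evidence_box)]


def crop_area(names):
    """Total fraction of the image covered by the requested crops."""
    total = 0.0
    for name in names:
        x0, y0, x1, y1 = CROP_SET[name]
        total += (x1 - x0) * (y1 - y0)
    return total


def composite_reward(answer_score, requested_crops, cost_weight=0.5):
    """Answer correctness minus an assumed area-proportional crop penalty."""
    return answer_score - cost_weight * crop_area(requested_crops)


# Evidence located in the upper-right part of the image.
needed = crops_for_evidence((0.6, 0.1, 0.9, 0.4))
reward = composite_reward(1.0, needed)
```

Under these assumptions, a correct answer that requests no crops keeps the full reward, while each additional high-resolution crop lowers it, which is the incentive the crop-cost penalty is meant to create.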
