ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models

Abstract

Efficient processing of high-resolution images is crucial for real-world vision-language applications. However, existing Large Vision-Language Models (LVLMs) incur substantial computational overhead due to the large number of vision tokens. With the advent of "thinking with images" models, reasoning now extends beyond text to the visual domain. This capability motivates our two-stage "coarse-to-fine" reasoning pipeline: first, a downsampled image is analyzed to identify task-relevant regions; then, only these regions are cropped at full resolution and processed in a subsequent reasoning stage. This approach reduces computational cost while preserving fine-grained visual details where necessary. A major challenge lies in inferring which regions are truly relevant to a given query. Recent related methods often fail in the first stage once the input image has been downsampled, because they rely on perception-driven reasoning, which requires clear visual information to reason effectively. To address this issue, we propose ERGO (Efficient Reasoning & Guided Observation), which performs reasoning-driven perception, leveraging multimodal context to determine where to focus. Our model can account for perceptual uncertainty, expanding the cropped region to cover visually ambiguous areas when answering questions. To this end, we develop simple yet effective reward components in a reinforcement learning framework for coarse-to-fine perception. Across multiple datasets, our approach delivers higher accuracy than the original model and competing methods, with greater efficiency. For instance, ERGO surpasses Qwen2.5-VL-7B on the V* benchmark by 4.7 points while using only 23% of the vision tokens, achieving a 3x inference speedup. The code and models can be found at: https://github.com/nota-github/ERGO.
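The geometry of the two-stage coarse-to-fine pipeline can be illustrated with a minimal sketch. It is not ERGO's implementation: the names `Box`, `to_full_resolution`, `expand_and_clamp`, `vision_tokens`, and `coarse_to_fine` are hypothetical, the region is assumed to come from the model's first-stage output, the fixed `margin` only stands in for the learned region expansion, and the 28-pixel token side is an assumed value for estimating vision-token counts.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned region in pixel coordinates (hypothetical helper type)."""
    left: float
    top: float
    right: float
    bottom: float


def to_full_resolution(box: Box, scale: float) -> Box:
    """Map a box found on the downsampled image back to original pixels.

    `scale` is the downsampling factor (original side / downsampled side).
    """
    return Box(box.left * scale, box.top * scale,
               box.right * scale, box.bottom * scale)


def expand_and_clamp(box: Box, margin: float, width: int, height: int) -> Box:
    """Grow the box by `margin` (a fraction of its size) on every side,
    then clamp it to the image bounds."""
    dx = (box.right - box.left) * margin
    dy = (box.bottom - box.top) * margin
    return Box(max(0.0, box.left - dx), max(0.0, box.top - dy),
               min(float(width), box.right + dx),
               min(float(height), box.bottom + dy))


def vision_tokens(width: float, height: float, token_side: int = 28) -> int:
    """Estimate the token count, assuming one token per
    `token_side` x `token_side` pixel tile (an assumption of this sketch)."""
    return math.ceil(width / token_side) * math.ceil(height / token_side)


def coarse_to_fine(width: int, height: int, scale: float,
                   coarse_box: Box, margin: float = 0.25):
    """Return the full-resolution crop and the total tokens for both stages."""
    crop = expand_and_clamp(to_full_resolution(coarse_box, scale),
                            margin, width, height)
    coarse = vision_tokens(width / scale, height / scale)        # stage 1
    fine = vision_tokens(crop.right - crop.left,
                         crop.bottom - crop.top)                 # stage 2
    return crop, coarse + fine


if __name__ == "__main__":
    # Region coordinates are given on the 4x downsampled image.
    crop, total = coarse_to_fine(2240, 2240, 4.0, Box(100, 100, 200, 200))
    print(crop, total, vision_tokens(2240, 2240))
```

Comparing the two-stage total with `vision_tokens` for the full image shows where the savings come from: the first stage sees only the downsampled image, and the second stage sees only the crop.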
