SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs

Abstract

Despite recent successes, test-time scaling (i.e., dynamically expanding the token budget during inference as needed) remains brittle for vision-language models (VLMs): unstructured chains of thought about images entangle perception and reasoning, leading to long, disorganized contexts in which small perceptual mistakes may cascade into completely wrong answers. Moreover, expensive reinforcement learning with hand-crafted rewards is required to achieve good performance. Here, we introduce SPARC (Separating Perception And Reasoning Circuits), a modular framework that explicitly decouples visual perception from reasoning. Inspired by sequential sensory-to-cognitive processing in the brain, SPARC implements a two-stage pipeline in which the model first performs explicit visual search to localize question-relevant regions, then conditions its reasoning on those regions to produce the final answer. This separation enables independent test-time scaling with asymmetric compute allocation (e.g., prioritizing perceptual processing under distribution shift), supports selective optimization (e.g., improving the perceptual stage alone when it is the bottleneck for end-to-end performance), and accommodates compressed contexts by running global search at lower image resolutions and allocating high-resolution processing only to selected regions, thereby reducing the total visual token count and compute. Across challenging visual reasoning benchmarks, SPARC outperforms monolithic baselines and strong visual-grounding approaches. For instance, SPARC improves the accuracy of Qwen3VL-4B on the V* VQA benchmark by 6.7 percentage points, and it surpasses "thinking with images" by 4.6 percentage points on a challenging OOD task despite requiring a 200× lower token budget.
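The two-stage flow described in the abstract (low-resolution global search, then high-resolution processing of selected regions only) can be illustrated with a minimal sketch. This is not the authors' implementation: the `localize` and `reason` callables, the `downscale` factor, and the 28-pixel patch size used to estimate visual tokens are all hypothetical placeholders chosen for illustration.

```python
import math
from typing import Callable, List, Tuple

# (left, top, right, bottom) in pixels
Box = Tuple[int, int, int, int]

# Assumed side length, in pixels, covered by one visual token.
# Illustrative value only; not taken from the paper.
PATCH = 28


def visual_tokens(width: int, height: int, patch: int = PATCH) -> int:
    """Approximate the number of visual tokens for an image of this size."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def two_stage_tokens(width: int, height: int, boxes: List[Box],
                     downscale: int = 4) -> int:
    """Tokens for a low-resolution search pass plus full-resolution crops."""
    search = visual_tokens(width // downscale, height // downscale)
    crops = sum(visual_tokens(r - l, b - t) for (l, t, r, b) in boxes)
    return search + crops


def answer(image_size: Tuple[int, int], question: str,
           localize: Callable[[str, Tuple[int, int]], List[Box]],
           reason: Callable[[str, List[Box]], str],
           downscale: int = 4) -> str:
    """Stage 1 localizes on a downscaled view; stage 2 reasons over the crops."""
    width, height = image_size
    low_res_size = (width // downscale, height // downscale)
    # Stage 1 (perception): boxes are returned in low-resolution coordinates.
    low_res_boxes = localize(question, low_res_size)
    # Map the boxes back to full-resolution coordinates for cropping.
    full_boxes = [(l * downscale, t * downscale, r * downscale, b * downscale)
                  for (l, t, r, b) in low_res_boxes]
    # Stage 2 (reasoning): only the selected regions are passed on.
    return reason(question, full_boxes)
```

Under these assumptions, a 2240×1680 image costs 4800 visual tokens when processed whole, while a 4× downscaled search pass plus one 280×280 crop costs 300 + 100 = 400. Because the two stages are separate callables, each one can be given its own compute budget or be replaced independently, which is the property the abstract emphasizes.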

