ChatPaper.ai


SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs

February 6, 2026
作者: Niccolo Avogaro, Nayanika Debnath, Li Mi, Thomas Frick, Junling Wang, Zexue He, Hang Hua, Konrad Schindler, Mattia Rigotti
cs.AI

Abstract

Despite recent successes, test-time scaling - i.e., dynamically expanding the token budget during inference as needed - remains brittle for vision-language models (VLMs): unstructured chains-of-thought about images entangle perception and reasoning, leading to long, disorganized contexts where small perceptual mistakes may cascade into completely wrong answers. Moreover, expensive reinforcement learning with hand-crafted rewards is required to achieve good performance. Here, we introduce SPARC (Separating Perception And Reasoning Circuits), a modular framework that explicitly decouples visual perception from reasoning. Inspired by sequential sensory-to-cognitive processing in the brain, SPARC implements a two-stage pipeline: the model first performs explicit visual search to localize question-relevant regions, then conditions its reasoning on those regions to produce the final answer. This separation enables independent test-time scaling with asymmetric compute allocation (e.g., prioritizing perceptual processing under distribution shift), supports selective optimization (e.g., improving the perceptual stage alone when it is the bottleneck for end-to-end performance), and accommodates compressed contexts by running global search at lower image resolutions and allocating high-resolution processing only to selected regions, thereby reducing the total visual token count and compute. Across challenging visual reasoning benchmarks, SPARC outperforms monolithic baselines and strong visual-grounding approaches. For instance, SPARC improves the accuracy of Qwen3VL-4B on the V^* VQA benchmark by 6.7 percentage points, and it surpasses "thinking with images" by 4.6 points on a challenging OOD task despite requiring a 200× lower token budget.
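The two-stage pipeline and its context-compression argument can be sketched in code. The snippet below is a minimal illustration, not the authors' implementation: `visual_search`, `crop_tokens`, and `sparc_answer` are hypothetical names, the model calls are stubbed out, and the token-per-pixel ratio is an assumed placeholder. It only demonstrates the structure (low-resolution global search proposing regions, then reasoning conditioned on high-resolution crops) and the resulting reduction in visual tokens versus encoding the full image.

```python
# Hedged sketch of a SPARC-style two-stage pipeline (illustrative names,
# stubbed model calls - not the paper's actual API).
from dataclasses import dataclass

@dataclass
class Region:
    x: int  # bounding box in full-resolution pixel coordinates
    y: int
    w: int
    h: int

def visual_search(image, question, downscale=4):
    """Stage 1 (stub): search a downscaled view for question-relevant regions.

    A real implementation would prompt the VLM's perception circuit on the
    low-resolution image; here we return one placeholder center region,
    scaled back to full-resolution coordinates.
    """
    h, w = image["height"] // downscale, image["width"] // downscale
    return [Region(x=w * downscale // 4, y=h * downscale // 4,
                   w=w * downscale // 2, h=h * downscale // 2)]

def crop_tokens(region, tokens_per_pixel=1 / 1024):
    """Estimate visual tokens for a high-resolution crop (assumed ratio)."""
    return int(region.w * region.h * tokens_per_pixel)

def sparc_answer(image, question):
    """Stage 2 (stub): reason over high-res crops instead of the full image."""
    regions = visual_search(image, question)
    crop_budget = sum(crop_tokens(r) for r in regions)
    full_budget = int(image["width"] * image["height"] / 1024)
    # The reasoning circuit would be conditioned on the crops here; this stub
    # just reports the visual-token budgets to show the compression effect.
    return {"regions": len(regions),
            "crop_tokens": crop_budget,
            "full_image_tokens": full_budget}
```

Because the global search runs at low resolution and only the selected regions are encoded at high resolution, the crop budget stays well below the full-image budget - the mechanism behind the abstract's claim of reduced visual tokens and compute.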
PDF · March 16, 2026