SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs

Abstract

Despite recent successes, test-time scaling (dynamically expanding the token budget during inference as needed) remains brittle for vision-language models (VLMs): unstructured chains-of-thought about images entangle perception and reasoning, producing long, disorganized contexts in which small perceptual mistakes can cascade into completely wrong answers. Moreover, good performance currently requires expensive reinforcement learning with hand-crafted rewards. Here, we introduce SPARC (Separating Perception And Reasoning Circuits), a modular framework that explicitly decouples visual perception from reasoning. Inspired by sequential sensory-to-cognitive processing in the brain, SPARC implements a two-stage pipeline: the model first performs explicit visual search to localize question-relevant regions, then conditions its reasoning on those regions to produce the final answer. This separation enables independent test-time scaling with asymmetric compute allocation (e.g., prioritizing perceptual processing under distribution shift), supports selective optimization (e.g., improving the perceptual stage alone when it is the bottleneck for end-to-end performance), and accommodates compressed contexts by running global search at lower image resolutions and allocating high-resolution processing only to selected regions, thereby reducing the total visual token count and compute. Across challenging visual reasoning benchmarks, SPARC outperforms monolithic baselines and strong visual-grounding approaches. For instance, SPARC improves the accuracy of Qwen3VL-4B on the V* VQA benchmark by 6.7 percentage points, and it surpasses "thinking with images" by 4.6 points on a challenging OOD task despite requiring a 200× lower token budget.
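The compressed-context idea (low-resolution global search, high-resolution processing only on selected regions) can be illustrated with simple token arithmetic. The sketch below is not the authors' implementation: the patch size of 28 pixels per token side, the 0.25 search scale, the image size, the `Box` type, and the function names are all hypothetical choices made for illustration, and the numbers are not meant to reproduce the results reported above.

```python
import math
from dataclasses import dataclass

# Assumed side length (in pixels) of the image area covered by one visual token.
# This value is illustrative only and is not taken from the paper.
PATCH = 28


@dataclass(frozen=True)
class Box:
    """A hypothetical question-relevant region, in full-resolution pixel coordinates."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top


def visual_tokens(width: int, height: int, patch: int = PATCH) -> int:
    """Token count for an image tiled into patch x patch cells (partial cells round up)."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def two_stage_tokens(width: int, height: int, boxes: list[Box],
                     search_scale: float = 0.25, patch: int = PATCH) -> int:
    """Tokens for a downscaled global search pass plus full-resolution crops."""
    search_w = max(1, round(width * search_scale))
    search_h = max(1, round(height * search_scale))
    search = visual_tokens(search_w, search_h, patch)
    crops = sum(visual_tokens(b.width, b.height, patch) for b in boxes)
    return search + crops


if __name__ == "__main__":
    full = visual_tokens(2240, 1680)
    staged = two_stage_tokens(2240, 1680, [Box(0, 0, 280, 280)])
    print(f"single pass at full resolution: {full} tokens")
    print(f"low-res search + one crop:      {staged} tokens")
```

Under these assumed values, a single full-resolution pass over a 2240×1680 image costs 4800 visual tokens, while a quarter-scale search pass plus one 280×280 crop costs 400. The saving shrinks as more or larger regions are selected, which is why the search stage's precision matters in a setup like this.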
