SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Abstract

Agentic multimodal large language models (MLLMs), such as OpenAI o3 and Gemini Agentic Vision, achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded loops of perception, reasoning, and tool calling introduce significant sequential overhead. This overhead, termed agentic depth, incurs prohibitive latency and severely limits system-level concurrency. We therefore propose SpecEyes, an agentic-level speculative acceleration framework that breaks this sequential bottleneck. Our key insight is that a lightweight, tool-free MLLM can serve as a speculative planner that predicts the execution trajectory, enabling early termination of expensive tool chains without sacrificing accuracy. To regulate this speculative planning, we introduce a cognitive gating mechanism based on answer separability, which quantifies the model's confidence for self-verification without requiring oracle labels. Furthermore, we design a heterogeneous parallel funnel that exploits the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput. Extensive experiments on V* Bench, HR-Bench, and POPE demonstrate that SpecEyes achieves a 1.1-3.35x speedup over the agentic baseline while preserving or even improving accuracy (up to +6.7%), thereby boosting serving throughput under concurrent workloads.
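The gating idea described above can be illustrated with a minimal sketch. The abstract does not define answer separability, so this example assumes a simple proxy: the margin between the two highest softmax probabilities over candidate answers. The names `Draft`, `separability`, `answer_with_gate`, and the `threshold` value are all hypothetical and are not taken from the paper; the parallel funnel is not modeled here.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class Draft:
    """Output of the small, tool-free model (hypothetical structure)."""
    answer: str
    logits: Sequence[float]  # assumed: one score per candidate answer


def separability(logits: Sequence[float]) -> float:
    """Assumed proxy: gap between the top-2 softmax probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # shift by max for stability
    total = sum(exps)
    probs = sorted((e / total for e in exps), reverse=True)
    return probs[0] - probs[1] if len(probs) > 1 else 1.0


def answer_with_gate(
    query: str,
    small_model: Callable[[str], Draft],
    agentic_model: Callable[[str], str],
    threshold: float = 0.5,  # illustrative value, not from the paper
) -> Tuple[str, str]:
    """Accept the small model's draft if it is confident enough;
    otherwise fall back to the expensive tool-calling model."""
    draft = small_model(query)
    if separability(draft.logits) >= threshold:
        return draft.answer, "speculative"  # tool chain skipped
    return agentic_model(query), "agentic"  # full agentic loop
```

In this sketch no ground-truth label is consulted: the decision to skip the tool chain depends only on the small model's own score distribution, which is the property the abstract attributes to the gating mechanism.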
