

SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

March 24, 2026
Authors: Haoyu Huang, Jinfa Huang, Zhongwei Wan, Xiawu Zheng, Rongrong Ji, Jiebo Luo
cs.AI

Abstract

Agentic multimodal large language models (MLLMs), such as OpenAI o3 and Gemini Agentic Vision, achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded perception, reasoning, and tool-calling loops introduce significant sequential overhead. This overhead, termed agentic depth, incurs prohibitive latency and severely limits system-level concurrency. To address this, we propose SpecEyes, an agent-level speculative acceleration framework that breaks the sequential bottleneck. Our key insight is that a lightweight, tool-free MLLM can serve as a speculative planner that predicts the execution trajectory, enabling early termination of expensive tool chains without sacrificing accuracy. To regulate this speculative planning, we introduce a cognitive gating mechanism based on answer separability, which quantifies the model's self-verification confidence without requiring oracle labels. Furthermore, we design a heterogeneous parallel funnel that exploits the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput. Extensive experiments on V* Bench, HR-Bench, and POPE demonstrate that SpecEyes achieves a 1.1-3.35x speedup over the agentic baseline while preserving or even improving accuracy (up to +6.7%), thereby boosting serving throughput under concurrent workloads.
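The abstract's control flow, where a cheap draft planner may short-circuit the expensive agentic tool loop behind a confidence gate, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`separability_gate`, `answer`), the threshold `tau`, and the interpretation of answer separability as the top-two probability margin are all assumptions for the sake of the example.

```python
# Hypothetical sketch of speculative planning with a cognitive gate.
# Assumed interface: draft_model(query) -> dict mapping candidate answers
# to probabilities; agentic_model(query) -> answer via the full tool loop.

def separability_gate(probs, tau=0.4):
    """Gate on answer separability: the margin between the two most
    probable answers. A large margin is read as self-verified confidence,
    requiring no oracle labels (tau is an illustrative threshold)."""
    top = sorted(probs.values(), reverse=True)
    margin = top[0] - (top[1] if len(top) > 1 else 0.0)
    return margin >= tau

def answer(query, draft_model, agentic_model, tau=0.4):
    # 1) Cheap, tool-free speculative pass by the lightweight planner.
    probs = draft_model(query)
    # 2) Cognitive gate: accept the draft only if answers are well separated,
    #    terminating the expensive tool chain early.
    if separability_gate(probs, tau):
        return max(probs, key=probs.get)
    # 3) Otherwise fall back to the full perception/reasoning/tool loop.
    return agentic_model(query)
```

Under this reading, the heterogeneous parallel funnel would run many such stateless draft passes concurrently, forwarding only gate failures to the stateful large model.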