

SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

March 24, 2026
Authors: Haoyu Huang, Jinfa Huang, Zhongwei Wan, Xiawu Zheng, Rongrong Ji, Jiebo Luo
cs.AI

Abstract

Agentic multimodal large language models (MLLMs), such as OpenAI o3 and Gemini Agentic Vision, achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded perception, reasoning, and tool-calling loops introduce significant sequential overhead. This overhead, termed agentic depth, incurs prohibitive latency and severely limits system-level concurrency. To address this, we propose SpecEyes, an agent-level speculative acceleration framework that breaks the sequential bottleneck. Our key insight is that a lightweight, tool-free MLLM can serve as a speculative planner that predicts the execution trajectory, enabling early termination of expensive tool chains without sacrificing accuracy. To regulate this speculative planning, we introduce a cognitive gating mechanism based on answer separability, which quantifies the model's self-verification confidence without requiring oracle labels. Furthermore, we design a heterogeneous parallel funnel that exploits the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput. Extensive experiments on V* Bench, HR-Bench, and POPE demonstrate that SpecEyes achieves a 1.1-3.35x speedup over the agentic baseline while preserving or even improving accuracy (up to +6.7%), thereby boosting serving throughput under concurrent workloads.
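The abstract's control flow, where a cheap draft planner may short-circuit the expensive agentic tool loop behind a confidence gate, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`separability_gate`, `answer`), the threshold `tau`, and the interpretation of answer separability as the top-two probability margin are all assumptions for the sake of the example.

```python
# Hypothetical sketch of speculative planning with a cognitive gate.
# Assumed interface: draft_model(query) -> dict mapping candidate answers
# to probabilities; agentic_model(query) -> answer via the full tool loop.

def separability_gate(probs, tau=0.4):
    """Gate on answer separability: the margin between the two most
    probable answers. A large margin is read as self-verified confidence,
    requiring no oracle labels (tau is an illustrative threshold)."""
    top = sorted(probs.values(), reverse=True)
    margin = top[0] - (top[1] if len(top) > 1 else 0.0)
    return margin >= tau

def answer(query, draft_model, agentic_model, tau=0.4):
    # 1) Cheap, tool-free speculative pass by the lightweight planner.
    probs = draft_model(query)
    # 2) Cognitive gate: accept the draft only if answers are well separated,
    #    terminating the expensive tool chain early.
    if separability_gate(probs, tau):
        return max(probs, key=probs.get)
    # 3) Otherwise fall back to the full perception/reasoning/tool loop.
    return agentic_model(query)
```

Under this reading, the heterogeneous parallel funnel would run many such stateless draft passes concurrently, forwarding only gate failures to the stateful large model.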