HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

Abstract

Existing multimodal search agents process target entities sequentially, issuing one tool call per entity and accumulating redundant interaction rounds whenever a query decomposes into independent sub-retrievals. We argue that effective multimodal agents should search wider rather than longer: dispatching multiple grounded queries concurrently within a round. To this end, we present HyperEyes, a parallel multimodal search agent that fuses visual grounding and retrieval into a single atomic action, enabling concurrent search across multiple entities while treating inference efficiency as a first-class training objective. HyperEyes is trained in two stages. For cold-start supervision, we develop a Parallel-Amenable Data Synthesis Pipeline covering visual multi-entity and textual multi-constraint queries, curating efficiency-oriented trajectories via Progressive Rejection Sampling. Building on this, our central contribution, a Dual-Grained Efficiency-Aware Reinforcement Learning framework, operates at two levels. At the macro level, we propose TRACE (Tool-use Reference-Adaptive Cost Efficiency), a trajectory-level reward whose reference is monotonically tightened during training to suppress superfluous tool calls without restricting genuine multi-hop search. At the micro level, we adapt On-Policy Distillation to inject dense token-level corrective signals from an external teacher on failed rollouts, mitigating the credit-assignment deficiency of sparse outcome rewards. Since existing benchmarks evaluate accuracy as the sole metric, omitting inference cost, we introduce IMEB, a human-curated benchmark of 300 instances that jointly evaluates search capability and efficiency. Across six benchmarks, HyperEyes-30B surpasses the strongest comparable open-source agent by 9.9% in accuracy with 5.3x fewer tool-call rounds on average.
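To make the "wider rather than longer" contrast concrete, here is a minimal Python sketch. It is not the HyperEyes implementation: `search`, `sequential_agent`, and `parallel_agent` are hypothetical names introduced only for illustration, and the tool call is a stub. It shows only how dispatching independent queries concurrently reduces the number of interaction rounds from one per entity to one in total.

```python
import asyncio


async def search(query: str) -> str:
    """Hypothetical stand-in for a single retrieval tool call."""
    await asyncio.sleep(0)  # placeholder for real tool latency
    return f"result for {query}"


async def sequential_agent(queries):
    """Issues one tool call per interaction round (the 'longer' strategy)."""
    results, rounds = [], 0
    for q in queries:
        results.append(await search(q))
        rounds += 1
    return results, rounds


async def parallel_agent(queries):
    """Dispatches all independent queries in one round (the 'wider' strategy)."""
    results = await asyncio.gather(*(search(q) for q in queries))
    return list(results), 1


queries = ["entity A", "entity B", "entity C"]
seq_results, seq_rounds = asyncio.run(sequential_agent(queries))
par_results, par_rounds = asyncio.run(parallel_agent(queries))
print(seq_rounds, par_rounds)
```

Both strategies return the same results in the same order, since `asyncio.gather` preserves argument order. The sketch assumes the sub-queries are independent; a genuine multi-hop query, where one retrieval depends on the result of another, would still need more than one round.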
