HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

May 8, 2026
Authors: Guankai Li, Jiabin Chen, Yi Xu, Xichen Zhang, Yuan Lu
cs.AI

Abstract

Existing multimodal search agents process target entities sequentially, issuing one tool call per entity and accumulating redundant interaction rounds whenever a query decomposes into independent sub-retrievals. We argue that effective multimodal agents should search wider rather than longer: dispatching multiple grounded queries concurrently within a round. To this end, we present HyperEyes, a parallel multimodal search agent that fuses visual grounding and retrieval into a single atomic action, enabling concurrent search across multiple entities while treating inference efficiency as a first-class training objective. HyperEyes is trained in two stages. For cold-start supervision, we develop a Parallel-Amenable Data Synthesis Pipeline covering visual multi-entity and textual multi-constraint queries, curating efficiency-oriented trajectories via Progressive Rejection Sampling. Building on this, our central contribution, a Dual-Grained Efficiency-Aware Reinforcement Learning framework, operates at two levels. At the macro level, we propose TRACE (Tool-use Reference-Adaptive Cost Efficiency), a trajectory-level reward whose reference is monotonically tightened during training to suppress superfluous tool calls without restricting genuine multi-hop search. At the micro level, we adapt On-Policy Distillation to inject dense token-level corrective signals from an external teacher on failed rollouts, mitigating the credit-assignment deficiency of sparse outcome rewards. Since existing benchmarks evaluate accuracy as the sole metric, omitting inference cost, we introduce IMEB, a human-curated benchmark of 300 instances that jointly evaluates search capability and efficiency. Across six benchmarks, HyperEyes-30B surpasses the strongest comparable open-source agent by 9.9% in accuracy with 5.3x fewer tool-call rounds on average.
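The contrast between searching "longer" and searching "wider" can be illustrated with a minimal sketch. This is not the HyperEyes implementation: `search`, `sequential_rounds`, `parallel_round`, and `compare` are hypothetical names, and the tool call is simulated with a short sleep. It only shows how independent sub-retrievals might be dispatched concurrently within one round instead of one round per entity.

```python
import asyncio


async def search(entity: str) -> str:
    """Hypothetical stand-in for a single grounded search tool call."""
    await asyncio.sleep(0.01)  # simulate tool latency
    return f"result for {entity}"


async def sequential_rounds(entities):
    """One tool call per entity: the round count grows with the entity count."""
    results, rounds = [], 0
    for entity in entities:
        results.append(await search(entity))
        rounds += 1
    return results, rounds


async def parallel_round(entities):
    """All independent queries are dispatched concurrently within one round."""
    results = await asyncio.gather(*(search(e) for e in entities))
    return list(results), 1


def compare(entities):
    """Run both strategies and return (results, rounds) for each."""
    seq = asyncio.run(sequential_rounds(entities))
    par = asyncio.run(parallel_round(entities))
    return seq, par
```

Under these assumptions, both strategies return the same results in the same order, but the sequential version uses one round per entity while the parallel version uses a single round. This applies only when the sub-retrievals are independent; genuine multi-hop queries, where one result feeds the next, still require multiple rounds.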