HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

Abstract

Existing multimodal search agents process target entities sequentially, issuing one tool call per entity and accumulating redundant interaction rounds whenever a query decomposes into independent sub-retrievals. We argue that effective multimodal agents should search wider rather than longer: dispatching multiple grounded queries concurrently within a round. To this end, we present HyperEyes, a parallel multimodal search agent that fuses visual grounding and retrieval into a single atomic action, enabling concurrent search across multiple entities while treating inference efficiency as a first-class training objective. HyperEyes is trained in two stages. For cold-start supervision, we develop a Parallel-Amenable Data Synthesis Pipeline covering visual multi-entity and textual multi-constraint queries, curating efficiency-oriented trajectories via Progressive Rejection Sampling. Building on this, our central contribution, a Dual-Grained Efficiency-Aware Reinforcement Learning framework, operates at two levels. At the macro level, we propose TRACE (Tool-use Reference-Adaptive Cost Efficiency), a trajectory-level reward whose reference is monotonically tightened during training to suppress superfluous tool calls without restricting genuine multi-hop search. At the micro level, we adapt On-Policy Distillation to inject dense token-level corrective signals from an external teacher on failed rollouts, mitigating the credit-assignment deficiency of sparse outcome rewards. Since existing benchmarks evaluate accuracy as the sole metric, omitting inference cost, we introduce IMEB, a human-curated benchmark of 300 instances that jointly evaluates search capability and efficiency. Across six benchmarks, HyperEyes-30B surpasses the strongest comparable open-source agent by 9.9% in accuracy with 5.3x fewer tool-call rounds on average.
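The abstract does not give the functional form of TRACE, so the following is only a minimal sketch of the general idea it describes: a trajectory-level reward measured against a tool-round reference that can only tighten over training. Every name here (`Rollout`, `AdaptiveReference`, `efficiency_reward`, `penalty`, `floor`) and the specific update and penalty rules are assumptions made for illustration, not the authors' formulation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Rollout:
    correct: bool      # did the final answer match the ground truth?
    tool_rounds: int   # number of tool-call rounds the trajectory used


class AdaptiveReference:
    """Per-query tool-round budget that can only shrink (hypothetical rule)."""

    def __init__(self, initial: int):
        self.value = initial

    def update(self, rollouts: list[Rollout]) -> None:
        # Only correct rollouts may tighten the budget, so a query that truly
        # needs several hops never has its reference pushed below what works.
        solved = [r.tool_rounds for r in rollouts if r.correct]
        if solved:
            self.value = min(self.value, min(solved))


def efficiency_reward(
    rollout: Rollout,
    reference: int,
    penalty: float = 0.1,   # assumed cost per round beyond the reference
    floor: float = 0.2,     # assumed minimum reward for a correct answer
) -> float:
    """Outcome reward, discounted for rounds used beyond the reference."""
    if not rollout.correct:
        return 0.0
    excess = max(0, rollout.tool_rounds - reference)
    # A correct answer always scores above an incorrect one, even when wasteful.
    return max(floor, 1.0 - penalty * excess)


if __name__ == "__main__":
    ref = AdaptiveReference(initial=6)
    batch = [Rollout(True, 5), Rollout(False, 2), Rollout(True, 3)]
    ref.update(batch)
    for r in batch:
        print(r, efficiency_reward(r, ref.value))
```

In this sketch the failed two-round rollout does not lower the reference, which is one simple way to discourage superfluous calls without penalizing searches that genuinely need several hops. How the actual reference is initialized and updated would have to be taken from the full paper.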