HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

Abstract

Existing multimodal search agents process target entities sequentially, issuing one tool call per entity and accumulating redundant interaction rounds whenever a query decomposes into independent sub-retrievals. We argue that effective multimodal agents should search wider rather than longer, dispatching multiple grounded queries concurrently within a single round. To this end, we present HyperEyes, a parallel multimodal search agent that fuses visual grounding and retrieval into a single atomic action, enabling concurrent search across multiple entities while treating inference efficiency as a first-class training objective. HyperEyes is trained in two stages. For cold-start supervision, we develop a Parallel-Amenable Data Synthesis Pipeline that covers visual multi-entity and textual multi-constraint queries and curates efficiency-oriented trajectories via Progressive Rejection Sampling. Building on this, our central contribution, a Dual-Grained Efficiency-Aware Reinforcement Learning framework, operates at two levels. At the macro level, we propose TRACE (Tool-use Reference-Adaptive Cost Efficiency), a trajectory-level reward whose reference is monotonically tightened during training to suppress superfluous tool calls without restricting genuine multi-hop search. At the micro level, we adapt On-Policy Distillation to inject dense token-level corrective signals from an external teacher on failed rollouts, mitigating the credit-assignment deficiency of sparse outcome rewards. Because existing benchmarks evaluate accuracy as the sole metric and omit inference cost, we introduce IMEB, a human-curated benchmark of 300 instances that jointly evaluates search capability and efficiency. Across six benchmarks, HyperEyes-30B surpasses the strongest comparable open-source agent by 9.9% in accuracy while using 5.3x fewer tool-call rounds on average.
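
The contrast between searching "longer" and searching "wider" can be illustrated with a minimal sketch. This is not the HyperEyes implementation: `grounded_search`, `sequential_agent`, `parallel_agent`, and `compare_rounds` are hypothetical names, the tool call is a stub, and the sub-retrievals are assumed to be fully independent. The sketch only shows how the interaction-round count differs when one tool call is issued per entity and when all grounded queries are dispatched within one round.

```python
import asyncio


async def grounded_search(entity: str) -> str:
    """Hypothetical stand-in for one grounded retrieval tool call."""
    await asyncio.sleep(0)  # placeholder for real tool latency
    return f"result for {entity}"


async def sequential_agent(entities):
    """Issue one tool call per entity; each call costs one interaction round."""
    results, rounds = [], 0
    for entity in entities:
        results.append(await grounded_search(entity))
        rounds += 1
    return results, rounds


async def parallel_agent(entities):
    """Dispatch all independent sub-retrievals concurrently in a single round."""
    results = await asyncio.gather(*(grounded_search(e) for e in entities))
    rounds = 1 if entities else 0
    return list(results), rounds


def compare_rounds(entities):
    """Run both strategies and return their results and round counts."""
    seq_results, seq_rounds = asyncio.run(sequential_agent(entities))
    par_results, par_rounds = asyncio.run(parallel_agent(entities))
    return seq_results, seq_rounds, par_results, par_rounds
```

Under these assumptions, both strategies return the same results for a query with three independent entities, but the sequential agent spends three rounds and the parallel agent spends one. Queries that require genuine multi-hop search would still need more than one round, since a later query depends on an earlier result.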
