XR: Cross-Modal Agents for Composed Image Retrieval

Abstract

Retrieval is being redefined by agentic AI, demanding multimodal reasoning beyond conventional similarity-based paradigms. Composed Image Retrieval (CIR) exemplifies this shift: each query combines a reference image with a textual modification, so answering it requires compositional understanding across modalities. While embedding-based CIR methods have made progress, they remain narrow in perspective, capturing limited cross-modal cues and lacking semantic reasoning. To address these limitations, we introduce XR, a training-free multi-agent framework that reframes retrieval as a progressively coordinated reasoning process. It orchestrates three specialized types of agents: imagination agents synthesize target representations through cross-modal generation, similarity agents perform coarse filtering via hybrid matching, and question agents verify factual consistency through targeted reasoning for fine filtering. Through progressive multi-agent coordination, XR iteratively refines retrieval to meet both semantic and visual query constraints, achieving up to a 38% gain over strong training-free and training-based baselines on FashionIQ, CIRR, and CIRCO, while ablations show each agent is essential. Code is available at https://01yzzyu.github.io/xr.github.io/.
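To make the coarse-to-fine flow concrete, the following is a minimal toy sketch of a three-stage pipeline of the kind the abstract describes. It is not the authors' implementation: images are replaced by hypothetical attribute sets, the imagination step is a simple set edit rather than cross-modal generation, the similarity step uses Jaccard overlap rather than hybrid matching, and the question step counts simple yes/no attribute checks rather than querying a reasoning model. All names (`Query`, `imagine`, `coarse_filter`, `fine_filter`, `retrieve`) and the gallery contents are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    reference: frozenset  # toy stand-in for the reference image's content
    add: frozenset        # attributes the modification text asks for
    remove: frozenset     # attributes the modification text rules out


def imagine(query):
    """Imagination step (toy): synthesize a description of the target."""
    return (query.reference - query.remove) | query.add


def coarse_filter(target, gallery, k):
    """Similarity step (toy): rank by Jaccard overlap and keep the top k."""
    def jaccard(attrs):
        union = target | attrs
        return len(target & attrs) / len(union) if union else 0.0

    ranked = sorted(gallery, key=lambda name: jaccard(gallery[name]), reverse=True)
    return ranked[:k]


def fine_filter(query, shortlist, gallery):
    """Question step (toy): re-rank by the number of yes/no checks passed."""
    def passed(name):
        attrs = gallery[name]
        # One check per requested attribute present, one per removed attribute absent.
        return len(query.add & attrs) + len(query.remove - attrs)

    return sorted(shortlist, key=passed, reverse=True)


def retrieve(query, gallery, k=3):
    target = imagine(query)
    shortlist = coarse_filter(target, gallery, k)
    return fine_filter(query, shortlist, gallery)


# Hypothetical gallery: image id -> attribute set.
gallery = {
    "img_a": frozenset({"dress", "red", "long-sleeve"}),
    "img_b": frozenset({"dress", "blue", "long-sleeve"}),
    "img_c": frozenset({"dress", "blue", "sleeveless"}),
    "img_d": frozenset({"shirt", "green", "short-sleeve"}),
}
# "The same dress, but blue instead of red."
query = Query(
    reference=frozenset({"dress", "red", "long-sleeve"}),
    add=frozenset({"blue"}),
    remove=frozenset({"red"}),
)
print(retrieve(query, gallery))  # ['img_b', 'img_c', 'img_a']
```

In this toy run the coarse stage ties `img_a` (the unmodified reference look) with `img_c` and lists it first, while the verification stage demotes it because it fails both checks derived from the modification. This illustrates, under the stated assumptions, why a fine filter on top of similarity matching can help.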

