MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents
October 4, 2024
作者: Junpeng Yue, Xinru Xu, Börje F. Karlsson, Zongqing Lu
cs.AI
Abstract
MLLM agents demonstrate potential for complex embodied tasks by retrieving
multimodal task-relevant trajectory data. However, current retrieval methods
primarily focus on surface-level similarities of textual or visual cues in
trajectories, neglecting their effectiveness for the specific task at hand. To
address this issue, we propose a novel method, MLLM as ReTriever (MART), which
enhances the performance of embodied agents by utilizing interaction data to
fine-tune an MLLM retriever based on preference learning, such that the
retriever fully considers the effectiveness of trajectories and prioritizes them
for unseen tasks. We also introduce Trajectory Abstraction, a mechanism that
leverages MLLMs' summarization capabilities to represent trajectories with
fewer tokens while preserving key information, enabling agents to better
comprehend milestones in the trajectory. Experimental results across various
environments demonstrate that our method significantly improves task success rates
in unseen scenes compared to baseline methods. This work presents a new
paradigm for multimodal retrieval in embodied agents by fine-tuning a
general-purpose MLLM as the retriever to assess trajectory effectiveness. All
benchmark task sets and simulator code modifications for action and observation
spaces will be released.
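
The abstract names preference learning over interaction data as the mechanism for tuning the retriever toward trajectory effectiveness. Below is a minimal sketch of that idea in PyTorch, assuming a pairwise (Bradley-Terry) formulation in which each interaction yields, for the same task, one trajectory that led to success and one that did not; the `retriever.score` interface and the batch layout are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the authors' code) of preference-learning fine-tuning
# for a trajectory retriever: push the retrieval score of the trajectory
# that actually helped the agent above the one that merely looked similar.
import torch
import torch.nn.functional as F

def preference_loss(score_preferred: torch.Tensor,
                    score_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss over retrieval scores."""
    return -F.logsigmoid(score_preferred - score_rejected).mean()

def train_step(retriever, optimizer, batch):
    # retriever.score(task, trajectory) -> scalar relevance score (assumed API)
    s_pos = retriever.score(batch["task"], batch["traj_effective"])
    s_neg = retriever.score(batch["task"], batch["traj_ineffective"])
    loss = preference_loss(s_pos, s_neg)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At retrieval time, the tuned scores (rather than raw textual or visual similarity) would rank candidate trajectories for an unseen task.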
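Trajectory Abstraction is described only at the level of "represent trajectories with fewer tokens while preserving key information"; one plausible reading is a summarization step like the following, where `mllm.generate`, the prompt wording, and the token budget are all hypothetical placeholders.

```python
# Hypothetical sketch of Trajectory Abstraction: an MLLM condenses a long
# multimodal trajectory into a short milestone list before scoring, so the
# retriever compares key information rather than raw, token-heavy logs.
ABSTRACT_PROMPT = (
    "Summarize this trajectory as a numbered list of milestones, keeping "
    "only key observations and actions, in at most {budget} tokens."
)

def abstract_trajectory(mllm, frames, actions, budget=128):
    # mllm.generate(images=..., text=...) is an assumed interface.
    prompt = ABSTRACT_PROMPT.format(budget=budget)
    return mllm.generate(images=frames,
                         text=prompt + "\nActions: " + " -> ".join(actions))
```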