

MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents

October 4, 2024
作者: Junpeng Yue, Xinru Xu, Börje F. Karlsson, Zongqing Lu
cs.AI

Abstract

MLLM agents demonstrate potential for complex embodied tasks by retrieving multimodal, task-relevant trajectory data. However, current retrieval methods focus primarily on surface-level similarities of textual or visual cues in trajectories, neglecting their effectiveness for the specific task at hand. To address this issue, we propose a novel method, MLLM as ReTriever (MART), which enhances the performance of embodied agents by using interaction data to fine-tune an MLLM retriever with preference learning, so that the retriever fully considers the effectiveness of trajectories and prioritizes them for unseen tasks. We also introduce Trajectory Abstraction, a mechanism that leverages MLLMs' summarization capabilities to represent trajectories with fewer tokens while preserving key information, enabling agents to better comprehend milestones in the trajectory. Experimental results across various environments demonstrate that our method significantly improves task success rates in unseen scenes compared to baseline methods. This work presents a new paradigm for multimodal retrieval in embodied agents: fine-tuning a general-purpose MLLM as the retriever to assess trajectory effectiveness. All benchmark task sets and simulator code modifications for action and observation spaces will be released.
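The abstract describes fine-tuning a retriever with preference learning so that trajectories judged more effective for a task are ranked above less effective ones. The paper's own training details are not given here; as a minimal sketch, a pairwise (Bradley-Terry-style) preference loss over retriever scores captures the core idea. The function names, and the assumption that the MLLM retriever reduces to a scalar score per (task, trajectory) pair, are illustrative, not the authors' implementation.

```python
import math

def preference_loss(score_preferred, score_rejected):
    """Pairwise logistic (Bradley-Terry) loss.

    Drives the retriever to assign a higher score to the trajectory
    that interaction data showed to be more effective for the task.
    Loss -> 0 as the margin (score_preferred - score_rejected) grows.
    """
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def rank_trajectories(scores):
    """Return candidate trajectory indices ordered best-first by score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical scores an MLLM retriever might emit for three candidate
# trajectories given one unseen task; ranking picks which to put in context.
scores = [0.2, 1.7, 0.9]
order = rank_trajectories(scores)          # best trajectory first
loss = preference_loss(scores[1], scores[0])  # prefer trajectory 1 over 0
```

In practice the scalar scores would come from the fine-tuned MLLM itself (e.g., the likelihood of a "this trajectory helps" judgment), and the loss would be backpropagated through the model rather than computed on fixed floats as in this sketch.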

