MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents

Abstract

MLLM agents demonstrate potential for complex embodied tasks by retrieving multimodal task-relevant trajectory data. However, current retrieval methods primarily focus on surface-level similarities of textual or visual cues in trajectories, neglecting their effectiveness for the specific task at hand. To address this issue, we propose a novel method, MLLM as ReTriever (MART), which enhances the performance of embodied agents by using interaction data to fine-tune an MLLM retriever based on preference learning, so that the retriever fully considers the effectiveness of trajectories and prioritizes them for unseen tasks. We also introduce Trajectory Abstraction, a mechanism that leverages MLLMs' summarization capabilities to represent trajectories with fewer tokens while preserving key information, enabling agents to better comprehend milestones in the trajectory. Experimental results across various environments demonstrate that our method significantly improves task success rates in unseen scenes compared to baseline methods. This work presents a new paradigm for multimodal retrieval in embodied agents by fine-tuning a general-purpose MLLM as the retriever to assess trajectory effectiveness. All benchmark task sets and simulator code modifications for action and observation spaces will be released.
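The abstract does not specify the training objective, so the following is only a minimal sketch of what preference-based retriever training commonly looks like, assuming a Bradley-Terry style pairwise loss over scalar trajectory scores. The function names `pairwise_preference_loss` and `rank_trajectories`, and the idea that the retriever emits one score per trajectory, are illustrative assumptions and not the authors' implementation.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(score_preferred - score_rejected)).

    Assumption: the retriever assigns a scalar score to each trajectory, and
    interaction data tells us which of two trajectories was more effective.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def rank_trajectories(scores: dict[str, float]) -> list[str]:
    """Return trajectory ids ordered from highest to lowest retriever score."""
    return sorted(scores, key=lambda trajectory_id: scores[trajectory_id], reverse=True)


if __name__ == "__main__":
    # Hypothetical scores: the trajectory that helped the agent should score higher.
    print(pairwise_preference_loss(2.0, 0.0))  # small loss: ordering is correct
    print(pairwise_preference_loss(0.0, 2.0))  # larger loss: ordering is wrong
    print(rank_trajectories({"traj_a": 0.1, "traj_b": 0.9, "traj_c": 0.5}))
```

Under this assumption, minimizing the loss pushes the score of the more effective trajectory above the less effective one, so ranking by score at inference time favors trajectories that are useful for the task and not merely similar on the surface.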
