Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance

Abstract

Embodied agents empowered by large language models (LLMs) have shown strong performance in household object rearrangement tasks. However, these tasks primarily focus on single-turn interactions with simplified instructions, which do not truly reflect the challenges of providing meaningful assistance to users. To provide personalized assistance, embodied agents must understand the unique semantics that users assign to the physical world (e.g., favorite cup, breakfast routine) by leveraging prior interaction history to interpret dynamic, real-world instructions. Yet the effectiveness of embodied agents in utilizing memory for personalized assistance remains largely underexplored. To address this gap, we present MEMENTO, a personalized embodied agent evaluation framework designed to comprehensively assess memory utilization capabilities for providing personalized assistance. Our framework consists of a two-stage memory evaluation process that quantifies the impact of memory utilization on task performance. This process evaluates agents' understanding of personalized knowledge in object rearrangement tasks by focusing on its role in goal interpretation: (1) the ability to identify target objects based on personal meaning (object semantics), and (2) the ability to infer object-location configurations from consistent user patterns, such as routines (user patterns). Our experiments across various LLMs reveal significant limitations in memory utilization: even frontier models like GPT-4o experience a 30.5% performance drop when required to reference multiple memories, particularly in tasks involving user patterns. These findings, along with our detailed analyses and case studies, provide valuable insights for future research in developing more effective personalized embodied agents. Project website: https://connoriginal.github.io/MEMENTO