RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies
March 4, 2026
Authors: Yinpei Dai, Hongze Fu, Jayjun Lee, Yuejiang Liu, Haoran Zhang, Jianing Yang, Chelsea Finn, Nima Fazeli, Joyce Chai
cs.AI
Abstract
Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, existing evaluations remain confined to narrow, non-standardized settings, which limits systematic understanding, comparison, and measurement of progress. To address these challenges, we introduce RoboMME: a large-scale, standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates temporal, spatial, object, and procedural memory. We further develop a suite of 14 memory-augmented VLA variants built on the π0.5 backbone to systematically explore different memory representations across multiple integration strategies. Experimental results show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. Videos and code can be found at our website: https://robomme.github.io.