RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

Abstract

Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings. This limits systematic understanding, comparison, and measurement of progress. To address these challenges, we introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates temporal, spatial, object, and procedural memory. We further develop a suite of 14 memory-augmented VLA variants built on the π0.5 backbone to systematically explore different memory representations across multiple integration strategies. Experimental results show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. Videos and code can be found at our website https://robomme.github.io.
