WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

Abstract

To assist humans over extended periods in real homes, embodied agents must remember user routines, world states, and past interactions. Existing long-term memory benchmarks mainly evaluate language-centric retrieval and question answering, while embodied benchmarks often focus on short-horizon task execution and do not test long-term memory use in dynamic environments. We introduce WorldLines, a project-driven benchmark for long-horizon embodied household assistance. It constructs temporally extended household traces comprising dialogues, actions, execution feedback, and object and device state changes, and converts them into evidence-linked samples for Memory QA and Embodied Task Planning. We further propose ObsMem, an observer-grounded memory framework that maintains visibility-aware memories and action-native state trails to support state-aware decisions. Experiments reveal persistent challenges in handling partial observability, tracking overwritten world states, and translating long-term memory into embodied plans, while ObsMem offers a stronger reference architecture for this setting.
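The abstract does not specify ObsMem's data structures, so the following is only an illustrative sketch of how partial observability and overwritten world states can interact in a memory of the kind described. All names here (`StateEvent`, `VisibilityAwareMemory`, `record`, `believed`, `history`) are hypothetical and are not taken from WorldLines or ObsMem. The sketch assumes a trace is a sequence of state changes, each flagged with whether the agent could observe it, and that memory keeps an append-only trail per entity attribute.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class StateEvent:
    """One state change in a household trace (hypothetical schema)."""
    step: int        # position of the event in the trace
    entity: str      # object or device, e.g. "kettle"
    attribute: str   # e.g. "power" or "location"
    value: str       # value after the change
    observed: bool   # whether the agent could see this change


@dataclass
class VisibilityAwareMemory:
    """Append-only trail of observed events per (entity, attribute)."""
    trails: Dict[Tuple[str, str], List[StateEvent]] = field(default_factory=dict)

    def record(self, event: StateEvent) -> None:
        # Unobserved changes never enter memory, so beliefs can go stale.
        if event.observed:
            key = (event.entity, event.attribute)
            self.trails.setdefault(key, []).append(event)

    def believed(self, entity: str, attribute: str) -> Optional[StateEvent]:
        # The latest observed event overrides earlier ones for the same key.
        trail = self.trails.get((entity, attribute))
        return trail[-1] if trail else None

    def history(self, entity: str, attribute: str) -> List[StateEvent]:
        # Earlier (overwritten) values stay available as evidence.
        return list(self.trails.get((entity, attribute), []))


if __name__ == "__main__":
    memory = VisibilityAwareMemory()
    trace = [
        StateEvent(1, "kettle", "power", "on", observed=True),
        StateEvent(2, "kettle", "power", "off", observed=True),
        StateEvent(3, "kettle", "power", "on", observed=False),
    ]
    for event in trace:
        memory.record(event)
    belief = memory.believed("kettle", "power")
    print(belief.value, belief.step)
```

In this toy trace the agent's belief about the kettle stays at the value it last observed (step 2), even though the true state changed again at step 3 out of view. Keeping the step number alongside the value lets a planner judge how stale a belief is, and the retained trail lets a question about an earlier state be answered with linked evidence rather than only the current value.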