STORM: Slot-based Task-aware Object-centric Representation for robotic Manipulation
January 28, 2026
Authors: Alexandre Chapin, Emmanuel Dellandréa, Liming Chen
cs.AI
Abstract
Visual foundation models provide strong perceptual features for robotics, but their dense representations lack explicit object-level structure, limiting robustness and tractability in manipulation tasks. We propose STORM (Slot-based Task-aware Object-centric Representation for robotic Manipulation), a lightweight object-centric adaptation module that augments frozen visual foundation models with a small set of semantic-aware slots for robotic manipulation. Rather than retraining large backbones, STORM employs a multi-phase training strategy: object-centric slots are first stabilized through visual-semantic pretraining using language embeddings, then jointly adapted with a downstream manipulation policy. This staged learning prevents degenerate slot formation and preserves semantic consistency while aligning perception with task objectives. Experiments on object discovery benchmarks and simulated manipulation tasks show that STORM improves generalization to visual distractors and control performance compared to directly using frozen foundation model features or training object-centric representations end-to-end. Our results highlight multi-phase adaptation as an efficient mechanism for transforming generic foundation model features into task-aware object-centric representations for robotic control.
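The abstract does not detail how slots bind to frozen backbone features. A common mechanism in slot-based object-centric models is iterative slot attention, where slots compete for feature vectors via a softmax normalized over slots and are updated as attention-weighted means. The sketch below is a minimal, stdlib-only illustration of that idea under our own assumptions (all names and dimensions are illustrative, not STORM's actual implementation, which would operate on learned projections of foundation-model patch features):

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def slot_attention(features, num_slots=2, iters=3, seed=0):
    """Toy slot-attention-style grouping (illustrative sketch).

    Each feature vector distributes its attention across slots
    (softmax over slots, so slots compete for features); each slot
    then updates to the attention-weighted mean of the features.
    Real implementations use learned query/key/value projections
    and a GRU update; those are omitted here for clarity.
    """
    rng = random.Random(seed)
    dim = len(features[0])
    # Randomly initialize the slot vectors.
    slots = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
             for _ in range(num_slots)]
    for _ in range(iters):
        # Attention logits: dot product between each feature and each slot,
        # normalized over slots so slots compete for each feature.
        attn = []
        for f in features:
            logits = [sum(fi * si for fi, si in zip(f, s)) for s in slots]
            attn.append(softmax(logits))
        # Update each slot as the weighted mean of its assigned features.
        new_slots = []
        for k in range(num_slots):
            w = [attn[i][k] for i in range(len(features))]
            total = sum(w) + 1e-8
            new_slots.append([
                sum(w[i] * features[i][d] for i in range(len(features))) / total
                for d in range(dim)
            ])
        slots = new_slots
    return slots

# Four 2-D features forming two loose clusters; two slots should
# each settle near one cluster after a few iterations.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
slots = slot_attention(feats, num_slots=2)
```

In a staged scheme like the one the abstract describes, such slot updates would first be trained against a visual-semantic alignment objective (e.g., matching slots to language embeddings) with the policy frozen, and only afterwards adapted jointly with the manipulation policy.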