Discovering and using Spelke segments
July 21, 2025
Authors: Rahul Venkatesh, Klemen Kotar, Lilian Naing Chen, Seungwoo Kim, Luca Thomas Wheeler, Jared Watrous, Ashley Xu, Gia Ancone, Wanhee Lee, Honglin Chen, Daniel Bear, Stefan Stojanov, Daniel Yamins
cs.AI
Abstract
Segments in computer vision are often defined by semantic considerations and
are highly dependent on category-specific conventions. In contrast,
developmental psychology suggests that humans perceive the world in terms of
Spelke objects--groupings of physical things that reliably move together when
acted on by physical forces. Spelke objects thus operate on category-agnostic
causal motion relationships which potentially better support tasks like
manipulation and planning. In this paper, we first benchmark the Spelke object
concept, introducing the SpelkeBench dataset that contains a wide variety of
well-defined Spelke segments in natural images. Next, to extract Spelke
segments from images algorithmically, we build SpelkeNet, a class of visual
world models trained to predict distributions over future motions. SpelkeNet
supports estimation of two key concepts for Spelke object discovery: (1) the
motion affordance map, identifying regions likely to move under a poke, and (2)
the expected-displacement map, capturing how the rest of the scene will move.
These concepts are used for "statistical counterfactual probing", where diverse
"virtual pokes" are applied on regions of high motion affordance, and the
resultant expected-displacement maps are used to define Spelke segments as
statistical aggregates of correlated motion statistics. We find that SpelkeNet
outperforms supervised baselines like SegmentAnything (SAM) on SpelkeBench.
Finally, we show that the Spelke concept is practically useful for downstream
applications, yielding superior performance on the 3DEditBench benchmark for
physical object manipulation when used in a variety of off-the-shelf object
manipulation models.
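
The probing procedure described above can be illustrated with a small toy sketch. This is not the authors' implementation: SpelkeNet is replaced by a hypothetical stand-in (`make_toy_model`) that simply moves every pixel sharing the poked pixel's object id, and the names `probe_segment`, `predict_flow`, `n_pokes`, and `threshold` are assumptions introduced for illustration. The sketch picks a poke location from a motion affordance map, applies several virtual pokes in random directions, and keeps the pixels whose expected displacement agrees with the poke on average.

```python
import numpy as np

def make_toy_model(object_ids):
    """Hypothetical stand-in for a learned world model (not SpelkeNet)."""
    def predict_flow(poke_yx, direction):
        # Pixels with the same object id as the poked pixel move rigidly
        # with the poke; every other pixel stays still.
        moved = object_ids == object_ids[poke_yx]
        return moved[..., None] * direction  # expected displacement, (H, W, 2)
    return predict_flow

def probe_segment(predict_flow, affordance, n_pokes=8, threshold=0.5, seed=0):
    """Aggregate expected-displacement maps over several virtual pokes."""
    rng = np.random.default_rng(seed)
    # Poke where the motion affordance is highest (first maximum on ties).
    poke_yx = np.unravel_index(np.argmax(affordance), affordance.shape)
    score = np.zeros(affordance.shape)
    for angle in rng.uniform(0.0, 2.0 * np.pi, size=n_pokes):
        direction = np.array([np.sin(angle), np.cos(angle)])  # unit (dy, dx)
        flow = predict_flow(poke_yx, direction)
        # Agreement between each pixel's displacement and the poke direction.
        score += flow @ direction
    # Pixels that move with the poke on average form the segment.
    return (score / n_pokes) > threshold

# Toy scene: 0 is immovable background, 1 and 2 are two separate objects.
scene = np.zeros((6, 6), dtype=int)
scene[1:3, 1:3] = 1
scene[3:5, 3:6] = 2
affordance = (scene > 0).astype(float)  # toy motion affordance map

mask = probe_segment(make_toy_model(scene), affordance)
```

In this toy setting the poke lands on object 1, so `mask` selects exactly the pixels of that object. A learned model would return noisy, probabilistic displacements, which is why averaging over many pokes matters in practice.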