Discovering and using Spelke segments

July 21, 2025
Authors: Rahul Venkatesh, Klemen Kotar, Lilian Naing Chen, Seungwoo Kim, Luca Thomas Wheeler, Jared Watrous, Ashley Xu, Gia Ancone, Wanhee Lee, Honglin Chen, Daniel Bear, Stefan Stojanov, Daniel Yamins
cs.AI

Abstract

Segments in computer vision are often defined by semantic considerations and are highly dependent on category-specific conventions. In contrast, developmental psychology suggests that humans perceive the world in terms of Spelke objects: groupings of physical things that reliably move together when acted on by physical forces. Spelke objects thus operate on category-agnostic causal motion relationships, which potentially better support tasks like manipulation and planning. In this paper, we first benchmark the Spelke object concept, introducing the SpelkeBench dataset, which contains a wide variety of well-defined Spelke segments in natural images. Next, to extract Spelke segments from images algorithmically, we build SpelkeNet, a class of visual world models trained to predict distributions over future motions. SpelkeNet supports estimation of two key concepts for Spelke object discovery: (1) the motion affordance map, identifying regions likely to move under a poke, and (2) the expected displacement map, capturing how the rest of the scene will move. These concepts are used for "statistical counterfactual probing", where diverse "virtual pokes" are applied to regions of high motion affordance, and the resulting expected displacement maps are used to define Spelke segments as statistical aggregates of correlated motion statistics. We find that SpelkeNet outperforms supervised baselines like SegmentAnything (SAM) on SpelkeBench. Finally, we show that the Spelke concept is practically useful for downstream applications, yielding superior performance on the 3DEditBench benchmark for physical object manipulation when used in a variety of off-the-shelf object manipulation models.
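
The probing loop described in the abstract can be sketched roughly as follows. This is only one plausible reading, not the authors' implementation: `toy_world_model` is a hypothetical stand-in for SpelkeNet's expected displacement prediction (it moves every pixel of the poked object rigidly with the poke), and the names `discover_segment`, `n_pokes`, and `threshold`, as well as the choice to score pixels by the dot product between their displacement and the poke direction, are assumptions made for illustration.

```python
import numpy as np

def toy_world_model(object_ids, poke_yx, poke_vec):
    """Hypothetical stand-in for an expected displacement map.

    Pixels sharing the poked pixel's object id move with the poke;
    background (id 0) never moves. Returns an (H, W, 2) flow field.
    """
    flow = np.zeros(object_ids.shape + (2,))
    poked = object_ids[poke_yx]
    if poked != 0:
        flow[object_ids == poked] = poke_vec
    return flow

def discover_segment(affordance, predict_flow, n_pokes=8, threshold=0.5):
    """Aggregate motion over several virtual pokes into one segment mask.

    affordance   : (H, W) map, higher means more likely to move if poked
    predict_flow : callable (poke_yx, poke_vec) -> (H, W, 2) displacement
    """
    # Poke where the affordance map is highest (assumed selection rule).
    poke_yx = np.unravel_index(np.argmax(affordance), affordance.shape)

    # Poke the same point in evenly spaced directions.
    angles = np.linspace(0.0, 2.0 * np.pi, n_pokes, endpoint=False)
    score = np.zeros(affordance.shape)
    for angle in angles:
        poke_vec = np.array([np.sin(angle), np.cos(angle)])  # unit (dy, dx)
        flow = predict_flow(poke_yx, poke_vec)
        # Per-pixel agreement between the displacement and the poke.
        score += flow @ poke_vec
    score /= n_pokes

    # Pixels that consistently move with the poke form the segment.
    return score > threshold

# Toy scene with two objects on a static background.
ids = np.zeros((6, 6), dtype=int)
ids[1:3, 1:3] = 1
ids[3:5, 3:6] = 2
affordance = (ids != 0).astype(float)

mask = discover_segment(affordance, lambda yx, v: toy_world_model(ids, yx, v))
print(mask.astype(int))
```

In this toy scene the first high-affordance pixel lies on object 1, so the mask covers that object only. With a learned model the predicted displacements would be noisy and probabilistic, which is where averaging over many pokes would matter.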