What If: Understanding Motion Through Sparse Interactions
October 14, 2025
Authors: Stefan Andreas Baumann, Nick Stracke, Timy Phan, Björn Ommer
cs.AI
Abstract
Understanding the dynamics of a physical scene involves reasoning about the
diverse ways it can potentially change, especially as a result of local
interactions. We present the Flow Poke Transformer (FPT), a novel framework for
directly predicting the distribution of local motion, conditioned on sparse
interactions termed "pokes". Unlike traditional methods that typically only
enable dense sampling of a single realization of scene dynamics, FPT provides
an interpretable, directly accessible representation of multi-modal scene
motion, its dependency on physical interactions, and the inherent uncertainties
of scene dynamics. We also evaluate our model on several downstream tasks to
enable comparisons with prior methods and highlight the flexibility of our
approach. On dense face motion generation, our generic pre-trained model
surpasses specialized baselines. FPT can also be fine-tuned on strongly
out-of-distribution data, such as synthetic datasets, enabling significant
improvements over in-domain methods in articulated object motion estimation.
Additionally, directly predicting explicit motion distributions enables our
method to achieve competitive performance on tasks such as poke-conditioned
moving-part segmentation, further demonstrating the versatility of FPT.
Code and models are publicly available at
https://compvis.github.io/flow-poke-transformer.
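
To make the abstract's central idea concrete, the sketch below illustrates what an interface for "predicting the distribution of local motion, conditioned on sparse pokes" might look like. This is not the authors' implementation: the function names, the Gaussian-mixture parameterization of the flow distribution, and the random placeholder outputs (standing in for a trained transformer) are all illustrative assumptions.

```python
import numpy as np

def predict_flow_distribution(image, pokes, queries, num_modes=4, seed=0):
    """Hypothetical FPT-style interface.

    pokes:   sparse interactions, each a (position, flow_vector) pair
    queries: points at which to predict the local motion distribution
    Returns per-query parameters of a 2D Gaussian mixture over flow:
    mixture weights, mode means, and isotropic standard deviations.
    Random values stand in for the outputs of a trained model.
    """
    rng = np.random.default_rng(seed)
    q = len(queries)
    logits = rng.normal(size=(q, num_modes))            # placeholder head output
    weights = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    means = rng.normal(size=(q, num_modes, 2))          # per-mode mean flow (dx, dy)
    stds = np.abs(rng.normal(size=(q, num_modes))) + 1e-3  # per-mode spread
    return weights, means, stds

def sample_flow(weights, means, stds, seed=0):
    """Draw one flow vector per query from the predicted mixture."""
    rng = np.random.default_rng(seed)
    q, k = weights.shape
    modes = np.array([rng.choice(k, p=weights[i]) for i in range(q)])
    idx = np.arange(q)
    return means[idx, modes] + stds[idx, modes, None] * rng.normal(size=(q, 2))
```

Because the distribution is explicit rather than implicit in a single dense sample, downstream uses like uncertainty estimation or part segmentation can inspect the mixture parameters directly instead of averaging over many sampled realizations.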