Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click
November 20, 2025
Authors: Raphael Ruschel, Hardikkumar Prajapati, Awsafur Rahman, B. S. Manjunath
cs.AI
Abstract
State-of-the-art Video Scene Graph Generation (VSGG) systems provide structured visual understanding but operate as closed, feed-forward pipelines with no ability to incorporate human guidance. In contrast, promptable segmentation models such as SAM2 enable precise user interaction but lack semantic or relational reasoning. We introduce Click2Graph, the first interactive framework for Panoptic Video Scene Graph Generation (PVSG) that unifies visual prompting with spatial, temporal, and semantic understanding. From a single user cue, such as a click or bounding box, Click2Graph segments and tracks the subject across time, autonomously discovers interacting objects, and predicts <subject, object, predicate> triplets to form a temporally consistent scene graph. Our framework introduces two key components: a Dynamic Interaction Discovery Module that generates subject-conditioned object prompts, and a Semantic Classification Head that performs joint entity and predicate reasoning. Experiments on the OpenPVSG benchmark demonstrate that Click2Graph establishes a strong foundation for user-guided PVSG, showing how human prompting can be combined with panoptic grounding and relational inference to enable controllable and interpretable video scene understanding.
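The three-stage pipeline described in the abstract (prompted subject tracking, subject-conditioned object discovery, joint entity/predicate classification) can be summarized in a short sketch. The Python pseudocode below is a minimal illustration only: the interfaces `segmenter`, `discovery`, `classifier`, and `click2graph` are hypothetical assumptions, not a published Click2Graph API.

```python
# Hypothetical sketch of the Click2Graph pipeline described in the abstract.
# All class and function names are illustrative assumptions; the paper does
# not publish this API.

from dataclasses import dataclass

@dataclass
class Triplet:
    subject: str    # entity class of the prompted subject
    obj: str        # entity class of a discovered interacting object
    predicate: str  # relation between subject and object

def click2graph(video, click, segmenter, discovery, classifier):
    """Build a temporally consistent scene graph from one user prompt.

    segmenter  -- promptable video segmenter (e.g. a SAM2-style tracker)
    discovery  -- Dynamic Interaction Discovery Module:
                  subject mask -> object prompts
    classifier -- Semantic Classification Head:
                  joint entity and predicate labels
    """
    graph = []
    # 1. Segment and track the prompted subject across all frames.
    subject_masks = segmenter.track(video, prompt=click)
    for t, frame in enumerate(video):
        # 2. Generate subject-conditioned prompts for interacting objects.
        object_prompts = discovery(frame, subject_masks[t])
        for prompt in object_prompts:
            object_mask = segmenter.segment(frame, prompt=prompt)
            # 3. Jointly classify both entities and their relation.
            subj_cls, obj_cls, pred = classifier(
                frame, subject_masks[t], object_mask
            )
            graph.append((t, Triplet(subj_cls, obj_cls, pred)))
    return graph
```

Note the design point the abstract emphasizes: the object prompts are conditioned on the tracked subject mask, so object discovery is autonomous after the single initial user cue, and per-frame triplets accumulate into a temporally consistent graph.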