
Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click

November 20, 2025
Authors: Raphael Ruschel, Hardikkumar Prajapati, Awsafur Rahman, B. S. Manjunath
cs.AI

Abstract

State-of-the-art Video Scene Graph Generation (VSGG) systems provide structured visual understanding but operate as closed, feed-forward pipelines with no ability to incorporate human guidance. In contrast, promptable segmentation models such as SAM2 enable precise user interaction but lack semantic or relational reasoning. We introduce Click2Graph, the first interactive framework for Panoptic Video Scene Graph Generation (PVSG) that unifies visual prompting with spatial, temporal, and semantic understanding. From a single user cue, such as a click or bounding box, Click2Graph segments and tracks the subject across time, autonomously discovers interacting objects, and predicts <subject, object, predicate> triplets to form a temporally consistent scene graph. Our framework introduces two key components: a Dynamic Interaction Discovery Module that generates subject-conditioned object prompts, and a Semantic Classification Head that performs joint entity and predicate reasoning. Experiments on the OpenPVSG benchmark demonstrate that Click2Graph establishes a strong foundation for user-guided PVSG, showing how human prompting can be combined with panoptic grounding and relational inference to enable controllable and interpretable video scene understanding.
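As a concrete illustration of the pipeline the abstract describes, the Python sketch below traces the data flow from a single click to scene-graph triplets. Every class name, method signature, and placeholder return value here is hypothetical; it mirrors the three stages named above (promptable subject segmentation, the Dynamic Interaction Discovery Module, and the Semantic Classification Head) but is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

# All modules below are hypothetical stubs sketching the data flow from
# the abstract, not Click2Graph's released code.

Mask = List[List[int]]  # toy stand-in: binary per-pixel mask for one frame


@dataclass
class Triplet:
    subject_cls: str
    object_cls: str
    predicate: str
    frame_idx: int


class SubjectSegmenter:
    """SAM2-style promptable segmenter: one click -> subject masks over time."""

    def segment_and_track(self, video: List[Mask], click: Tuple[int, int]) -> List[Mask]:
        # Placeholder: a real system would propagate the prompted mask
        # across frames with a video segmentation model.
        return [frame for frame in video]


class InteractionDiscovery:
    """Dynamic Interaction Discovery Module: subject-conditioned object prompts."""

    def discover(self, frame: Mask, subject: Mask) -> List[Mask]:
        # Placeholder: propose masks for objects likely to be interacting
        # with the prompted subject in this frame.
        return [frame]


class SemanticHead:
    """Semantic Classification Head: joint entity and predicate reasoning."""

    def classify(self, subject: Mask, obj: Mask) -> Tuple[str, str, str]:
        # Placeholder: predict (subject class, object class, predicate).
        return ("person", "cup", "holding")


def click2graph(video: List[Mask], click: Tuple[int, int]) -> List[Triplet]:
    """End-to-end sketch: click -> tracked subject -> objects -> scene graph."""
    segmenter, discovery, head = SubjectSegmenter(), InteractionDiscovery(), SemanticHead()
    subject_masks = segmenter.segment_and_track(video, click)
    graph: List[Triplet] = []
    for t, (frame, subj) in enumerate(zip(video, subject_masks)):
        for obj in discovery.discover(frame, subj):
            s_cls, o_cls, pred = head.classify(subj, obj)
            graph.append(Triplet(s_cls, o_cls, pred, frame_idx=t))
    return graph


if __name__ == "__main__":
    toy_video = [[[0, 1], [1, 0]] for _ in range(3)]  # three tiny "frames"
    for triplet in click2graph(toy_video, click=(0, 1)):
        print(triplet)
```

A real system would back each stub with learned models (for instance, a SAM2-style tracker for the first stage); the sketch only fixes the interfaces between the stages so the per-frame loop that assembles the temporally indexed triplets is explicit.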