

Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos

February 26, 2026
Authors: Ziqi Gao, Jieyu Zhang, Wisdom Oluchi Ikezogwo, Jae Sung Park, Tario G. You, Daniel Ogbu, Chenhao Zheng, Weikai Huang, Yinuo Yang, Winson Han, Quan Kong, Rajat Saini, Ranjay Krishna
cs.AI

Abstract

We introduce Synthetic Visual Genome 2 (SVG2), a large-scale panoptic video scene graph dataset. SVG2 contains over 636K videos with 6.6M objects, 52.0M attributes, and 6.7M relations, providing an order-of-magnitude increase in scale and diversity over prior spatio-temporal scene graph datasets. To create SVG2, we design a fully automated pipeline that combines multi-scale panoptic segmentation, online-offline trajectory tracking with automatic new-object discovery, per-trajectory semantic parsing, and GPT-5-based spatio-temporal relation inference. Building on this resource, we train TRaSER, a video scene graph generation model. TRaSER augments VLMs with a trajectory-aligned token arrangement mechanism and two new modules, an object-trajectory resampler and a temporal-window resampler, to convert raw videos and panoptic trajectories into compact spatio-temporal scene graphs in a single forward pass. The temporal-window resampler binds visual tokens to short trajectory segments to preserve local motion and temporal semantics, while the object-trajectory resampler aggregates entire trajectories to maintain global context for objects. On the PVSG, VIPSeg, VidOR, and SVG2 test sets, TRaSER improves relation detection by 15-20%, improves object prediction by 30-40% over the strongest open-source baselines and by 13% over GPT-5, and improves attribute prediction by 15%. When TRaSER's generated scene graphs are fed to a VLM for video question answering, they deliver a 1.5-4.6% absolute accuracy gain over using video alone or video augmented with Qwen2.5-VL's generated scene graphs, demonstrating the utility of explicit spatio-temporal scene graphs as an intermediate representation.
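The abstract does not detail the resamplers' internals; a minimal sketch of one plausible design is a Perceiver-style cross-attention resampler, in which a fixed set of learned latent queries attends over a variable-length token sequence to produce a fixed-size summary. Everything below (function names, dimensions, and the random stand-ins for learned queries) is a hypothetical illustration, not TRaSER's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def resample(tokens, queries):
    """Cross-attention resampler: compress a variable-length token
    sequence of shape (T, d) into a fixed set of k latent tokens
    of shape (k, d) via scaled dot-product attention."""
    d = tokens.shape[1]
    attn = softmax(queries @ tokens.T / np.sqrt(d))  # (k, T), rows sum to 1
    return attn @ tokens                             # (k, d)

# Toy example mirroring the two modules' roles: the object-trajectory
# resampler summarizes a full 120-token trajectory (global context),
# while the temporal-window resampler summarizes a short 16-token
# window (local motion). Queries would be learned; random here.
rng = np.random.default_rng(0)
d = 32
trajectory_tokens = rng.normal(size=(120, d))  # whole object trajectory
window_tokens = trajectory_tokens[:16]         # one short temporal window
latents = rng.normal(size=(8, d))              # 8 latent query tokens

obj_summary = resample(trajectory_tokens, latents)  # global object context
win_summary = resample(window_tokens, latents)      # local motion context
print(obj_summary.shape, win_summary.shape)  # (8, 32) (8, 32)
```

Regardless of input length, each resampler emits the same number of tokens, which is what lets a single forward pass over many trajectories stay compact.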