
Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos

February 26, 2026
作者: Ziqi Gao, Jieyu Zhang, Wisdom Oluchi Ikezogwo, Jae Sung Park, Tario G. You, Daniel Ogbu, Chenhao Zheng, Weikai Huang, Yinuo Yang, Winson Han, Quan Kong, Rajat Saini, Ranjay Krishna
cs.AI

Abstract

We introduce Synthetic Visual Genome 2 (SVG2), a large-scale panoptic video scene graph dataset. SVG2 contains over 636K videos with 6.6M objects, 52.0M attributes, and 6.7M relations, providing an order-of-magnitude increase in scale and diversity over prior spatio-temporal scene graph datasets. To create SVG2, we design a fully automated pipeline that combines multi-scale panoptic segmentation, online-offline trajectory tracking with automatic new-object discovery, per-trajectory semantic parsing, and GPT-5-based spatio-temporal relation inference. Building on this resource, we train TRaSER, a video scene graph generation model. TRaSER augments VLMs with a trajectory-aligned token arrangement mechanism and two new modules, an object-trajectory resampler and a temporal-window resampler, to convert raw videos and panoptic trajectories into compact spatio-temporal scene graphs in a single forward pass. The temporal-window resampler binds visual tokens to short trajectory segments to preserve local motion and temporal semantics, while the object-trajectory resampler aggregates entire trajectories to maintain global context for objects. On the PVSG, VIPSeg, VidOR, and SVG2 test sets, TRaSER improves relation detection by 15-20%, object prediction by 30-40% over the strongest open-source baselines (and by 13% over GPT-5), and attribute prediction by 15%. When TRaSER's generated scene graphs are supplied to a VLM for video question answering, they deliver a 1.5-4.6% absolute accuracy gain over using video alone or video augmented with scene graphs generated by Qwen2.5-VL, demonstrating the utility of explicit spatio-temporal scene graphs as an intermediate representation.
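The two resamplers can be pictured as cross-attention pooling over an object's trajectory tokens: the temporal-window resampler compresses each short segment of frames to keep local motion cues, while the object-trajectory resampler compresses the whole trajectory into a fixed-size global summary. The sketch below is a minimal, frozen-weight illustration of that idea only; the shapes, window length, and query counts are illustrative assumptions, and the actual TRaSER modules use learned queries trained jointly with the VLM.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def resample(tokens, queries):
    # cross-attention pooling: each query attends over all tokens
    # and returns a weighted mixture of them
    attn = softmax(queries @ tokens.T / np.sqrt(tokens.shape[1]))
    return attn @ tokens

rng = np.random.default_rng(0)
# hypothetical input: one object trajectory of 32 frames, 256-dim tokens
traj = rng.normal(size=(32, 256))

# temporal-window resampler: pool each 8-frame window into 2 tokens,
# preserving local motion and temporal semantics per segment
win_queries = rng.normal(size=(2, 256))
window_tokens = np.concatenate(
    [resample(traj[t:t + 8], win_queries) for t in range(0, 32, 8)]
)  # 4 windows x 2 tokens -> (8, 256)

# object-trajectory resampler: pool the entire trajectory into a
# fixed-size set of tokens summarizing the object's global context
obj_queries = rng.normal(size=(4, 256))
object_tokens = resample(traj, obj_queries)  # (4, 256)

print(window_tokens.shape, object_tokens.shape)
```

Either output can then be arranged alongside the VLM's visual tokens, which is what gives the model a compact, trajectory-aligned view of each object.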