Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos

Abstract

We introduce Synthetic Visual Genome 2 (SVG2), a large-scale panoptic video scene graph dataset. SVG2 contains over 636K videos with 6.6M objects, 52.0M attributes, and 6.7M relations, an order-of-magnitude increase in scale and diversity over prior spatio-temporal scene graph datasets. To create SVG2, we design a fully automated pipeline that combines multi-scale panoptic segmentation, online-offline trajectory tracking with automatic new-object discovery, per-trajectory semantic parsing, and GPT-5-based spatio-temporal relation inference. Building on this resource, we train TRaSER, a video scene graph generation model. TRaSER augments vision-language models (VLMs) with a trajectory-aligned token arrangement mechanism and two new modules, an object-trajectory resampler and a temporal-window resampler, to convert raw videos and panoptic trajectories into compact spatio-temporal scene graphs in a single forward pass. The temporal-window resampler binds visual tokens to short trajectory segments to preserve local motion and temporal semantics, while the object-trajectory resampler aggregates entire trajectories to maintain global context for objects. On the PVSG, VIPSeg, VidOR, and SVG2 test datasets, TRaSER improves relation detection by +15 to 20%, object prediction by +30 to 40% over the strongest open-source baselines and by +13% over GPT-5, and attribute prediction by +15%. When TRaSER's generated scene graphs are sent to a VLM for video question answering, they deliver a +1.5 to 4.6% absolute accuracy gain over using video only or video augmented with Qwen2.5-VL's generated scene graphs, demonstrating the utility of explicit spatio-temporal scene graphs as an intermediate representation.
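
To make the idea of a spatio-temporal scene graph more concrete, the following is a minimal sketch of how objects, attributes, and time-bounded relations might be held in memory. It is an illustration only: the class and field names (`ObjectTrack`, `Relation`, `VideoSceneGraph`, `triples_at`) are hypothetical and are not the SVG2 release format or TRaSER's output schema, and each trajectory is reduced to a frame span rather than per-frame panoptic masks.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ObjectTrack:
    """One tracked object. All field names are hypothetical."""
    track_id: int
    label: str
    attributes: List[str] = field(default_factory=list)
    start_frame: int = 0
    end_frame: int = 0  # inclusive


@dataclass
class Relation:
    """A directed subject -> object relation that holds over a frame span."""
    subject_id: int
    predicate: str
    object_id: int
    start_frame: int
    end_frame: int  # inclusive


@dataclass
class VideoSceneGraph:
    objects: Dict[int, ObjectTrack] = field(default_factory=dict)
    relations: List[Relation] = field(default_factory=list)

    def add_object(self, obj: ObjectTrack) -> None:
        self.objects[obj.track_id] = obj

    def add_relation(self, rel: Relation) -> None:
        # Reject relations that point at tracks the graph does not contain.
        if rel.subject_id not in self.objects or rel.object_id not in self.objects:
            raise KeyError("relation refers to an unknown track id")
        self.relations.append(rel)

    def triples_at(self, frame: int) -> List[Tuple[str, str, str]]:
        """Return (subject label, predicate, object label) triples active at a frame."""
        return [
            (self.objects[r.subject_id].label, r.predicate, self.objects[r.object_id].label)
            for r in self.relations
            if r.start_frame <= frame <= r.end_frame
        ]


def build_example() -> VideoSceneGraph:
    # Toy data invented for illustration; not taken from any dataset.
    graph = VideoSceneGraph()
    graph.add_object(ObjectTrack(0, "person", ["standing"], 0, 120))
    graph.add_object(ObjectTrack(1, "cup", ["white"], 10, 120))
    graph.add_relation(Relation(0, "holding", 1, 30, 60))
    graph.add_relation(Relation(0, "next to", 1, 10, 120))
    return graph
```

Querying `build_example().triples_at(45)` returns both toy relations, while frame 100 returns only the `next to` triple. Attaching a frame span to each relation is what distinguishes this from a single-image scene graph.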
