Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos

Abstract

We introduce Synthetic Visual Genome 2 (SVG2), a large-scale panoptic video scene graph dataset. SVG2 contains over 636K videos with 6.6M objects, 52.0M attributes, and 6.7M relations, an order-of-magnitude increase in scale and diversity over prior spatio-temporal scene graph datasets. To create SVG2, we design a fully automated pipeline that combines multi-scale panoptic segmentation, online-offline trajectory tracking with automatic new-object discovery, per-trajectory semantic parsing, and GPT-5-based spatio-temporal relation inference. Building on this resource, we train TRaSER, a video scene graph generation model. TRaSER augments VLMs with a trajectory-aligned token arrangement mechanism and two new modules, an object-trajectory resampler and a temporal-window resampler, to convert raw videos and panoptic trajectories into compact spatio-temporal scene graphs in a single forward pass. The temporal-window resampler binds visual tokens to short trajectory segments to preserve local motion and temporal semantics, while the object-trajectory resampler aggregates entire trajectories to maintain global context for objects. On the PVSG, VIPSeg, VidOR, and SVG2 test datasets, TRaSER improves relation detection by +15 to 20% and object prediction by +30 to 40% over the strongest open-source baselines, surpasses GPT-5 on object prediction by +13%, and improves attribute prediction by +15%. When TRaSER's generated scene graphs are passed to a VLM for video question answering, they deliver a +1.5 to 4.6% absolute accuracy gain over using video alone or video augmented with scene graphs generated by Qwen2.5-VL, demonstrating the utility of explicit spatio-temporal scene graphs as an intermediate representation.
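
The contrast between the two resamplers can be illustrated with a minimal sketch. This is not TRaSER's implementation: the abstract does not specify the resampler architecture, so plain mean pooling stands in for the learned modules, and the names `resample`, `pool_mean`, and `window` are hypothetical. The sketch only shows the two grouping granularities: tokens pooled per short trajectory segment (local view) versus per whole trajectory (global view).

```python
from collections import defaultdict


def pool_mean(vectors):
    """Average a non-empty list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def resample(tokens, window):
    """Group trajectory-aligned tokens at two granularities.

    tokens: list of (object_id, frame_index, feature_vector).
    Returns (per_window, per_object), where
      per_window[(object_id, window_index)] is pooled over one short segment,
      per_object[object_id] is pooled over the entire trajectory.
    Mean pooling is an assumption standing in for a learned resampler.
    """
    by_window = defaultdict(list)
    by_object = defaultdict(list)
    for obj_id, frame, feat in tokens:
        by_window[(obj_id, frame // window)].append(feat)  # local segment
        by_object[obj_id].append(feat)                     # whole trajectory
    per_window = {k: pool_mean(v) for k, v in by_window.items()}
    per_object = {k: pool_mean(v) for k, v in by_object.items()}
    return per_window, per_object


# Toy input: a "person" trajectory over four frames, a "ball" seen once.
tokens = [
    ("person", 0, [1.0, 0.0]),
    ("person", 1, [3.0, 0.0]),
    ("person", 2, [5.0, 4.0]),
    ("person", 3, [7.0, 4.0]),
    ("ball", 0, [0.0, 2.0]),
]
per_window, per_object = resample(tokens, window=2)
```

In this toy input the per-window output keeps the change in the person's features between frames 0-1 and frames 2-3, while the per-object output collapses it into a single summary vector, which mirrors the local-versus-global split described above.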
