When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding

Abstract

Understanding videos requires more than answering open-ended questions; it demands the ability to pinpoint when events occur and how entities interact across time. While recent Video LLMs (Video Large Language Models) have achieved remarkable progress in holistic reasoning, they remain coarse in temporal perception: timestamps are encoded only implicitly, frame-level features are weak at capturing continuity, and language-vision alignment often drifts from the entities of interest. In this paper, we present Grounded VideoDiT, a Video LLM designed to overcome these limitations through three key innovations. First, a Diffusion Temporal Latent (DTL) encoder enhances boundary sensitivity and maintains temporal consistency. Second, object-grounded representations explicitly bind query entities to localized visual evidence, strengthening alignment. Third, a mixed-token scheme with discrete temporal tokens provides explicit timestamp modeling, enabling fine-grained temporal reasoning. Together, these designs equip Grounded VideoDiT with robust grounding capabilities, as validated by state-of-the-art results on Charades-STA, NExT-GQA, and multiple VideoQA benchmarks.
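To make the idea of discrete temporal tokens concrete, the following is a minimal sketch of one common way to quantize timestamps into a fixed token vocabulary. The abstract does not specify the tokenization scheme, so the `<t_k>` token format, the default of 100 bins, and the function names `time_to_token`, `token_to_time`, and `encode_span` are all assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch only: the token format "<t_k>" and num_bins=100 are
# assumptions, not details taken from the paper.

def time_to_token(t_sec: float, duration_sec: float, num_bins: int = 100) -> str:
    """Map an absolute timestamp to a discrete temporal token."""
    if duration_sec <= 0:
        raise ValueError("duration_sec must be positive")
    # Clamp to the video's extent, then normalize to [0, 1].
    frac = min(max(t_sec, 0.0), duration_sec) / duration_sec
    # The last bin absorbs frac == 1.0 so the index stays in [0, num_bins - 1].
    idx = min(int(frac * num_bins), num_bins - 1)
    return f"<t_{idx}>"


def token_to_time(token: str, duration_sec: float, num_bins: int = 100) -> float:
    """Invert the mapping, returning the center of the token's bin in seconds."""
    idx = int(token[len("<t_"):-1])
    return (idx + 0.5) / num_bins * duration_sec


def encode_span(start_sec: float, end_sec: float, duration_sec: float,
                num_bins: int = 100) -> str:
    """Render an event span as a pair of temporal tokens (start, then end)."""
    return (time_to_token(start_sec, duration_sec, num_bins)
            + time_to_token(end_sec, duration_sec, num_bins))
```

Under this assumed scheme, a grounded answer can interleave ordinary text tokens with temporal tokens, so the model predicts event boundaries as explicit vocabulary items rather than as free-form digits. The quantization error is bounded by half a bin width, which is `duration_sec / (2 * num_bins)` seconds.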
