VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding
January 12, 2026
Authors: Jiapeng Shi, Junke Wang, Zuyao You, Bo He, Zuxuan Wu
cs.AI
Abstract
This paper presents VideoLoom, a unified Video Large Language Model (Video LLM) for joint spatial-temporal understanding. To facilitate the development of fine-grained spatial and temporal localization capabilities, we curate LoomData-8.7k, a human-centric video dataset with temporally grounded and spatially localized captions. With this, VideoLoom achieves state-of-the-art or highly competitive performance across a variety of spatial and temporal benchmarks (e.g., 63.1 J&F on ReVOS for referring video object segmentation, and 48.3 R1@0.7 on Charades-STA for temporal grounding). In addition, we introduce LoomBench, a novel benchmark consisting of temporal, spatial, and compositional video-question pairs, enabling a comprehensive evaluation of Video LLMs from diverse aspects. Collectively, these contributions offer a universal and effective suite for joint spatial-temporal video understanding, setting a new standard in multimodal intelligence.
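For readers unfamiliar with the reported metrics, the sketch below illustrates how the temporal-grounding recall (R1@0.7 on Charades-STA) and the region-similarity J component of the J&F score (ReVOS) are conventionally computed. This is a minimal illustration of the standard metric definitions, not code from the paper: the function names are ours, and the boundary F-measure half of J&F is omitted for brevity.

```python
import numpy as np

def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, ground_truths, threshold=0.7):
    """R1@0.7: fraction of queries whose top-1 predicted segment
    overlaps the ground-truth segment with tIoU >= threshold."""
    hits = [temporal_iou(p, g) >= threshold
            for p, g in zip(top1_preds, ground_truths)]
    return float(np.mean(hits))

def region_jaccard(pred_mask, gt_mask):
    """J: intersection-over-union between binary segmentation masks;
    averaged over frames/objects, and combined with the boundary
    F-measure, it yields the J&F score reported for ReVOS."""
    pred_mask, gt_mask = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred_mask, gt_mask).sum()
    if union == 0:
        return 1.0  # both masks empty: count as a perfect match
    return float(np.logical_and(pred_mask, gt_mask).sum() / union)
```

In practice, both quantities are averaged over the full evaluation set, so the reported 48.3 R1@0.7 means that for 48.3% of Charades-STA queries the model's top-ranked segment overlaps the annotation with temporal IoU of at least 0.7.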