Dynamic Reflections: Probing Video Representations with Text Alignment
November 4, 2025
Authors: Tyler Zhu, Tengda Han, Leonidas Guibas, Viorica Pătrăucean, Maks Ovsjanikov
cs.AI
Abstract
The alignment of representations from different modalities has recently been shown to provide insights into the structural similarities and downstream capabilities of encoders across diverse data types. While significant progress has been made in aligning images with text, the temporal nature of video data remains largely unexplored in this context. In this work, we conduct the first comprehensive study of video-text representation alignment, probing the capabilities of modern video and language encoders. Our findings reveal several key insights. First, we demonstrate that cross-modal alignment depends strongly on the richness of both the visual data (static images vs. multi-frame videos) and the text data (a single caption vs. a collection of captions) provided at test time, especially when using state-of-the-art video encoders. We propose parametric test-time scaling laws that capture this behavior and show remarkable predictive power against empirical observations. Second, we investigate the correlation between semantic alignment and performance on both semantic and non-semantic downstream tasks, providing initial evidence that strong alignment with text encoders may be linked to general-purpose video representation and understanding. Finally, we correlate temporal reasoning with cross-modal alignment, providing a challenging test bed for vision and language models. Overall, our work introduces video-text alignment as an informative zero-shot way to probe the representational power of different encoders for spatio-temporal data. The project page can be found at https://video-prh.github.io/
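The abstract does not specify how alignment is measured; as a hedged illustration, one common metric for this kind of zero-shot cross-modal probing is a mutual k-nearest-neighbor score: for each paired sample, find its nearest neighbors separately in the video and text embedding spaces and average the overlap between the two neighborhoods. The sketch below is a minimal NumPy version of that idea; the function names and toy data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def knn_indices(feats: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nearest neighbors of each row under cosine similarity."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # exclude each sample from its own neighborhood
    return np.argsort(-sim, axis=1)[:, :k]

def mutual_knn_alignment(video_embs: np.ndarray, text_embs: np.ndarray,
                         k: int = 10) -> float:
    """Average fraction of neighbors shared between the two embedding spaces.

    video_embs[i] and text_embs[i] are assumed to come from the same
    (video, caption) pair; higher scores mean stronger cross-modal
    alignment. The two spaces may have different dimensionalities.
    """
    v_nn = knn_indices(video_embs, k)
    t_nn = knn_indices(text_embs, k)
    overlaps = [len(set(v) & set(t)) / k for v, t in zip(v_nn, t_nn)]
    return float(np.mean(overlaps))

# Toy usage with random embeddings (in practice: frozen video/text encoder outputs).
rng = np.random.default_rng(0)
video_embs = rng.standard_normal((100, 512))
text_embs = rng.standard_normal((100, 384))
print(f"mutual-kNN alignment: {mutual_knn_alignment(video_embs, text_embs):.3f}")
```

In a setup like this, the richer test-time inputs the abstract describes would enter through the embeddings themselves, e.g. pooling per-frame features for multi-frame videos or per-caption features for a collection of captions before computing the score.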