Dynamic Reflections: Probing Video Representations with Text Alignment
November 4, 2025
Authors: Tyler Zhu, Tengda Han, Leonidas Guibas, Viorica Pătrăucean, Maks Ovsjanikov
cs.AI
Abstract
The alignment of representations from different modalities has recently been shown to provide insights into the structural similarities and downstream capabilities of different encoders across diverse data types. While significant progress has been made in aligning images with text, the temporal nature of video data remains largely unexplored in this context. In this work, we conduct the first comprehensive study of video-text representation alignment, probing the capabilities of modern video and language encoders. Our findings reveal several key insights. First, we demonstrate that cross-modal alignment depends strongly on the richness of both the visual data (static images vs. multi-frame videos) and the text data (a single caption vs. a collection of captions) provided at test time, especially when using state-of-the-art video encoders. We propose parametric test-time scaling laws that capture this behavior and show remarkable predictive power against empirical observations. Second, we investigate the correlation between semantic alignment and performance on both semantic and non-semantic downstream tasks, providing initial evidence that strong alignment with text encoders may be linked to general-purpose video representation and understanding. Finally, we correlate temporal reasoning with cross-modal alignment, providing a challenging test-bed for vision and language models. Overall, our work introduces video-text alignment as an informative zero-shot way to probe the representation power of different encoders for spatio-temporal data. The project page can be found at https://video-prh.github.io/
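The abstract does not spell out how alignment is scored, but work in this line (the project URL references the Platonic Representation Hypothesis) typically measures it as mutual k-nearest-neighbor overlap between the two embedding spaces. The sketch below is a minimal illustration of that metric under those assumptions; the function names, the choice of cosine similarity, and the default k are hypothetical and not taken from the paper's code.

```python
# Hypothetical sketch: mutual k-NN alignment between paired video and text
# embeddings, in the spirit of the Platonic Representation Hypothesis metric.
# Assumes row i of each array embeds the same (video, caption) pair.
import numpy as np

def knn_indices(feats: np.ndarray, k: int) -> np.ndarray:
    """Row-wise indices of the k nearest neighbors under cosine similarity."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T
    np.fill_diagonal(sim, -np.inf)          # exclude self-matches
    return np.argsort(-sim, axis=1)[:, :k]  # top-k neighbors per row

def mutual_knn_alignment(video_feats: np.ndarray,
                         text_feats: np.ndarray,
                         k: int = 10) -> float:
    """Mean fraction of k-NN neighbors shared across the two spaces."""
    v_nn = knn_indices(video_feats, k)
    t_nn = knn_indices(text_feats, k)
    overlap = [len(set(v) & set(t)) / k for v, t in zip(v_nn, t_nn)]
    return float(np.mean(overlap))

# Usage: in practice the features come from frozen video/text encoders;
# random arrays here just make the sketch self-contained and runnable.
rng = np.random.default_rng(0)
video_feats = rng.standard_normal((100, 64))   # (N, D_video)
text_feats = rng.standard_normal((100, 32))    # (N, D_text)
print(mutual_knn_alignment(video_feats, text_feats, k=10))
```

Because the score only compares neighborhood structure, the two encoders need not share an embedding dimension or be trained jointly, which is what makes this usable as a zero-shot probe.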
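The parametric test-time scaling laws are likewise not given in the abstract. Purely as an illustrative assumption, one saturating family consistent with the claim that richer test-time data improves alignment with diminishing returns would model the alignment score A as a function of the number of frames F and the number of captions C:

```latex
% Illustrative assumption, not the paper's fitted law: alignment rises
% toward an asymptote A_\infty as test-time visual and textual richness grow.
A(F, C) = A_{\infty} - a\,F^{-\alpha} - b\,C^{-\beta},
\qquad a, b, \alpha, \beta > 0,
```

where A_\infty is the asymptotic alignment for a given encoder pair and a, b, alpha, beta would be fitted to the empirical curves; the paper's actual functional form may differ.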