
Temporal Reasoning Transfer from Text to Video

October 8, 2024
作者: Lei Li, Yuanxin Liu, Linli Yao, Peiyuan Zhang, Chenxin An, Lean Wang, Xu Sun, Lingpeng Kong, Qi Liu
cs.AI

Abstract

Video Large Language Models (Video LLMs) have shown promising capabilities in video comprehension, yet they struggle with tracking temporal changes and reasoning about temporal relationships. While previous research attributed this limitation to the ineffective temporal encoding of visual inputs, our diagnostic study reveals that video representations contain sufficient information for even small probing classifiers to achieve perfect accuracy. Surprisingly, we find that the key bottleneck in Video LLMs' temporal reasoning capability stems from the underlying LLM's inherent difficulty with temporal concepts, as evidenced by poor performance on textual temporal question-answering tasks. Building on this discovery, we introduce the Textual Temporal reasoning Transfer (T3). T3 synthesizes diverse temporal reasoning tasks in pure text format from existing image-text datasets, addressing the scarcity of video samples with complex temporal scenarios. Remarkably, without using any video data, T3 enhances LongVA-7B's temporal understanding, yielding a 5.3 absolute accuracy improvement on the challenging TempCompass benchmark, which enables our model to outperform ShareGPT4Video-8B trained on 28,000 video samples. Additionally, the enhanced LongVA-7B model achieves competitive performance on comprehensive video benchmarks. For example, it achieves a 49.7 accuracy on the Temporal Reasoning task of Video-MME, surpassing powerful large-scale models such as InternVL-Chat-V1.5-20B and VILA1.5-40B. Further analysis reveals a strong correlation between textual and video temporal task performance, validating the efficacy of transferring temporal reasoning abilities from text to video domains.
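To illustrate the core idea behind T3, here is a minimal, hypothetical sketch of how a text-only temporal reasoning task could be synthesized from time-ordered captions in an image-text dataset. The function name, question template, and task format are illustrative assumptions; the abstract does not specify T3's actual templates or task types.

```python
# Hypothetical sketch of T3-style textual temporal task synthesis.
# Assumes access to time-ordered captions (e.g., from an image-text dataset);
# the concrete templates and task taxonomy used by T3 are assumptions here.
import random

def make_order_question(captions):
    """Build a text-only 'which happened first' QA pair from ordered captions."""
    # Pick two distinct positions; `i < j` so captions[i] occurred earlier.
    i, j = sorted(random.sample(range(len(captions)), 2))
    earlier, later = captions[i], captions[j]
    # Present the events in shuffled order so the model must reason, not copy.
    shuffled = [earlier, later]
    random.shuffle(shuffled)
    question = (
        "Given these events in shuffled order:\n"
        + "\n".join(f"- {c}" for c in shuffled)
        + "\nWhich event happened first?"
    )
    return {"question": question, "answer": earlier}

captions = [
    "A man picks up a guitar.",
    "He starts playing a song.",
    "The audience applauds.",
]
qa = make_order_question(captions)
```

Pairs like `qa` could then be used as pure-text instruction-tuning data for the underlying LLM, without any video input, which is the transfer mechanism the abstract describes.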
