
Temporal Reasoning Transfer from Text to Video

October 8, 2024
作者: Lei Li, Yuanxin Liu, Linli Yao, Peiyuan Zhang, Chenxin An, Lean Wang, Xu Sun, Lingpeng Kong, Qi Liu
cs.AI

Abstract

Video Large Language Models (Video LLMs) have shown promising capabilities in video comprehension, yet they struggle with tracking temporal changes and reasoning about temporal relationships. While previous research attributed this limitation to the ineffective temporal encoding of visual inputs, our diagnostic study reveals that video representations contain sufficient information for even small probing classifiers to achieve perfect accuracy. Surprisingly, we find that the key bottleneck in Video LLMs' temporal reasoning capability stems from the underlying LLM's inherent difficulty with temporal concepts, as evidenced by poor performance on textual temporal question-answering tasks. Building on this discovery, we introduce the Textual Temporal reasoning Transfer (T3). T3 synthesizes diverse temporal reasoning tasks in pure text format from existing image-text datasets, addressing the scarcity of video samples with complex temporal scenarios. Remarkably, without using any video data, T3 enhances LongVA-7B's temporal understanding, yielding a 5.3 absolute accuracy improvement on the challenging TempCompass benchmark, which enables our model to outperform ShareGPT4Video-8B trained on 28,000 video samples. Additionally, the enhanced LongVA-7B model achieves competitive performance on comprehensive video benchmarks. For example, it achieves a 49.7 accuracy on the Temporal Reasoning task of Video-MME, surpassing powerful large-scale models such as InternVL-Chat-V1.5-20B and VILA1.5-40B. Further analysis reveals a strong correlation between textual and video temporal task performance, validating the efficacy of transferring temporal reasoning abilities from text to video domains.
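The abstract states that T3 synthesizes temporal reasoning tasks in pure text form from existing image-text datasets. A minimal sketch of that idea is shown below: given captions assumed to describe events in chronological order, it constructs a "which happened first?" question. The function name, prompt format, and answer options are illustrative assumptions, not the paper's actual T3 pipeline.

```python
import random


def make_temporal_qa(captions, rng=None):
    """Build a 'which happened first?' question from an ordered caption list.

    `captions` is assumed to describe events in chronological order.
    This is a hypothetical illustration of text-only temporal task
    synthesis, not the paper's exact data-generation recipe.
    """
    rng = rng or random.Random(0)
    # Pick two distinct events; `i < j` means captions[i] occurred earlier.
    i, j = sorted(rng.sample(range(len(captions)), 2))
    first, later = captions[i], captions[j]
    # Shuffle the answer options so the correct label is not always "A".
    options = [first, later]
    rng.shuffle(options)
    context = " Then, ".join(captions)
    question = (
        f"Story: {context}\n"
        f"Which event happened first: (A) {options[0]} or (B) {options[1]}?"
    )
    answer = "A" if options[0] == first else "B"
    return question, answer


q, a = make_temporal_qa([
    "a man picks up a guitar",
    "he tunes the strings",
    "he plays a chord",
])
print(q)
print("Answer:", a)
```

Because the question is derived from caption order alone, such samples can be generated at scale without any video data, which is the scarcity problem the abstract says T3 addresses.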
