

LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding

July 22, 2024
Authors: Haoning Wu, Dongxu Li, Bei Chen, Junnan Li
cs.AI

Abstract

Large multimodal models (LMMs) are processing increasingly longer and richer inputs. Despite this progress, few public benchmarks are available to measure such development. To mitigate this gap, we introduce LongVideoBench, a question-answering benchmark that features video-language interleaved inputs up to an hour long. Our benchmark includes 3,763 varying-length web-collected videos with their subtitles across diverse themes, designed to comprehensively evaluate LMMs on long-term multimodal understanding. To achieve this, we interpret the primary challenge as accurately retrieving and reasoning over detailed multimodal information from long inputs. Accordingly, we formulate a novel video question-answering task termed referring reasoning. Specifically, the question contains a referring query that references related video contexts, called the referred context; the model is then required to reason over relevant video details from that referred context. Following the paradigm of referring reasoning, we curate 6,678 human-annotated multiple-choice questions in 17 fine-grained categories, establishing one of the most comprehensive benchmarks for long-form video understanding. Evaluations suggest that LongVideoBench presents significant challenges even for the most advanced proprietary models (e.g., GPT-4o, Gemini-1.5-Pro, GPT-4-Turbo), while their open-source counterparts show an even larger performance gap. In addition, our results indicate that model performance on the benchmark improves only when models are capable of processing more frames, positioning LongVideoBench as a valuable benchmark for evaluating future-generation long-context LMMs.
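
To make the task format concrete, below is a minimal Python sketch of what a referring-reasoning item and an interleaved video-language prompt might look like. All field names, the class, and the helper function are illustrative assumptions for exposition, not the benchmark's actual release schema or evaluation code.

```python
# Hypothetical sketch of a referring-reasoning item and interleaved prompt
# assembly. Names (ReferringReasoningItem, build_interleaved_prompt, field
# names) are assumptions, not LongVideoBench's released schema.
from dataclasses import dataclass


@dataclass
class ReferringReasoningItem:
    video_id: str            # web-collected source video
    duration_s: float        # up to ~3600 s (one hour)
    category: str            # one of the 17 fine-grained categories
    referring_query: str     # points the model at the referred context
    question: str            # question about details in that context
    options: list[str]       # multiple-choice candidates
    answer_index: int        # index of the correct option


def build_interleaved_prompt(frames: list[str], subtitles: list[str],
                             item: ReferringReasoningItem) -> list[dict]:
    """Interleave sampled frames with their subtitles, then append the
    referring query, question, and lettered answer choices."""
    messages = []
    for frame, subtitle in zip(frames, subtitles):
        messages.append({"type": "image", "data": frame})
        messages.append({"type": "text", "data": subtitle})
    choices = "\n".join(f"({chr(65 + i)}) {opt}"
                        for i, opt in enumerate(item.options))
    messages.append({
        "type": "text",
        "data": f"{item.referring_query} {item.question}\n{choices}",
    })
    return messages
```

Under this framing, the abstract's observation that performance improves only with larger frame budgets corresponds to passing more (frame, subtitle) pairs into the prompt: the referred context is more likely to survive sampling when the model can consume more of the hour-long input.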
