

LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding

July 22, 2024
Authors: Haoning Wu, Dongxu Li, Bei Chen, Junnan Li
cs.AI

Abstract

Large multimodal models (LMMs) are processing increasingly longer and richer inputs. Despite this progress, few public benchmarks are available to measure such development. To mitigate this gap, we introduce LongVideoBench, a question-answering benchmark that features video-language interleaved inputs up to an hour long. Our benchmark includes 3,763 varying-length web-collected videos with their subtitles across diverse themes, designed to comprehensively evaluate LMMs on long-term multimodal understanding. To achieve this, we interpret the primary challenge as accurately retrieving and reasoning over detailed multimodal information from long inputs. As such, we formulate a novel video question-answering task termed referring reasoning. Specifically, the question contains a referring query that references related video context, called the referred context. The model is then required to reason over relevant video details from the referred context. Following the paradigm of referring reasoning, we curate 6,678 human-annotated multiple-choice questions in 17 fine-grained categories, establishing one of the most comprehensive benchmarks for long-form video understanding. Evaluations suggest that LongVideoBench presents significant challenges even for the most advanced proprietary models (e.g., GPT-4o, Gemini-1.5-Pro, GPT-4-Turbo), while their open-source counterparts show an even larger performance gap. In addition, our results indicate that performance on the benchmark improves only when models can process more frames, positioning LongVideoBench as a valuable benchmark for evaluating future-generation long-context LMMs.
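To make the referring-reasoning format concrete, one item pairs a referring query (which points the model at a span of the video, the referred context) with a multiple-choice question about details inside that context. The sketch below is purely illustrative: the field names, example content, and scoring helper are hypothetical and do not reflect the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ReferringReasoningItem:
    """One hypothetical multiple-choice item in the referring-reasoning style."""
    video_id: str          # which web-collected video the item targets
    referring_query: str   # points the model at the referred context in the video
    question: str          # asks about details within that referred context
    choices: list[str]     # multiple-choice options
    answer_index: int      # index of the correct choice
    category: str          # one of the 17 fine-grained categories

def is_correct(item: ReferringReasoningItem, predicted_index: int) -> bool:
    """Score a single prediction by exact match on the chosen option."""
    return predicted_index == item.answer_index

# Example with made-up content:
item = ReferringReasoningItem(
    video_id="example_video",
    referring_query="In the scene where the narrator shows the city map,",
    question="which landmark does the narrator point to first?",
    choices=["The museum", "The bridge", "The station", "The harbor"],
    answer_index=1,
    category="(one of the 17 categories)",
)
print(is_correct(item, 1))  # prints True
```

Framing each item this way separates localization (finding the referred context in an hour-long input) from reasoning (answering the question), which is the distinction the benchmark is designed to probe.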

