HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering
December 16, 2025
Authors: Dan Ben-Ami, Gabriele Serussi, Kobi Cohen, Chaim Baskin
cs.AI
Abstract
Video Large Language Models (Video-LLMs) are rapidly improving, yet current Video Question Answering (VideoQA) benchmarks often allow questions to be answered from a single salient cue, under-testing reasoning that must aggregate multiple temporally separated pieces of visual evidence. We present HERBench, a VideoQA benchmark purpose-built to assess multi-evidence integration across time. Each question requires aggregating at least three non-overlapping evidential cues across distinct video segments, so neither language priors nor a single snapshot can suffice. HERBench comprises 26K five-way multiple-choice questions organized into twelve compositional tasks that probe identity binding, cross-entity relations, temporal ordering, co-occurrence verification, and counting. To make evidential demand measurable, we introduce the Minimum Required Frame-Set (MRFS), the smallest number of frames a model must fuse to answer correctly, and show that HERBench imposes substantially higher demand than prior datasets (mean MRFS 5.5 vs. 2.6-4.2). Evaluating 13 state-of-the-art Video-LLMs on HERBench reveals pervasive failures: accuracies of 31-42% are only slightly above the 20% random-guess baseline. We disentangle this failure into two critical bottlenecks: (1) a retrieval deficit, where frame selectors overlook key evidence, and (2) a fusion deficit, where models fail to integrate information even when all necessary evidence is provided. By making cross-time evidence both unavoidable and quantifiable, HERBench establishes a principled target for advancing robust, compositional video understanding.
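The abstract defines MRFS as the smallest number of frames a model must fuse to answer correctly. One way to operationalize that definition is an exhaustive search over frame subsets of increasing size; the sketch below assumes a hypothetical `answers_correctly` predicate standing in for a full Video-LLM inference call, and is not the paper's actual measurement protocol:

```python
from itertools import combinations

def mrfs(frames, answers_correctly):
    """Minimum Required Frame-Set, operationalized as a brute-force
    subset search: the size of the smallest subset of `frames` for
    which the (hypothetical) predicate `answers_correctly` holds.
    Returns None if no subset yields a correct answer."""
    for k in range(1, len(frames) + 1):
        for subset in combinations(frames, k):
            if answers_correctly(subset):
                return k  # smallest k found first, since k increases
    return None
```

For example, if a question can only be answered once frames 2, 5, and 7 are all visible, the search returns 3. The exhaustive search is exponential in the number of frames, so any practical protocol would restrict candidate subsets or use a greedy approximation.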