HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering

December 16, 2025
Authors: Dan Ben-Ami, Gabriele Serussi, Kobi Cohen, Chaim Baskin
cs.AI

Abstract

Video Large Language Models (Video-LLMs) are rapidly improving, yet current Video Question Answering (VideoQA) benchmarks often allow questions to be answered from a single salient cue, under-testing reasoning that must aggregate multiple, temporally separated pieces of visual evidence. We present HERBench, a VideoQA benchmark purpose-built to assess multi-evidence integration across time. Each question requires aggregating at least three non-overlapping evidential cues across distinct video segments, so neither language priors nor a single snapshot suffices. HERBench comprises 26K five-way multiple-choice questions organized into twelve compositional tasks that probe identity binding, cross-entity relations, temporal ordering, co-occurrence verification, and counting. To make evidential demand measurable, we introduce the Minimum Required Frame-Set (MRFS), the smallest number of frames a model must fuse to answer correctly, and show that HERBench imposes substantially higher demand than prior datasets (mean MRFS 5.5 vs. 2.6-4.2). Evaluating 13 state-of-the-art Video-LLMs on HERBench reveals pervasive failures: accuracies of 31-42% are only slightly above the 20% random-guess baseline. We disentangle this failure into two critical bottlenecks: (1) a retrieval deficit, where frame selectors overlook key evidence, and (2) a fusion deficit, where models fail to integrate information even when all necessary evidence is provided. By making cross-time evidence both unavoidable and quantifiable, HERBench establishes a principled target for advancing robust, compositional video understanding.
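
The abstract defines MRFS operationally but does not spell out an estimation procedure here. As a rough illustration only, the sketch below estimates MRFS for a single question by testing frame subsets of increasing size until the model answers correctly; the `answer_fn` interface, the candidate-frame pool, and the exhaustive subset search are all assumptions for illustration, not the authors' protocol.

```python
from itertools import combinations

def estimate_mrfs(answer_fn, frames, question, correct_answer, max_k=8):
    """Estimate the Minimum Required Frame-Set (MRFS) size for one question.

    Hypothetical procedure: test frame subsets of growing cardinality and
    return the size of the smallest subset with which the model answers
    correctly. `answer_fn(frames, question)` is an assumed interface to the
    Video-LLM under test, returning the model's chosen answer option.
    """
    for k in range(1, min(max_k, len(frames)) + 1):
        for subset in combinations(frames, k):
            if answer_fn(list(subset), question) == correct_answer:
                return k  # smallest number of frames the model must fuse
    return None  # never correct within max_k frames; MRFS exceeds the budget
```

Because the subset search is combinatorial, any practical variant would first restrict `frames` to a small candidate pool (e.g., uniformly sampled or retriever-selected) before enumerating subsets.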