

Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding

September 15, 2025
Authors: Meng Luo, Shengqiong Wu, Liqiang Jing, Tianjie Ju, Li Zheng, Jinxiang Lai, Tianlong Wu, Xinya Du, Jian Li, Siyuan Yan, Jiebo Luo, William Yang Wang, Hao Fei, Mong-Li Lee, Wynne Hsu
cs.AI

Abstract

Recent advancements in large video models (LVMs) have significantly enhanced video understanding. However, these models continue to suffer from hallucinations, producing content that conflicts with the input videos. To address this issue, we propose Dr.V, a hierarchical framework covering the perceptive, temporal, and cognitive levels to diagnose video hallucination by fine-grained spatial-temporal grounding. Dr.V comprises two key components: a benchmark dataset, Dr.V-Bench, and a satellite video agent, Dr.V-Agent. Dr.V-Bench includes 10,000 instances drawn from 4,974 videos spanning diverse tasks, each enriched with detailed spatial-temporal annotations. Dr.V-Agent detects hallucinations in LVMs by systematically applying fine-grained spatial-temporal grounding at the perceptive and temporal levels, followed by cognitive-level reasoning. This step-by-step pipeline mirrors human-like video comprehension and effectively identifies hallucinations. Extensive experiments demonstrate that Dr.V-Agent is effective in diagnosing hallucination while enhancing interpretability and reliability, offering a practical blueprint for robust video understanding in real-world scenarios. All our data and code are available at https://github.com/Eurekaleo/Dr.V.
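The abstract describes a step-by-step diagnosis pipeline that checks a model's claims first at the perception level (are the mentioned objects actually in the video?), then at the temporal level (does the claimed time interval make sense?), and only then passes to cognitive reasoning. A minimal illustrative sketch of that level-by-level gating idea follows; the `Claim`, `Diagnosis`, and `diagnose` names and their interfaces are our own assumptions for illustration, not the paper's actual API.

```python
# Hypothetical sketch of a hierarchical perception -> temporal -> cognition
# hallucination check, in the spirit of the Dr.V-Agent pipeline described
# above. All names and interfaces here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Claim:
    """One statement extracted from an LVM's description of a video."""
    text: str
    objects: list     # entities the claim mentions (perception level)
    interval: tuple   # (start_s, end_s) the claim refers to (temporal level)

@dataclass
class Diagnosis:
    level: str        # "perception", "temporal", or "cognition"
    hallucinated: bool
    reason: str

def diagnose(claim, detected_objects, video_duration):
    """Check a claim level by level, stopping at the first failed level."""
    # Perception level: every mentioned object must be grounded in the video.
    missing = [o for o in claim.objects if o not in detected_objects]
    if missing:
        return Diagnosis("perception", True, f"objects not in video: {missing}")
    # Temporal level: the claimed interval must lie inside the video.
    start, end = claim.interval
    if start < 0 or end > video_duration or start >= end:
        return Diagnosis("temporal", True, f"invalid interval {claim.interval}")
    # Cognition level: both lower levels are grounded, so higher-level
    # reasoning can proceed (a real agent would verify the inference here).
    return Diagnosis("cognition", False, "claim grounded at both lower levels")
```

For example, a claim mentioning a "cat" in a video where only a dog was detected fails at the perception level, while a claim about seconds 8–12 of a 10-second video fails at the temporal level; only claims passing both gates reach cognitive-level checking.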
PDF: September 16, 2025