OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains
June 12, 2026
Authors: Xinyue Cai, Chaoyou Fu, Yi-Fan Zhang, Ran He, Caifeng Shan
cs.AI
Abstract
Current automated pipelines for audio-visual Question Answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for the audio and visual modalities. This decoupled processing severs the inherent associations between sounds and their visual sources, while independent clip processing often causes the same entity to be described inconsistently across segments. Furthermore, coupling long-text comprehension and QA synthesis into a single step often restricts models to localized events, yielding questions that lack long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms: (1) Entity-Anchored Video Scripting transforms videos into structured scripts comprising a summary, a list of main entities, and segment-wise audio-visual descriptions. The entity list serves as a global prior that ensures cross-segment referential consistency and reconstructs audio-visual associations. (2) Clue-Guided QA Generation prompts models to first mine cross-segment, multimodal clues from the script and then generate QA pairs based on these high-value clues. Leveraging this pipeline, we construct the instruction-tuning dataset OmniVideo-100K and a human-verified test set, OmniVideo-Test. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B, and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test and generalizes well to established benchmarks such as Daily-Omni and JointAVBench, with improvements of up to 12.64%.
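To make the structured script more concrete, the following is a minimal Python sketch of what such a script might look like as a data structure. The abstract does not specify a schema, so every class name, field name, and the sample video here is a hypothetical illustration. The function `cross_segment_clues` is a simple stand-in heuristic (counting entity mentions across segments) and not the model-driven clue mining the authors describe.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Entity:
    entity_id: str    # hypothetical stable id reused in every segment, e.g. "E1"
    description: str  # global description shared across segments


@dataclass
class Segment:
    start: float      # segment start time in seconds
    end: float        # segment end time in seconds
    visual: str       # visual description, referring to entities by id
    audio: str        # audio description, referring to entities by id
    entity_ids: list[str] = field(default_factory=list)


@dataclass
class VideoScript:
    summary: str
    entities: list[Entity]
    segments: list[Segment]


def cross_segment_clues(script: VideoScript, min_segments: int = 2) -> dict[str, list[int]]:
    """Map each entity id to the indices of the segments that mention it,
    keeping only entities that appear in at least `min_segments` segments.

    This is an illustrative heuristic, not the paper's clue-mining step.
    """
    mentions: dict[str, list[int]] = {e.entity_id: [] for e in script.entities}
    for idx, seg in enumerate(script.segments):
        # set() guards against an id being listed twice in one segment
        for eid in set(seg.entity_ids):
            if eid in mentions:
                mentions[eid].append(idx)
    return {eid: idxs for eid, idxs in mentions.items() if len(idxs) >= min_segments}


# Invented example data, for illustration only.
script = VideoScript(
    summary="A street musician plays guitar while a dog wanders by.",
    entities=[
        Entity("E1", "man in a red jacket holding a guitar"),
        Entity("E2", "small brown dog"),
    ],
    segments=[
        Segment(0, 10, "E1 tunes a guitar", "plucked strings from E1's guitar", ["E1"]),
        Segment(10, 20, "E2 walks past E1", "E2 barks over the guitar", ["E1", "E2"]),
        Segment(20, 30, "E1 bows to the crowd", "applause", ["E1"]),
    ],
)

clues = cross_segment_clues(script)
print(clues)  # {'E1': [0, 1, 2]}
```

Because every segment refers to the same entity ids from the global list, a question about E1 can draw on evidence from segments 0, 1, and 2 without the entity being renamed along the way, which is the kind of cross-segment consistency the entity list is meant to provide.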