OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains
June 12, 2026
Authors: Xinyue Cai, Chaoyou Fu, Yi-Fan Zhang, Ran He, Caifeng Shan
cs.AI
Abstract
Current automated pipelines for audio-visual Question Answering (QA) generally adopt a "video-caption-QA" paradigm. These methods typically segment videos into short clips and generate separate descriptions for the audio and visual modalities. This decoupled processing severs the inherent associations between sounds and their visual sources, and processing clips independently often produces inconsistent descriptions of the same entity across segments. Furthermore, coupling long-text comprehension and QA synthesis into a single step tends to confine models to localized events, yielding questions that lack long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms: (1) Entity-Anchored Video Scripting transforms videos into structured scripts comprising a summary, a main entity list, and segment-wise audio-visual descriptions. The entity list serves as a global prior that ensures cross-segment referential consistency and reconstructs audio-visual associations. (2) Clue-Guided QA Generation prompts models to first mine cross-segment, multimodal clues from the script and then generate QA pairs based on these high-value clues. Leveraging this pipeline, we construct the instruction-tuning dataset OmniVideo-100K and a human-verified test set, OmniVideo-Test. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B, and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test and generalizes well to established benchmarks such as Daily-Omni and JointAVBench, with improvements of up to 12.64%.
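To make the structured script more concrete, the following is a minimal Python sketch of how a script with a summary, a main entity list, and segment-wise audio-visual descriptions might be represented. All class names, field names, and the example video are hypothetical and are not taken from the dataset's actual schema. The helper `cross_segment_entities` is a simple stand-in heuristic: it only shows why a shared entity list makes cross-segment links easy to find. The abstract describes clue mining as a model-driven step, which this function does not reproduce.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Entity:
    entity_id: str    # hypothetical stable identifier, e.g. "E1"
    description: str  # one canonical description reused in every segment


@dataclass
class Segment:
    start_s: float
    end_s: float
    visual: str       # visual description of the segment
    audio: str        # audio description of the segment
    entity_ids: list[str] = field(default_factory=list)


@dataclass
class VideoScript:
    summary: str
    entities: list[Entity]
    segments: list[Segment]


def cross_segment_entities(script: VideoScript) -> dict[str, list[int]]:
    """Map each entity id to the segment indices that mention it,
    keeping only entities that appear in two or more segments."""
    seen: dict[str, list[int]] = {e.entity_id: [] for e in script.entities}
    for idx, seg in enumerate(script.segments):
        for eid in seg.entity_ids:
            if eid in seen and idx not in seen[eid]:
                seen[eid].append(idx)
    return {eid: idxs for eid, idxs in seen.items() if len(idxs) >= 2}


# Illustrative (made-up) script for a 30-second clip.
script = VideoScript(
    summary="A performer plays a short guitar piece for an audience.",
    entities=[
        Entity("E1", "woman in a red jacket"),
        Entity("E2", "acoustic guitar"),
    ],
    segments=[
        Segment(0.0, 10.0, "E1 picks up E2", "quiet room tone", ["E1", "E2"]),
        Segment(10.0, 20.0, "audience members sit down", "applause", []),
        Segment(20.0, 30.0, "E1 strums E2", "guitar chords from E2", ["E1", "E2"]),
    ],
)

clues = cross_segment_entities(script)
print(clues)  # {'E1': [0, 2], 'E2': [0, 2]}
```

Because every segment refers to entities by the same identifier, an entity that recurs in segments 0 and 2 is a natural candidate for a question spanning both moments and both modalities, for example linking the guitar seen in the first segment to the chords heard in the last.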