OmniVideo-100K: A Dataset for Audio-Visual Reasoning with Structured Scripts and Evidence Chains

Abstract

Current automated pipelines for audio-visual question answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for the audio and visual modalities. This decoupled processing severs the inherent associations between sounds and their visual sources, and processing clips independently often yields inconsistent descriptions of the same entity across segments. Furthermore, coupling long-text comprehension and QA synthesis into a single step tends to restrict models to localized events, yielding questions that lack long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms: (1) Entity-Anchored Video Scripting transforms videos into structured scripts comprising summaries, main entity lists, and segment-wise audio-visual descriptions. The entity list serves as a global prior that ensures cross-segment referential consistency and reconstructs audio-visual associations. (2) Clue-Guided QA Generation prompts models to first mine cross-segment, multimodal clues from the script and then generate QA pairs based on these high-value clues. Leveraging this pipeline, we construct the instruction-tuning dataset OmniVideo-100K and a human-verified test set, OmniVideo-Test. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B, and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test and improvements of up to 12.64% on established benchmarks such as Daily-Omni and JointAVBench, demonstrating strong generalization.
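To make the two mechanisms more concrete, the following is a minimal Python sketch of what a structured script and a simple cross-segment clue filter might look like. It is an illustration under stated assumptions, not the authors' implementation: the class names `Segment` and `VideoScript`, the field names, the entity-ID scheme (`E1`, `E2`), the `cross_segment_clues` helper, and the example video are all hypothetical, and the abstract does not specify how clues are actually mined (it describes prompting a model rather than a rule-based filter).

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One time-bounded clip with paired audio and visual descriptions (hypothetical schema)."""
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    visual: str   # what is seen
    audio: str    # what is heard
    entity_ids: list[str] = field(default_factory=list)  # entities referenced in this segment


@dataclass
class VideoScript:
    """Structured script: summary, global entity list, segment-wise descriptions."""
    summary: str
    entities: dict[str, str]  # entity id -> canonical description, shared by all segments
    segments: list[Segment]


def cross_segment_clues(script: VideoScript, min_segments: int = 2) -> dict[str, list[int]]:
    """Map each known entity to the indices of the segments that reference it,
    keeping only entities that appear in at least `min_segments` segments."""
    hits: dict[str, list[int]] = {eid: [] for eid in script.entities}
    for i, seg in enumerate(script.segments):
        for eid in seg.entity_ids:
            if eid in hits:  # ignore ids missing from the global entity list
                hits[eid].append(i)
    return {eid: idx for eid, idx in hits.items() if len(idx) >= min_segments}


# Invented example data, for illustration only.
script = VideoScript(
    summary="A street musician plays guitar; a dog barks and later sits beside him.",
    entities={"E1": "man with an acoustic guitar", "E2": "small brown dog"},
    segments=[
        Segment(0, 10, "E1 tunes the guitar", "guitar strings being plucked", ["E1"]),
        Segment(10, 20, "E2 runs toward E1", "barking over guitar chords", ["E1", "E2"]),
        Segment(20, 30, "E2 sits beside E1", "applause", ["E1", "E2"]),
    ],
)
clues = cross_segment_clues(script)
```

Because every segment refers to entities through the shared IDs rather than free-form text, the same entity keeps one description across segments, and the entities returned in `clues` are natural candidates for questions that span several segments and both modalities.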