OmniVideo-100K: An Audio-Visual Reasoning Dataset via Structured Scripts and Evidence Chains

Abstract

Current automated pipelines for audio-visual Question Answering (QA) generally adopt a "video-caption-QA" paradigm. These methods typically segment videos into short clips and generate separate descriptions for the audio and visual modalities. This decoupled processing severs the inherent associations between sounds and their visual sources, and processing each clip independently often produces inconsistent descriptions of the same entity across segments. Furthermore, coupling long-text comprehension and QA synthesis into a single step tends to restrict models to localized events, yielding questions that lack long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms: (1) Entity-Anchored Video Scripting transforms a video into a structured script comprising a summary, a main entity list, and segment-wise audio-visual descriptions. The entity list serves as a global prior that ensures cross-segment referential consistency and reconstructs audio-visual associations. (2) Clue-Guided QA Generation prompts the model to first mine cross-segment, multimodal clues from the script and then generate QA pairs based on these high-value clues. Leveraging this pipeline, we construct the instruction-tuning dataset OmniVideo-100K and a human-verified test set, OmniVideo-Test. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B, and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test and generalizes well to established benchmarks such as Daily-Omni and JointAVBench, with improvements of up to 12.64%.
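
To make the two mechanisms more concrete, the following is a minimal Python sketch of how a structured script and a clue-first generation step might be represented. It is an illustration under stated assumptions, not the pipeline itself: the class names (Entity, Segment, VideoScript, Clue), the field names, and the functions mine_clues and build_qa_prompt are all hypothetical. In particular, the abstract says clue mining is done by prompting a model; here a simple rule (an entity that appears in at least two segments with both visual and audio descriptions) stands in for that step.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    entity_id: str    # hypothetical stable ID shared by all segments, e.g. "E1"
    description: str  # one global description, reused to keep references consistent


@dataclass
class Segment:
    start_s: float
    end_s: float
    visual: str       # what is seen in this segment
    audio: str        # what is heard in this segment
    entity_ids: List[str] = field(default_factory=list)


@dataclass
class VideoScript:
    summary: str
    entities: List[Entity]
    segments: List[Segment]


@dataclass
class Clue:
    entity_id: str
    segment_indices: List[int]
    modalities: List[str]


def mine_clues(script: VideoScript) -> List[Clue]:
    """Rule-based stand-in for model-driven clue mining (an assumption).

    Keeps entities referenced in two or more segments whose descriptions
    cover both the visual and the audio modality.
    """
    clues = []
    for entity in script.entities:
        hits = [i for i, seg in enumerate(script.segments)
                if entity.entity_id in seg.entity_ids]
        if len(hits) < 2:
            continue  # not a cross-segment clue
        modalities = []
        if any(script.segments[i].visual for i in hits):
            modalities.append("visual")
        if any(script.segments[i].audio for i in hits):
            modalities.append("audio")
        if len(modalities) == 2:
            clues.append(Clue(entity.entity_id, hits, modalities))
    return clues


def build_qa_prompt(script: VideoScript, clue: Clue) -> str:
    """Assemble a hypothetical QA-generation prompt grounded in one clue."""
    entity = next(e for e in script.entities if e.entity_id == clue.entity_id)
    lines = [f"Summary: {script.summary}",
             f"Entity {entity.entity_id}: {entity.description}"]
    for i in clue.segment_indices:
        seg = script.segments[i]
        lines.append(f"[{seg.start_s:.0f}-{seg.end_s:.0f}s] "
                     f"visual: {seg.visual} | audio: {seg.audio}")
    lines.append("Write one question that can only be answered by "
                 "combining the segments above.")
    return "\n".join(lines)


# Toy example (invented for illustration, not taken from the dataset).
example_script = VideoScript(
    summary="A woman passes a street musician and later hears him from a cafe.",
    entities=[Entity("E1", "woman in a red coat"),
              Entity("E2", "street musician with a guitar")],
    segments=[
        Segment(0, 10, "E1 walks past E2, who is tuning a guitar",
                "street noise, guitar strings being tuned", ["E1", "E2"]),
        Segment(10, 20, "E1 enters a cafe", "door bell, background chatter", ["E1"]),
        Segment(20, 30, "through the window, E2 starts playing",
                "muffled guitar melody", ["E2"]),
    ],
)
example_clues = mine_clues(example_script)
example_prompt = build_qa_prompt(example_script, example_clues[1])
```

Separating clue mining from prompt construction mirrors the two-stage idea in the abstract: the question is written only after evidence spanning several segments and both modalities has been selected, rather than in a single pass over the full script.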