Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions

Abstract

Universal video understanding requires modeling fine-grained visual and audio information over time in diverse real-world scenarios. However, the performance of existing models is constrained primarily by video-instruction data that represents complex audiovisual content as a single, incomplete description, lacking fine-grained organization and reliable annotation. To address this, we introduce: (i) **ASID-1M**, an open-source collection of one million structured, fine-grained audiovisual instruction annotations with single- and multi-attribute supervision; (ii) **ASID-Verify**, a scalable data curation pipeline for annotation, with automatic verification and refinement that enforces semantic and temporal consistency between descriptions and the corresponding audiovisual content; and (iii) **ASID-Captioner**, a video understanding model trained via Supervised Fine-Tuning (SFT) on ASID-1M. Experiments across seven benchmarks covering audiovisual captioning, attribute-wise captioning, caption-based QA, and caption-based temporal grounding show that ASID-Captioner improves fine-grained caption quality while reducing hallucinations and improving instruction following. It achieves state-of-the-art performance among open-source models and is competitive with Gemini-3-Pro.

