Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions
February 13, 2026
Authors: Yunheng Li, Hengrui Zhang, Meng-Hao Guo, Wenzhao Gao, Shaoyong Jia, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng
cs.AI
Abstract
Universal video understanding requires modeling fine-grained visual and audio information over time in diverse real-world scenarios. However, the performance of existing models is constrained primarily by video instruction data that reduces complex audiovisual content to single, incomplete descriptions, lacking fine-grained organization and reliable annotation. To address this, we introduce: (i) ASID-1M, an open-source collection of one million structured, fine-grained audiovisual instruction annotations with single- and multi-attribute supervision; (ii) ASID-Verify, a scalable annotation pipeline with automatic verification and refinement that enforces semantic and temporal consistency between descriptions and the corresponding audiovisual content; and (iii) ASID-Captioner, a video understanding model trained via Supervised Fine-Tuning (SFT) on ASID-1M. Experiments across seven benchmarks covering audiovisual captioning, attribute-wise captioning, caption-based QA, and caption-based temporal grounding show that ASID-Captioner improves fine-grained caption quality while reducing hallucinations and improving instruction following. It achieves state-of-the-art performance among open-source models and is competitive with Gemini-3-Pro.
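The abstract names ASID-Verify's automatic verification and refinement but does not give an interface. The following is a minimal Python sketch of how such a loop could be organized, assuming a judge model that returns a pass/fail verdict with textual feedback and a refiner model that rewrites a failing caption; the AttributeCaption/ASIDRecord schema and both callables are hypothetical illustrations, not the paper's released format or API.

    from dataclasses import dataclass, field
    from typing import Callable, List, Tuple

    @dataclass
    class AttributeCaption:
        # One attribute-level description tied to a time span (hypothetical schema).
        attribute: str       # e.g. "camera motion", "speech", "ambient sound"
        description: str     # fine-grained caption for this attribute
        start_s: float       # start of the described span, in seconds
        end_s: float         # end of the described span, in seconds

    @dataclass
    class ASIDRecord:
        # A single attribute-structured instruction sample (illustrative only).
        video_id: str
        attributes: List[AttributeCaption] = field(default_factory=list)

    def verify_and_refine(
        record: ASIDRecord,
        judge: Callable[[str, AttributeCaption], Tuple[bool, str]],
        refiner: Callable[[str, AttributeCaption, str], str],
        max_rounds: int = 3,
    ) -> ASIDRecord:
        # Judge each attribute caption against its clip; rewrite failing
        # captions using the judge's feedback and re-check, up to max_rounds.
        # Captions that never pass are dropped rather than released.
        kept = []
        for cap in record.attributes:
            for _ in range(max_rounds):
                ok, feedback = judge(record.video_id, cap)
                if ok:
                    kept.append(cap)
                    break
                cap.description = refiner(record.video_id, cap, feedback)
        record.attributes = kept
        return record

The property this sketch preserves is that every caption kept in the dataset has passed an explicit semantic and temporal consistency check, which is the mechanism the abstract credits for reduced hallucination in the resulting model.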