Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset
December 30, 2025
Authors: TsaiChing Ni, ZhenQi Chen, YuanFu Yang
cs.AI
Abstract
We present IMDD-1M, the first large-scale Industrial Multimodal Defect Dataset, comprising 1,000,000 aligned image-text pairs and designed to advance multimodal learning for manufacturing and quality inspection. IMDD-1M contains high-resolution real-world defects spanning over 60 material categories and more than 400 defect types, each accompanied by expert-verified annotations and fine-grained textual descriptions detailing defect location, severity, and contextual attributes. The dataset enables a wide spectrum of applications, including classification, segmentation, retrieval, captioning, and generative modeling. Building upon IMDD-1M, we train a diffusion-based vision-language foundation model from scratch, specifically tailored for industrial scenarios. The model serves as a generalizable foundation that can be efficiently adapted to specialized domains through lightweight fine-tuning: with less than 5% of the task-specific data required by dedicated expert models, it achieves comparable performance. These results highlight the potential of data-efficient foundation model adaptation for industrial inspection and generation, paving the way for scalable, domain-adaptive, and knowledge-grounded manufacturing intelligence.
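The "lightweight fine-tuning" behind the data-efficiency claim can be illustrated with a minimal, hypothetical sketch. The abstract does not specify the adaptation method; LoRA-style low-rank updates are one common lightweight scheme, assumed here purely for illustration. A frozen pretrained weight `W` is adapted via a trainable low-rank residual `B @ A`, so only a small fraction of the parameters are updated:

```python
import numpy as np

# Hypothetical LoRA-style adaptation sketch (assumption: the paper's
# fine-tuning method is unspecified; low-rank updates are one option).
# Only A and B are trained: r*(d_in + d_out) parameters instead of
# the full d_in*d_out.

rng = np.random.default_rng(0)
d_out, d_in, r = 512, 512, 8

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-init
                                            # so adaptation starts as identity

def adapted_forward(x, alpha=16.0):
    """Forward pass: frozen path plus scaled low-rank residual."""
    return W @ x + (alpha / r) * (B @ (A @ x))

full_params = W.size
lora_params = A.size + B.size
print(f"trainable fraction: {lora_params / full_params:.1%}")
```

With `r = 8` the trainable fraction is about 3%, in the same spirit as adapting a large foundation model with a small task-specific budget; the specific ranks, layers, and optimizer used by the authors are not stated in the abstract.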