IndustryBench-MIPU: A Multi-Image Attribute-Value Extraction Benchmark for Industrial Products

Abstract

Industrial products such as valves and circuit breakers are defined by dense technical specifications that govern procurement, compatibility, and safety across supply chains. These specifications are scattered across multiple heterogeneous product images, including specification tables, nameplates, and technical drawings, yet whether Multimodal Large Language Models (MLLMs) can reliably recover them remains underexplored. To fill this gap, we introduce IndustryBench-MIPU, the first large-scale benchmark for multi-image industrial product understanding, built around structured attribute extraction, that is, recovering property-value pairs from product images. This task jointly probes text recognition on specification tables and nameplates, visual reasoning over technical drawings, domain knowledge to decode industrial terminology, and cross-image evidence integration to assemble scattered specifications. Concretely, the benchmark comprises 4,559 products with 27,652 images and 103,703 annotations spanning 18 industrial categories, constructed through multi-model consensus and three-tier quality assurance. Evaluating nine MLLMs under both single-image and product-level multi-image settings reveals a stark completeness gap: models achieve high precision (86% to 94%), but the best model recovers only 49.9% of product-level attributes, and moving from single-image to multi-image extraction costs 15 to 34 percentage points of recall. Multi-image completeness, not single-image accuracy, is therefore the core bottleneck. Dataset and code are publicly available.
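To make the completeness gap concrete, the following minimal sketch shows how precision and recall over property-value pairs can diverge for a single product. It is an illustration only: the exact-match rule after simple normalization, the function names `normalize` and `precision_recall`, and the example valve attributes are all assumptions made here, not the benchmark's actual scoring procedure or data.

```python
from typing import Dict, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings compare equal."""
    return " ".join(text.lower().split())


def precision_recall(predicted: Dict[str, str], gold: Dict[str, str]) -> Tuple[float, float]:
    """Score predicted property-value pairs against gold pairs.

    Assumption: a pair counts as correct only if both the property name and
    the value match exactly after normalization. The benchmark may use a
    different (for example, more lenient) matching rule.
    """
    pred_pairs = {(normalize(k), normalize(v)) for k, v in predicted.items()}
    gold_pairs = {(normalize(k), normalize(v)) for k, v in gold.items()}
    correct = len(pred_pairs & gold_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall


# Hypothetical product-level annotations, as if gathered from several images.
gold = {
    "Nominal diameter": "DN50",
    "Pressure rating": "PN16",
    "Body material": "Stainless steel 304",
    "Connection": "Flanged",
}

# Hypothetical model output: everything it reports is right, but half is missing.
predicted = {
    "nominal diameter": "DN50",
    "pressure rating": "PN16",
}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=1.00 recall=0.50
```

In this toy case every extracted pair is correct, so precision is perfect, yet only half of the product's attributes are recovered. That is the same pattern the abstract reports at scale: high precision alongside incomplete product-level recall.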