IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products
June 12, 2026
Authors: Haonan Qi, Jin Cao, Yongqi Zhang, Xintong Wang, Weidong Tang, Bin Chen, Chengfu Huo, Haojun Pan, Hengyu You, Jing Li, Yingde Wang, Liang Ding
cs.AI
Abstract
Industrial products such as valves and circuit breakers are defined by dense technical specifications that govern procurement, compatibility, and safety across supply chains. These specifications are scattered across multiple heterogeneous product images, including specification tables, nameplates, and technical drawings, yet whether Multimodal Large Language Models (MLLMs) can reliably recover them remains underexplored. To fill this gap, we introduce IndustryBench-MIPU, the first large-scale benchmark for multi-image industrial product understanding, built around structured attribute extraction, that is, recovering property-value pairs from product images. This task jointly probes text recognition on specification tables and nameplates, visual reasoning over technical drawings, domain knowledge to decode industrial terminology, and cross-image evidence integration to assemble scattered specifications. Concretely, the benchmark comprises 4,559 products, 27,652 images, and 103,703 annotations spanning 18 industrial categories, constructed through multi-model consensus and three-tier quality assurance. Evaluating nine MLLMs under both single-image and product-level multi-image settings reveals a stark completeness gap: models achieve high precision (86–94%), but the best model recovers only 49.9% of product-level attributes, and moving from single-image to multi-image extraction costs 15–34 percentage points of recall. Multi-image completeness, not single-image accuracy, is the core bottleneck. Dataset and code are publicly available.
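To make the completeness gap concrete, the following minimal sketch shows one way precision and recall over property-value pairs could be computed at the product level. It is an illustration only: the abstract does not specify the benchmark's matching protocol, so exact matching after lowercasing and whitespace normalization is an assumption, and the function names (`normalize`, `to_pairs`, `merge_images`, `precision_recall`) and the toy circuit-breaker data are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (attribute name, attribute value)


def normalize(text: str) -> str:
    # Assumed normalization: lowercase and collapse whitespace.
    return " ".join(text.strip().lower().split())


def to_pairs(attrs: Dict[str, str]) -> Set[Pair]:
    # Turn an attribute dictionary into a set of normalized pairs.
    return {(normalize(k), normalize(v)) for k, v in attrs.items()}


def merge_images(per_image: List[Dict[str, str]]) -> Set[Pair]:
    # Product-level reference set: union of the pairs annotated on each image.
    merged: Set[Pair] = set()
    for attrs in per_image:
        merged |= to_pairs(attrs)
    return merged


def precision_recall(predicted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float]:
    # Precision: share of predicted pairs that are correct.
    # Recall: share of reference pairs that were recovered.
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical product whose specifications are split across two images.
gold_per_image = [
    {"Rated Current": "63 A", "Poles": "3P"},                 # nameplate
    {"Rated Voltage": "400 V", "Breaking Capacity": "6 kA"},  # specification table
]
# Hypothetical model output that only covers the first image.
predicted = {"rated current": "63 A", "Poles": "3p"}

gold = merge_images(gold_per_image)
precision, recall = precision_recall(to_pairs(predicted), gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every predicted pair is correct, so precision is 1.00, yet only half of the product's attributes are recovered, so recall is 0.50. This is the same pattern the abstract reports at scale: high precision alongside incomplete product-level coverage.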