IndustryBench-MIPU: A Benchmark for Multi-Image Attribute-Value Extraction from Industrial Products

Abstract

Industrial products such as valves and circuit breakers are defined by dense technical specifications that govern procurement, compatibility, and safety across supply chains. These specifications are scattered across multiple heterogeneous product images, including specification tables, nameplates, and technical drawings, yet whether Multimodal Large Language Models (MLLMs) can reliably recover them remains underexplored. To fill this gap, we introduce IndustryBench-MIPU, the first large-scale benchmark for multi-image industrial product understanding, built around structured attribute extraction (recovering property-value pairs from product images). This task jointly probes text recognition on specification tables and nameplates, visual reasoning over technical drawings, domain knowledge for decoding industrial terminology, and cross-image evidence integration for assembling scattered specifications. Concretely, the benchmark comprises 4,559 products across 27,652 images with 103,703 annotations spanning 18 industrial categories, constructed through multi-model consensus and three-tier quality assurance. Evaluating nine MLLMs under both single-image and product-level multi-image settings reveals a stark completeness gap: models achieve high precision (86-94%), but even the best model recovers only 49.9% of product-level attributes, and moving from single-image to multi-image extraction reduces recall by 15 to 34 percentage points. Multi-image completeness, not single-image accuracy, is the core bottleneck. Dataset and code are publicly available.
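
The gap between precision and recall described above can be illustrated with a minimal sketch. It assumes that predictions and ground truth are both flat property-to-value mappings and that a pair counts as correct only on exact match after case and whitespace normalization. The benchmark's actual matching protocol is not specified in this abstract, and the function names and the example valve attributes below are hypothetical.

```python
from typing import Dict, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed, simplistic normalization)."""
    return " ".join(text.lower().split())


def precision_recall(
    predicted: Dict[str, str], gold: Dict[str, str]
) -> Tuple[float, float]:
    """Compute precision and recall over normalized (property, value) pairs.

    Exact matching is an assumption made for illustration only.
    """
    pred_pairs = {(normalize(k), normalize(v)) for k, v in predicted.items()}
    gold_pairs = {(normalize(k), normalize(v)) for k, v in gold.items()}
    correct = len(pred_pairs & gold_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall


# Hypothetical product-level ground truth, assembled from several images.
gold = {
    "Nominal diameter": "DN50",
    "Rated pressure": "1.6 MPa",
    "Body material": "Stainless steel 304",
    "Connection": "Flange",
}

# Hypothetical model output: everything it reports is right, but half is missing.
predicted = {
    "nominal diameter": "DN50",
    "rated pressure": "1.6 MPa",
}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every extracted pair is correct while half of the product's attributes go unreported, which is the pattern the abstract calls a completeness gap.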