IndustryBench-MIPU: A Benchmark for Multi-Image Attribute-Value Extraction from Industrial Products

Abstract

Industrial products such as valves and circuit breakers are defined by dense technical specifications that govern procurement, compatibility, and safety across supply chains. These specifications are scattered across multiple heterogeneous product images, including specification tables, nameplates, and technical drawings, yet whether Multimodal Large Language Models (MLLMs) can reliably recover them remains underexplored. To fill this gap, we introduce IndustryBench-MIPU, the first large-scale benchmark for multi-image industrial product understanding. It is built around structured attribute extraction, that is, recovering attribute-value pairs from product images. The task jointly probes text recognition on specification tables and nameplates, visual reasoning over technical drawings, domain knowledge for decoding industrial terminology, and cross-image evidence integration for assembling scattered specifications. Concretely, the benchmark comprises 4,559 products, 27,652 images, and 103,703 annotations spanning 18 industrial categories, constructed through multi-model consensus and three-tier quality assurance. Evaluating nine MLLMs under both single-image and product-level multi-image settings reveals a stark completeness gap: models achieve high precision (86-94%), but even the best model recovers only 49.9% of product-level attributes, and moving from single-image to multi-image extraction costs 15-34 percentage points of recall. Multi-image completeness, not single-image accuracy, is the core bottleneck. The dataset and code are publicly available.
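To make the gap between precision and recall concrete, the following is a minimal sketch of how product-level scores could be computed over attribute-value pairs. The abstract does not specify the benchmark's matching rule, so exact match after lowercasing and whitespace normalization is an assumption here, and all function names (`normalize`, `merge_images`, `precision_recall`) and the sample data are hypothetical illustrations rather than the authors' evaluation code.

```python
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed matching rule)."""
    return " ".join(text.lower().split())


def to_pairs(attrs: Dict[str, str]) -> Set[Pair]:
    """Turn an attribute dict into a set of normalized (name, value) pairs."""
    return {(normalize(k), normalize(v)) for k, v in attrs.items()}


def merge_images(per_image: Iterable[Dict[str, str]]) -> Set[Pair]:
    """Union the pairs found on each image into one product-level set."""
    merged: Set[Pair] = set()
    for attrs in per_image:
        merged |= to_pairs(attrs)
    return merged


def precision_recall(predicted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float]:
    """Precision = correct / predicted, recall = correct / gold."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical circuit breaker whose attributes are spread over two images.
gold = merge_images([
    {"Rated Voltage": "400 V", "Rated Current": "63 A"},  # nameplate
    {"Poles": "3P", "Breaking Capacity": "10 kA"},        # specification table
])

# A model output that is fully correct but incomplete.
predicted = to_pairs({"rated voltage": "400 V", "poles": "3P"})

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every predicted pair is correct, yet only half of the product's attributes are recovered, which mirrors the pattern the abstract reports: high precision alongside low product-level recall.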