IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products

Abstract

Industrial products such as valves and circuit breakers are defined by dense technical specifications that govern procurement, compatibility, and safety across supply chains. These specifications are scattered across multiple heterogeneous product images, including specification tables, nameplates, and technical drawings, yet whether Multimodal Large Language Models (MLLMs) can reliably recover them remains underexplored. To fill this gap, we introduce IndustryBench-MIPU, the first large-scale benchmark for multi-image industrial product understanding, built around structured attribute extraction -- recovering property-value pairs from product images. This task jointly probes text recognition on specification tables and nameplates, visual reasoning over technical drawings, domain knowledge to decode industrial terminology, and cross-image evidence integration to assemble scattered specifications. Concretely, the benchmark comprises 4,559 products across 27,652 images with 103,703 annotations spanning 18 industrial categories, constructed through multi-model consensus and three-tier quality assurance. Evaluating nine MLLMs under both single-image and product-level multi-image settings reveals a stark completeness gap: models achieve high precision (86--94%), but the best model recovers only 49.9% of product-level attributes, and moving from single-image to multi-image extraction costs 15--34 percentage points of recall. Multi-image completeness, not single-image accuracy, is the core bottleneck. The dataset and code are publicly available.
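
To make the gap between precision and recall concrete, the following is a minimal sketch of how the two metrics could be computed over property-value pairs for a single product. It assumes exact matching after case and whitespace normalization, which is an assumption for illustration and not necessarily the benchmark's matching protocol; the function names and the valve attributes are hypothetical.

```python
from typing import Dict, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization, not the paper's)."""
    return " ".join(text.lower().split())


def precision_recall(
    predicted: Dict[str, str], gold: Dict[str, str]
) -> Tuple[float, float]:
    """Score extracted property-value pairs against gold annotations.

    A predicted pair counts as correct only if both the property name and
    the value match a gold pair after normalization.
    """
    pred_pairs = {(normalize(k), normalize(v)) for k, v in predicted.items()}
    gold_pairs = {(normalize(k), normalize(v)) for k, v in gold.items()}
    correct = len(pred_pairs & gold_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall


# Hypothetical valve: the gold attributes are spread over several images.
gold = {
    "Nominal Diameter": "DN50",
    "Pressure Rating": "PN16",
    "Body Material": "Stainless Steel",
    "Connection": "Flanged",
}
# A model that reads only the nameplate and misses the specification table.
predicted = {
    "nominal diameter": "DN50",
    "pressure rating": "PN16",
}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every extracted pair is correct, so precision is perfect, yet half of the product's attributes are never recovered. That is the pattern the abstract describes: what a model extracts is mostly right, but much of the specification is left behind.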