DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules

Abstract

Monitoring complex industrial assets relies on engineer-authored symbolic rules that fire on sensor conditions and prompt technicians to perform corrective actions. The bottleneck is not detection but response: translating rules into maintenance steps requires asset-specific knowledge gained through years of practice. We investigate whether LLMs can serve as decision support for this rule-to-action step and introduce DiagnosticIQ, a benchmark of 6,690 expert-validated multiple-choice questions derived from 118 rule-action pairs across 16 asset types. We contribute (i) a symbolic-to-MCQA pipeline that normalizes rules to Disjunctive Normal Form and samples distractors using embeddings, (ii) five variants probing distinct failure modes (Pro, Pert, Verbose, Aug, Rationale), and (iii) an evaluation of 29 LLMs and 4 embedding baselines. A human evaluation (9 practitioners, mean accuracy 45.0%) confirms that DiagnosticIQ requires specialist knowledge beyond operational experience. Three findings stand out. First, the frontier has closed: the top three LLMs lie within one Macro point of each other, with Bradley-Terry Elo placing claude-opus-4-6 30 points above the next model. Second, the Pro variant exposes brittleness: every model loses 13–60% relative accuracy under distractor expansion. Third, the Aug variant exposes pattern-matching: under condition inversion, frontier models still select the original answer 49–63% of the time. The deployment bottleneck is therefore not capability but calibration: frontier models handle template-style fault detection but break under structural perturbation.
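To make the normalization step concrete, the following is a minimal sketch of converting a monitoring rule to Disjunctive Normal Form. It is an illustration under stated assumptions, not the pipeline used to build the benchmark: the tuple-based rule encoding, the function name `to_dnf`, and the sensor conditions are hypothetical, and negation is left out for brevity.

```python
from itertools import product

# Hypothetical rule encoding (an assumption for this sketch):
# an atom is a string such as "temp > 90"; a compound node is a
# tuple ("and", child, ...) or ("or", child, ...).


def to_dnf(expr):
    """Return the rule as a list of conjunctions, each a sorted tuple of atoms."""
    if isinstance(expr, str):
        # A single condition is a one-atom conjunction.
        return [(expr,)]
    op, *children = expr
    child_dnfs = [to_dnf(child) for child in children]
    if op == "or":
        # A disjunction concatenates the clauses of its children.
        clauses = [clause for dnf in child_dnfs for clause in dnf]
    elif op == "and":
        # A conjunction distributes over its children's clauses:
        # pick one clause per child and merge their atoms.
        clauses = [
            tuple(sorted({atom for clause in combo for atom in clause}))
            for combo in product(*child_dnfs)
        ]
    else:
        raise ValueError(f"unsupported operator: {op!r}")
    # Drop duplicate clauses while keeping first-seen order.
    return list(dict.fromkeys(clauses))


# Illustrative rule with made-up sensor conditions.
rule = ("and", "vibration > 7", ("or", "temp > 90", "pressure < 2"))
for clause in to_dnf(rule):
    print(" AND ".join(clause))
```

Each resulting clause is a self-contained trigger condition, so in a setup like this one each clause could be paired with the rule's corrective action to form one question stem.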