DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules

May 9, 2026
Authors: Devin Yasith De Silva, Dhaval Patel, Christodoulos Constantinides, Shuxin Lin, Nianjun Zhou, Paul J Adams, Sal Rosato, Nicolas Constantinides, Deborah L. McGuinness, Jayant Kalagnanam
cs.AI

Abstract

Monitoring complex industrial assets relies on engineer-authored symbolic rules that trigger based on sensor conditions and prompt technicians to perform corrective actions. The bottleneck is not detection but response: translating rules into maintenance steps requires asset-specific knowledge gained through years of practice. We investigate whether LLMs can serve as decision support for this rule-to-action step and introduce DiagnosticIQ, a benchmark of 6,690 expert-validated multiple-choice questions from 118 rule-action pairs across 16 asset types. We contribute (i) a symbolic-to-MCQA pipeline normalizing rules to Disjunctive Normal Form with embedding-based distractor sampling, (ii) five variants probing distinct failure modes (Pro, Pert, Verbose, Aug, Rationale), and (iii) a benchmark of 29 LLMs and 4 embedding baselines. A human evaluation (9 practitioners, mean accuracy 45.0%) confirms that DiagnosticIQ requires specialist knowledge beyond operational experience. Three findings stand out. The frontier has closed: the top three LLMs lie within one Macro-F1 point, with Bradley-Terry Elo placing claude-opus-4-6 30 points above the next model. Yet DiagnosticIQ Pro exposes brittleness, with every model losing 13-60% relative accuracy under distractor expansion. DiagnosticIQ Aug exposes pattern-matching: under condition inversion, frontier models still select the original answer 49-63% of the time. The deployment bottleneck is not capability but calibration: frontier models handle template-style fault detection but break under structural perturbation.
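To make the normalization step in contribution (i) concrete, the following is a minimal sketch of converting a Boolean rule to Disjunctive Normal Form. It is not the authors' pipeline: the nested-tuple rule encoding, the function names `to_nnf` and `to_dnf`, and the sensor names (`high_temp`, `low_flow`, `pump_on`) are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical rule encoding (an assumption, not the paper's format):
#   ("var", name) | ("not", rule) | ("and", r1, r2, ...) | ("or", r1, r2, ...)


def to_nnf(rule, negate=False):
    """Push negations down to the atoms (negation normal form)."""
    op = rule[0]
    if op == "var":
        # A literal is (name, polarity); polarity False means "NOT name".
        return ("lit", rule[1], not negate)
    if op == "not":
        return to_nnf(rule[1], not negate)
    # De Morgan: negating a conjunction gives a disjunction and vice versa.
    if negate:
        op = "or" if op == "and" else "and"
    return (op, *[to_nnf(child, negate) for child in rule[1:]])


def _dnf(node):
    """Return a set of conjunctive terms, each a frozenset of literals."""
    if node[0] == "lit":
        return {frozenset([(node[1], node[2])])}
    child_terms = [_dnf(child) for child in node[1:]]
    if node[0] == "or":
        # A disjunction of DNFs is the union of their terms.
        return set().union(*child_terms)
    # "and": distribute over the children's terms via a Cartesian product.
    terms = set()
    for combo in product(*child_terms):
        merged = frozenset().union(*combo)
        polarity = {}
        # Drop contradictory terms that contain both x and NOT x.
        if all(polarity.setdefault(name, pol) == pol for name, pol in merged):
            terms.add(merged)
    return terms


def to_dnf(rule):
    """Normalize a rule to DNF: an OR over AND-terms of literals."""
    return _dnf(to_nnf(rule))


# Illustrative rule: high_temp AND (low_flow OR NOT pump_on)
rule = ("and",
        ("var", "high_temp"),
        ("or", ("var", "low_flow"), ("not", ("var", "pump_on"))))
terms = to_dnf(rule)
```

Here `terms` holds two conjunctive terms, one pairing `high_temp` with `low_flow` and one pairing `high_temp` with the negation of `pump_on`. In a flat form like this, every alternative trigger condition of a rule is an explicit term, which is one plausible reason for normalizing rules before turning them into question stems.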