DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules

Abstract

Monitoring complex industrial assets relies on engineer-authored symbolic rules that trigger on sensor conditions and prompt technicians to perform corrective actions. The bottleneck is not detection but response: translating rules into maintenance steps requires asset-specific knowledge gained through years of practice. We investigate whether LLMs can serve as decision support for this rule-to-action step and introduce DiagnosticIQ, a benchmark of 6,690 expert-validated multiple-choice questions generated from 118 rule-action pairs across 16 asset types. We contribute (i) a symbolic-to-MCQA pipeline that normalizes rules to Disjunctive Normal Form and builds questions with embedding-based distractor sampling, (ii) five variants probing distinct failure modes (Pro, Pert, Verbose, Aug, Rationale), and (iii) a benchmark of 29 LLMs and 4 embedding baselines. A human evaluation (9 practitioners, mean accuracy 45.0%) confirms that DiagnosticIQ requires specialist knowledge beyond operational experience. Three findings stand out. First, the frontier gap has closed: the top three LLMs lie within one Macro-F1 point of each other, with Bradley-Terry Elo placing claude-opus-4-6 30 points above the next model. Second, DiagnosticIQ Pro exposes brittleness: every model loses between 13% and 60% relative accuracy under distractor expansion. Third, DiagnosticIQ Aug exposes pattern-matching: under condition inversion, frontier models still select the original answer 49% to 63% of the time. The deployment bottleneck is not capability but calibration: frontier models handle template-style fault detection but break under structural perturbation.
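As an illustration of the rule normalization step mentioned in contribution (i), the following is a minimal sketch of converting a nested AND/OR rule into Disjunctive Normal Form. It is not the authors' pipeline: the tuple-based rule encoding, the function names `to_dnf` and `format_dnf`, and the sensor conditions in the example are all hypothetical, and negation is deliberately left out to keep the sketch short.

```python
from itertools import product


def to_dnf(expr):
    """Convert a nested AND/OR rule into DNF.

    `expr` is either an atom (a string such as "temp > 90") or a tuple
    ("and" | "or", sub_expr, sub_expr, ...). This encoding is an assumption
    made for illustration only.

    Returns a list of conjunctions; each conjunction is a frozenset of atoms.
    """
    if isinstance(expr, str):
        return [frozenset([expr])]

    op, *args = expr
    parts = [to_dnf(arg) for arg in args]

    if op == "or":
        # OR: concatenate the conjunctions of every branch.
        clauses = [clause for part in parts for clause in part]
    elif op == "and":
        # AND: distribute over OR by taking one conjunction from each branch.
        clauses = [frozenset().union(*combo) for combo in product(*parts)]
    else:
        raise ValueError(f"unsupported operator: {op!r}")

    # Drop duplicate conjunctions while preserving their order.
    seen, unique = set(), []
    for clause in clauses:
        if clause not in seen:
            seen.add(clause)
            unique.append(clause)
    return unique


def format_dnf(clauses):
    """Render DNF clauses as a readable string with atoms sorted per clause."""
    return " OR ".join(
        "(" + " AND ".join(sorted(clause)) + ")" for clause in clauses
    )


# Hypothetical rule: high temperature together with either symptom.
rule = ("and", "temp > 90", ("or", "vibration > 5", "pressure < 2"))
dnf = to_dnf(rule)
print(format_dnf(dnf))
```

Each resulting conjunction describes one self-contained set of sensor conditions under which the rule fires, which is the kind of flat form a question stem can be written from.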