DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules

May 9, 2026
Authors: Devin Yasith De Silva, Dhaval Patel, Christodoulos Constantinides, Shuxin Lin, Nianjun Zhou, Paul J Adams, Sal Rosato, Nicolas Constantinides, Deborah L. McGuinness, Jayant Kalagnanam
cs.AI

Abstract

Monitoring complex industrial assets relies on engineer-authored symbolic rules that trigger on sensor conditions and prompt technicians to perform corrective actions. The bottleneck is not detection but response: translating rules into maintenance steps requires asset-specific knowledge gained through years of practice. We investigate whether LLMs can serve as decision support for this rule-to-action step and introduce DiagnosticIQ, a benchmark of 6,690 expert-validated multiple-choice questions built from 118 rule-action pairs across 16 asset types. We contribute (i) a symbolic-to-MCQA pipeline that normalizes rules to Disjunctive Normal Form and uses embedding-based distractor sampling, (ii) five variants probing distinct failure modes (Pro, Pert, Verbose, Aug, Rationale), and (iii) a benchmark of 29 LLMs and 4 embedding baselines. A human evaluation (9 practitioners, mean accuracy 45.0%) confirms that DiagnosticIQ requires specialist knowledge beyond operational experience. Three findings stand out. First, the gap at the frontier has closed: the top three LLMs lie within one Macro point of each other, while Bradley-Terry Elo places claude-opus-4-6 30 points above the next model. Second, the Pro variant exposes brittleness: every model loses 13% to 60% relative accuracy under distractor expansion. Third, the Aug variant exposes pattern-matching: under condition inversion, frontier models still select the original answer 49% to 63% of the time. The deployment bottleneck is not capability but calibration: frontier models handle template-style fault detection but break under structural perturbation.
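
To make the Disjunctive Normal Form step mentioned in the abstract concrete, here is a minimal Python sketch of how a symbolic rule could be rewritten as an OR of AND-clauses. It is an illustration under stated assumptions, not the authors' pipeline: the tuple-based rule encoding, the function names `to_nnf` and `to_dnf`, and the sensor-condition names (`temp_high`, `pressure_low`, `flow_ok`) are all hypothetical.

```python
from itertools import product

# Hypothetical rule encoding (an assumption, not the paper's format):
# an atom is a string; compound nodes are ("and", ...), ("or", ...), ("not", x).

def to_nnf(expr, negate=False):
    """Push negations down to the atoms (negation normal form)."""
    if isinstance(expr, str):
        return ("not", expr) if negate else expr
    op = expr[0]
    if op == "not":
        # A negation flips the pending polarity; double negations cancel.
        return to_nnf(expr[1], not negate)
    if negate:
        # De Morgan: a negated AND becomes OR, and vice versa.
        op = "or" if op == "and" else "and"
    return (op,) + tuple(to_nnf(child, negate) for child in expr[1:])

def _dnf(expr):
    """Convert an NNF expression to a list of conjunctions (frozensets of literals)."""
    if isinstance(expr, str):
        return [frozenset([expr])]
    if expr[0] == "not":
        # In NNF, "not" only ever wraps an atom; mark it with a "!" prefix.
        return [frozenset(["!" + expr[1]])]
    parts = [_dnf(child) for child in expr[1:]]
    if expr[0] == "or":
        # OR simply concatenates the clauses of its children.
        return [clause for part in parts for clause in part]
    # AND distributes over OR: take one clause from each child and merge them.
    return [frozenset().union(*combo) for combo in product(*parts)]

def to_dnf(expr):
    """Return the rule as a list of AND-clauses whose disjunction equals the rule."""
    return _dnf(to_nnf(expr))

# Hypothetical rule: temp_high AND (pressure_low OR NOT flow_ok)
rule = ("and", "temp_high", ("or", "pressure_low", ("not", "flow_ok")))
clauses = to_dnf(rule)

for clause in clauses:
    print(" AND ".join(sorted(clause)))
```

Each resulting clause is a self-contained trigger condition, which is one plausible reason to normalize before generating questions: every clause can be paired with its corrective action independently. This sketch does not simplify or deduplicate clauses, and the distribution step can grow exponentially on deeply nested rules.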