DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules

Abstract

Monitoring complex industrial assets relies on engineer-authored symbolic rules that fire on sensor conditions and prompt technicians to perform corrective actions. The bottleneck is not detection but response: translating a rule into maintenance steps requires asset-specific knowledge gained through years of practice. We investigate whether large language models (LLMs) can serve as decision support for this rule-to-action step and introduce DiagnosticIQ, a benchmark of 6,690 expert-validated multiple-choice questions derived from 118 rule-action pairs across 16 asset types. We contribute (i) a symbolic-to-MCQA pipeline that normalizes rules to Disjunctive Normal Form and samples distractors using embeddings, (ii) five variants probing distinct failure modes (Pro, Pert, Verbose, Aug, Rationale), and (iii) a benchmark of 29 LLMs and 4 embedding baselines. A human evaluation (9 practitioners, mean accuracy 45.0%) confirms that DiagnosticIQ requires specialist knowledge beyond operational experience. Three findings stand out. First, the frontier has converged: the top three LLMs lie within one Macro point of each other, with Bradley-Terry Elo placing claude-opus-4-6 30 points above the next model. Second, DiagnosticIQ Pro exposes brittleness: every model loses 13% to 60% of its accuracy in relative terms under distractor expansion. Third, DiagnosticIQ Aug exposes pattern-matching: under condition inversion, frontier models still select the original answer 49% to 63% of the time. The deployment bottleneck is not capability but calibration: frontier models handle template-style fault detection but break under structural perturbation.
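
To make the Disjunctive Normal Form step concrete, the following is a minimal sketch of how a nested symbolic rule could be normalized to DNF. It is an illustration only, not the pipeline used to build the benchmark: the tuple encoding of rules, the function names (to_nnf, to_dnf), and the sensor condition names (temp_high, vibration_high, flow_ok) are all assumptions introduced here.

```python
from itertools import product

# Hypothetical rule encoding (an assumption, not the benchmark's format):
#   "name"                 an atomic sensor condition
#   ("not", expr)          negation
#   ("and", e1, e2, ...)   conjunction
#   ("or",  e1, e2, ...)   disjunction


def to_nnf(expr, negate=False):
    """Push negations down to the atoms (negation normal form)."""
    if isinstance(expr, str):
        return ("not", expr) if negate else expr
    op, *args = expr
    if op == "not":
        # A negation flips the pending polarity of its operand.
        return to_nnf(args[0], not negate)
    if negate:
        # De Morgan: negating a conjunction gives a disjunction and vice versa.
        op = "or" if op == "and" else "and"
    return (op, *[to_nnf(a, negate) for a in args])


def _dnf(expr):
    """DNF of an NNF expression, as a set of conjunctions (frozensets of literals)."""
    if isinstance(expr, str) or expr[0] == "not":
        return {frozenset([expr])}
    op, *args = expr
    parts = [_dnf(a) for a in args]
    if op == "or":
        # A disjunction of DNFs is the union of their conjunctions.
        return set().union(*parts)
    # A conjunction of DNFs distributes: pick one conjunction from each operand.
    return {frozenset().union(*combo) for combo in product(*parts)}


def to_dnf(expr):
    """Normalize a rule to DNF. Contradictory conjunctions are not pruned."""
    return _dnf(to_nnf(expr))


# Illustrative rule: temp_high AND (vibration_high OR NOT flow_ok)
rule = ("and", "temp_high", ("or", "vibration_high", ("not", "flow_ok")))
clauses = to_dnf(rule)
```

Here clauses holds two conjunctions, one pairing temp_high with vibration_high and one pairing temp_high with the negated flow_ok. Representing each conjunction as a frozenset makes the result independent of operand order, which is one plausible way to compare or deduplicate rules before generating questions from them.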