IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs

Abstract

In industrial procurement, an LLM answer is useful only if it survives a standards check: the recommended material must match the operating conditions, every parameter must respect its regulated threshold, and no procedure may contradict a safety clause. Partial correctness can mask safety-critical contradictions that aggregate LLM benchmarks rarely capture. We introduce IndustryBench, a 2,049-item benchmark for industrial procurement QA in Chinese. It is grounded in Chinese national standards (GB/T) and structured industrial product records, and is organized by seven capability dimensions, ten industry categories, and panel-derived difficulty tiers, with item-aligned English, Russian, and Vietnamese renderings. Our construction pipeline rejects 70.3% of LLM-generated candidates at a search-based external-verification stage, which calibrates how unreliable industrial QA remains after LLM-only filtering. Our evaluation decouples raw correctness, scored by a Qwen3-Max judge validated at κ_w = 0.798 against a domain expert, from a separate safety-violation (SV) check against source texts. Across 17 models in Chinese and an 8-model intersection over four languages, we find: (i) the best system reaches only 2.083 on the 0-3 rubric, leaving substantial headroom; (ii) Standards & Terminology is the most persistent capability weakness and survives item-aligned translation; (iii) extended reasoning lowers safety-adjusted scores for 12 of 13 models, primarily by introducing unsupported safety-critical details into longer final answers; and (iv) safety-violation rates reshuffle the leaderboard: GPT-5.4 climbs from rank 6 to rank 3 after SV adjustment, while Kimi-k2.5-1T-A32B drops seven positions. Industrial LLM evaluation therefore requires source-grounded, safety-aware diagnosis rather than aggregate accuracy. We release IndustryBench with all prompts, scoring scripts, and dataset documentation.
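The judge-validation figure κ_w is a weighted Cohen's kappa, an agreement statistic for ordinal ratings that penalizes large disagreements more than near misses. The following is a minimal sketch of how such a statistic could be computed for two raters on a 0-3 rubric. It is not the authors' scoring script: the abstract does not say which weighting scheme was used, so the `scheme` parameter (linear or quadratic) is an assumption, and the function name and sample ratings are hypothetical.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, num_levels=4, scheme="quadratic"):
    """Weighted Cohen's kappa for two raters on an ordinal scale 0..num_levels-1.

    Assumption: the weighting scheme is not stated in the abstract, so both
    common choices ("linear" and "quadratic") are supported here.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    if scheme not in ("linear", "quadratic"):
        raise ValueError("scheme must be 'linear' or 'quadratic'")

    n = len(rater_a)

    def disagreement(i, j):
        # Normalized distance between two rubric levels, in [0, 1].
        d = abs(i - j) / (num_levels - 1)
        return d * d if scheme == "quadratic" else d

    # Observed weighted disagreement, averaged over all rated items.
    observed = sum(disagreement(a, b) for a, b in zip(rater_a, rater_b)) / n

    # Expected weighted disagreement if the raters were independent,
    # computed from each rater's marginal distribution of scores.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(
        disagreement(i, j) * counts_a[i] * counts_b[j]
        for i in range(num_levels)
        for j in range(num_levels)
    ) / (n * n)

    if expected == 0:
        # Both raters used one identical level everywhere: agreement is total.
        return 1.0
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Hypothetical ratings for illustration only, not data from the benchmark.
    judge_scores = [3, 2, 2, 0, 1, 3, 1, 0]
    expert_scores = [3, 2, 1, 0, 1, 2, 1, 0]
    print(round(weighted_kappa(judge_scores, expert_scores), 3))
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance, so a reported value near 0.8 means the judge and the expert rarely differ by more than one rubric level.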
