
IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs

May 11, 2026
作者: Songlin Bai, Xintong Wang, Linlin Yu, Bin Chen, Zhiang Xu, Yuyang Sheng, Changtong Zan, Xiaofeng Zhu, Yizhe Zhang, Jiru Li, Mingze Guo, Ling Zou, Yalong Li, Chengfu Huo, Liang Ding
cs.AI

Abstract

In industrial procurement, an LLM answer is useful only if it survives a standards check: a recommended material must match the operating conditions, every parameter must respect a regulated threshold, and no procedure may contradict a safety clause. Partial correctness can mask safety-critical contradictions that aggregate LLM benchmarks rarely capture. We introduce IndustryBench, a 2,049-item benchmark for industrial procurement QA in Chinese, grounded in Chinese national standards (GB/T) and structured industrial product records, organized by seven capability dimensions, ten industry categories, and panel-derived difficulty tiers, with item-aligned English, Russian, and Vietnamese renderings. Our construction pipeline rejects 70.3% of LLM-generated candidates at a search-based external-verification stage, calibrating how unreliable industrial QA remains after LLM-only filtering. Our evaluation decouples raw correctness, scored by a Qwen3-Max judge validated at κ_w = 0.798 against a domain expert, from a separate safety-violation (SV) check against source texts. Across 17 models in Chinese and an 8-model intersection over four languages, we find: (i) the best system reaches only 2.083 on the 0-3 rubric, leaving substantial headroom; (ii) Standards & Terminology is the most persistent capability weakness and survives item-aligned translation; (iii) extended reasoning lowers safety-adjusted scores for 12 of 13 models, primarily by introducing unsupported safety-critical details into longer final answers; and (iv) safety-violation rates reshuffle the leaderboard: GPT-5.4 climbs from rank 6 to rank 3 after SV adjustment, while Kimi-k2.5-1T-A32B drops seven positions. Industrial LLM evaluation therefore requires source-grounded, safety-aware diagnosis rather than aggregate accuracy. We release IndustryBench with all prompts, scoring scripts, and dataset documentation.
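The judge's agreement with the human expert is reported as a weighted kappa (κ_w = 0.798). For readers unfamiliar with the metric, here is a minimal self-contained sketch of weighted Cohen's kappa on a 0-3 rubric; the abstract does not say which weighting scheme the authors used, so the quadratic penalty below is an assumption for illustration only:

```python
from collections import Counter

def quadratic_weighted_kappa(rater_a, rater_b, num_levels=4):
    """Weighted Cohen's kappa for ordinal ratings 0..num_levels-1.

    1.0 = perfect agreement, 0.0 = chance-level, negative = worse
    than chance. Quadratic weights penalize large disagreements more.
    """
    n = len(rater_a)
    # Observed joint rating distribution.
    obs = [[0.0] * num_levels for _ in range(num_levels)]
    for x, y in zip(rater_a, rater_b):
        obs[x][y] += 1.0 / n
    # Marginal rating counts for each rater (chance-agreement model).
    pa, pb = Counter(rater_a), Counter(rater_b)
    disagree_obs = disagree_exp = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            w = (i - j) ** 2 / (num_levels - 1) ** 2  # quadratic penalty
            disagree_obs += w * obs[i][j]
            disagree_exp += w * (pa[i] / n) * (pb[j] / n)
    return 1.0 - disagree_obs / disagree_exp
```

Identical score lists yield 1.0, and maximally opposed lists (all 0s judged as 3s and vice versa) yield -1.0; a value near 0.8 such as the reported κ_w indicates substantial agreement under the common Landis-Koch interpretation.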
PDF · May 14, 2026