
IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs

May 11, 2026
作者: Songlin Bai, Xintong Wang, Linlin Yu, Bin Chen, Zhiang Xu, Yuyang Sheng, Changtong Zan, Xiaofeng Zhu, Yizhe Zhang, Jiru Li, Mingze Guo, Ling Zou, Yalong Li, Chengfu Huo, Liang Ding
cs.AI

Abstract

In industrial procurement, an LLM answer is useful only if it survives a standards check: a recommended material must match the operating conditions, every parameter must respect a regulated threshold, and no procedure may contradict a safety clause. Partial correctness can mask safety-critical contradictions that aggregate LLM benchmarks rarely capture. We introduce IndustryBench, a 2,049-item benchmark for industrial procurement QA in Chinese, grounded in Chinese national standards (GB/T) and structured industrial product records, organized by seven capability dimensions, ten industry categories, and panel-derived difficulty tiers, with item-aligned English, Russian, and Vietnamese renderings. Our construction pipeline rejects 70.3% of LLM-generated candidates at a search-based external-verification stage, calibrating how unreliable industrial QA remains after LLM-only filtering. Our evaluation decouples raw correctness, scored by a Qwen3-Max judge validated at κ_w = 0.798 against a domain expert, from a separate safety-violation (SV) check against source texts. Across 17 models in Chinese and an 8-model intersection over four languages, we find: (i) the best system reaches only 2.083 on the 0-3 rubric, leaving substantial headroom; (ii) Standards & Terminology is the most persistent capability weakness and survives item-aligned translation; (iii) extended reasoning lowers safety-adjusted scores for 12 of 13 models, primarily by introducing unsupported safety-critical details into longer final answers; and (iv) safety-violation rates reshuffle the leaderboard: GPT-5.4 climbs from rank 6 to rank 3 after SV adjustment, while Kimi-k2.5-1T-A32B drops seven positions. Industrial LLM evaluation therefore requires source-grounded, safety-aware diagnosis rather than aggregate accuracy. We release IndustryBench with all prompts, scoring scripts, and dataset documentation.
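The judge's agreement with the human expert is reported as a weighted kappa (κ_w = 0.798). For readers unfamiliar with the metric, here is a minimal self-contained sketch of weighted Cohen's kappa on a 0-3 rubric; the abstract does not say which weighting scheme the authors used, so the quadratic penalty below is an assumption for illustration only:

```python
from collections import Counter

def quadratic_weighted_kappa(rater_a, rater_b, num_levels=4):
    """Weighted Cohen's kappa for ordinal ratings 0..num_levels-1.

    1.0 = perfect agreement, 0.0 = chance-level, negative = worse
    than chance. Quadratic weights penalize large disagreements more.
    """
    n = len(rater_a)
    # Observed joint rating distribution.
    obs = [[0.0] * num_levels for _ in range(num_levels)]
    for x, y in zip(rater_a, rater_b):
        obs[x][y] += 1.0 / n
    # Marginal rating counts for each rater (chance-agreement model).
    pa, pb = Counter(rater_a), Counter(rater_b)
    disagree_obs = disagree_exp = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            w = (i - j) ** 2 / (num_levels - 1) ** 2  # quadratic penalty
            disagree_obs += w * obs[i][j]
            disagree_exp += w * (pa[i] / n) * (pb[j] / n)
    return 1.0 - disagree_obs / disagree_exp
```

Identical score lists yield 1.0, and maximally opposed lists (all 0s judged as 3s and vice versa) yield -1.0; a value near 0.8 such as the reported κ_w indicates substantial agreement under the common Landis-Koch interpretation.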
PDF · May 14, 2026