

IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs

May 11, 2026
作者: Songlin Bai, Xintong Wang, Linlin Yu, Bin Chen, Zhiang Xu, Yuyang Sheng, Changtong Zan, Xiaofeng Zhu, Yizhe Zhang, Jiru Li, Mingze Guo, Ling Zou, Yalong Li, Chengfu Huo, Liang Ding
cs.AI

Abstract

In industrial procurement, an LLM answer is useful only if it survives a standards check: recommended materials must match operating conditions, every parameter must respect regulated thresholds, and no procedure may contradict a safety clause. Partial correctness can mask safety-critical contradictions that aggregate LLM benchmarks rarely capture. We introduce IndustryBench, a 2,049-item benchmark for industrial procurement QA in Chinese, grounded in Chinese national standards (GB/T) and structured industrial product records, organized by seven capability dimensions, ten industry categories, and panel-derived difficulty tiers, with item-aligned English, Russian, and Vietnamese renderings. Our construction pipeline rejects 70.3% of LLM-generated candidates at a search-based external-verification stage, calibrating how unreliable industrial QA remains after LLM-only filtering.

Our evaluation decouples raw correctness, scored by a Qwen3-Max judge validated at κ_w = 0.798 against a domain expert, from a separate safety-violation (SV) check against source texts. Across 17 models in Chinese and an 8-model intersection over four languages, we find: (i) the best system reaches only 2.083 on the 0-3 rubric, leaving substantial headroom; (ii) Standards & Terminology is the most persistent capability weakness and survives item-aligned translation; (iii) extended reasoning lowers safety-adjusted scores for 12 of 13 models, primarily by introducing unsupported safety-critical details into longer final answers; and (iv) safety-violation rates reshuffle the leaderboard: GPT-5.4 climbs from rank 6 to rank 3 after SV adjustment, while Kimi-k2.5-1T-A32B drops seven positions.

Industrial LLM evaluation therefore requires source-grounded, safety-aware diagnosis rather than aggregate accuracy. We release IndustryBench with all prompts, scoring scripts, and dataset documentation.
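The judge-agreement figure (κ_w = 0.798) is a weighted kappa on the 0-3 rubric. The abstract does not state the weighting scheme, so as an illustration only, the sketch below assumes the common quadratic weighting for ordinal scales; the function name and setup are ours, not from the paper:

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, num_levels=4):
    """Quadratic weighted kappa between two raters on an ordinal 0..num_levels-1 scale.

    Illustrative implementation under the quadratic-weighting assumption;
    the paper does not specify which weighting it uses for kappa_w.
    """
    rater_a = np.asarray(rater_a)
    rater_b = np.asarray(rater_b)

    # Observed joint distribution of the two raters' scores.
    observed = np.zeros((num_levels, num_levels))
    for a, b in zip(rater_a, rater_b):
        observed[a, b] += 1
    observed /= observed.sum()

    # Expected joint distribution under independence (outer product of marginals).
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))

    # Quadratic disagreement weights: penalty grows with squared score distance.
    idx = np.arange(num_levels)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (num_levels - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()
```

Perfect agreement yields 1.0, and small disagreements on adjacent rubric levels are penalized far less than 0-vs-3 disagreements, which is why weighted kappa is the usual choice for ordinal grading rubrics.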
PDF · May 14, 2026