

IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs

May 11, 2026
作者: Songlin Bai, Xintong Wang, Linlin Yu, Bin Chen, Zhiang Xu, Yuyang Sheng, Changtong Zan, Xiaofeng Zhu, Yizhe Zhang, Jiru Li, Mingze Guo, Ling Zou, Yalong Li, Chengfu Huo, Liang Ding
cs.AI

Abstract

In industrial procurement, an LLM answer is useful only if it survives a standards check: recommended materials must match operating conditions, every parameter must respect regulated thresholds, and no procedure may contradict a safety clause. Partial correctness can mask safety-critical contradictions that aggregate LLM benchmarks rarely capture. We introduce IndustryBench, a 2,049-item benchmark for industrial procurement QA in Chinese, grounded in Chinese national standards (GB/T) and structured industrial product records, organized by seven capability dimensions, ten industry categories, and panel-derived difficulty tiers, with item-aligned English, Russian, and Vietnamese renderings. Our construction pipeline rejects 70.3% of LLM-generated candidates at a search-based external-verification stage, calibrating how unreliable industrial QA remains after LLM-only filtering.

Our evaluation decouples raw correctness, scored by a Qwen3-Max judge validated at κ_w = 0.798 against a domain expert, from a separate safety-violation (SV) check against source texts. Across 17 models in Chinese and an 8-model intersection over four languages, we find: (i) the best system reaches only 2.083 on the 0-3 rubric, leaving substantial headroom; (ii) Standards & Terminology is the most persistent capability weakness and survives item-aligned translation; (iii) extended reasoning lowers safety-adjusted scores for 12 of 13 models, primarily by introducing unsupported safety-critical details into longer final answers; and (iv) safety-violation rates reshuffle the leaderboard: GPT-5.4 climbs from rank 6 to rank 3 after SV adjustment, while Kimi-k2.5-1T-A32B drops seven positions.

Industrial LLM evaluation therefore requires source-grounded, safety-aware diagnosis rather than aggregate accuracy. We release IndustryBench with all prompts, scoring scripts, and dataset documentation.
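The judge-agreement figure (κ_w = 0.798) is a weighted kappa on the 0-3 rubric. The abstract does not state the weighting scheme, so as an illustration only, the sketch below assumes the common quadratic weighting for ordinal scales; the function name and setup are ours, not from the paper:

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, num_levels=4):
    """Quadratic weighted kappa between two raters on an ordinal 0..num_levels-1 scale.

    Illustrative implementation under the quadratic-weighting assumption;
    the paper does not specify which weighting it uses for kappa_w.
    """
    rater_a = np.asarray(rater_a)
    rater_b = np.asarray(rater_b)

    # Observed joint distribution of the two raters' scores.
    observed = np.zeros((num_levels, num_levels))
    for a, b in zip(rater_a, rater_b):
        observed[a, b] += 1
    observed /= observed.sum()

    # Expected joint distribution under independence (outer product of marginals).
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))

    # Quadratic disagreement weights: penalty grows with squared score distance.
    idx = np.arange(num_levels)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (num_levels - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()
```

Perfect agreement yields 1.0, and small disagreements on adjacent rubric levels are penalized far less than 0-vs-3 disagreements, which is why weighted kappa is the usual choice for ordinal grading rubrics.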
PDF · May 14, 2026