IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs

Abstract


In industrial procurement, an LLM answer is useful only if it survives a standards check: the recommended material must match the operating conditions, every parameter must respect its regulated threshold, and no procedure may contradict a safety clause. Partial correctness can mask safety-critical contradictions that aggregate LLM benchmarks rarely capture. We introduce IndustryBench, a 2,049-item benchmark for industrial procurement QA in Chinese, grounded in Chinese national standards (GB/T) and structured industrial product records. It is organized by seven capability dimensions, ten industry categories, and panel-derived difficulty tiers, and every item has aligned English, Russian, and Vietnamese renderings. Our construction pipeline rejects 70.3% of LLM-generated candidates at a search-based external-verification stage, which calibrates how unreliable industrial QA remains after LLM-only filtering. Our evaluation decouples raw correctness, scored by a Qwen3-Max judge validated at κ_w = 0.798 against a domain expert, from a separate safety-violation (SV) check against source texts. Across 17 models in Chinese and an 8-model intersection over four languages, we find: (i) the best system reaches only 2.083 on the 0–3 rubric, leaving substantial headroom; (ii) Standards & Terminology is the most persistent capability weakness and survives item-aligned translation; (iii) extended reasoning lowers safety-adjusted scores for 12 of 13 models, primarily by introducing unsupported safety-critical details into longer final answers; and (iv) safety-violation rates reshuffle the leaderboard: GPT-5.4 climbs from rank 6 to rank 3 after SV adjustment, while Kimi-k2.5-1T-A32B drops seven positions. Industrial LLM evaluation therefore requires source-grounded, safety-aware diagnosis rather than aggregate accuracy. We release IndustryBench with all prompts, scoring scripts, and dataset documentation.
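The judge-validation figure κ_w is a weighted Cohen's kappa, which gives partial credit when two raters disagree by only one step on an ordinal scale such as the 0–3 rubric. The sketch below shows one way such a statistic could be computed. It is an illustration, not the released scoring script: the abstract does not say whether linear or quadratic weights were used, so the `power` parameter is an assumption, and the function name and the toy ratings are hypothetical.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, categories=(0, 1, 2, 3), power=1):
    """Weighted Cohen's kappa for two raters on an ordinal scale.

    power=1 uses linear disagreement weights, power=2 quadratic ones.
    Which scheme the benchmark uses is not stated in the abstract,
    so this choice is an assumption.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")

    n = len(rater_a)
    span = max(categories) - min(categories)

    def weight(i, j):
        # Disagreement weight: 0 on the diagonal, 1 at maximum distance.
        return (abs(i - j) / span) ** power

    joint = Counter(zip(rater_a, rater_b))   # observed pair counts
    marg_a = Counter(rater_a)                # marginal counts, rater A
    marg_b = Counter(rater_b)                # marginal counts, rater B

    observed = sum(weight(i, j) * c / n for (i, j), c in joint.items())
    expected = sum(
        weight(i, j) * (marg_a[i] / n) * (marg_b[j] / n)
        for i in categories
        for j in categories
    )
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - observed / expected


# Hypothetical toy ratings on the 0-3 rubric (not data from the benchmark).
judge_scores = [3, 2, 2, 1, 0, 3, 1, 2]
expert_scores = [3, 2, 1, 1, 0, 2, 1, 2]
toy_kappa = weighted_kappa(judge_scores, expert_scores)
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement. Passing `power=2` switches to quadratic weights, which penalize large gaps more heavily than adjacent-score disagreements.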

