KWBench: Measuring Unprompted Problem Recognition in Knowledge Work

Abstract

We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM identify a professional scenario before attempting to solve it? Existing frontier benchmarks have saturated, and most knowledge-work evaluations to date reduce to extraction or task completion against a specification. KWBench targets the step before that: recognizing the governing structure of the situation from raw inputs alone. The benchmark contains 223 tasks sourced from practitioners across acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design. Each task encodes a formal game-theoretic pattern (principal-agent conflict, signaling, mechanism design failure, strategic omission, coalitional dynamics, strategic interdependence) and carries structured ground truth recording the expert reading of the situation and the anticipated failure modes. Models receive raw data and a task prompt with no indication of problem type. Scoring uses a three-tier rubric gated by a mandatory conjunctive check, where the mandatory criteria encode the predicted wrong paths. We evaluate 16 models. The best model passes 27.9% of tasks. The top two models agree on only 31.7% of their passes. Among the top 8 models, 44 tasks are solved by exactly one model; routing across the top 8 covers 50.7% of the benchmark, nearly double the best single model. Conditional on passing, quality scores converge (approximately 83% across models); unconditional scores do not. The same models articulate the relevant game-theoretic concept correctly when asked, then fail to apply it unprompted. We release KWBench to shift how frontier models are evaluated on knowledge work, scoring them on whether they recognize the right problem from the situation alone, not only on how well they execute once the problem has been framed for them.
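
The gated scoring and the cross-model statistics quoted in the abstract can be illustrated with a minimal sketch. It is not the benchmark's implementation. The `TaskResult` type, the function names, the equal-weight average over rubric tiers, and the reading of "agreement" as the overlap of two pass sets divided by their union are all assumptions made for illustration; the abstract does not specify tier weights or the exact agreement formula.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task grading record (names are illustrative)."""
    mandatory: list[bool]      # one flag per mandatory criterion
    tier_scores: list[float]   # assumed per-tier rubric scores in [0, 1]


def passes_gate(result: TaskResult) -> bool:
    # Conjunctive check: every mandatory criterion must hold.
    return all(result.mandatory)


def unconditional_score(result: TaskResult) -> float:
    # Assumption: a failed gate scores zero; otherwise tiers are
    # averaged with equal weight (the real weighting is not stated).
    if not passes_gate(result) or not result.tier_scores:
        return 0.0
    return sum(result.tier_scores) / len(result.tier_scores)


def oracle_coverage(pass_sets: dict[str, set[str]], n_tasks: int) -> float:
    # Fraction of tasks passed by at least one model, i.e. what a
    # perfect router over these models would achieve.
    solved = set().union(*pass_sets.values())
    return len(solved) / n_tasks


def uniquely_solved(pass_sets: dict[str, set[str]]) -> set[str]:
    # Tasks that exactly one model passes.
    counts = Counter(task for tasks in pass_sets.values() for task in tasks)
    return {task for task, n in counts.items() if n == 1}


def pass_agreement(a: set[str], b: set[str]) -> float:
    # Assumed definition: shared passes over all passes by either model.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    # Toy data, not benchmark results.
    toy = {"model_a": {"t1", "t2"}, "model_b": {"t2", "t3"}}
    print(oracle_coverage(toy, n_tasks=4))
    print(sorted(uniquely_solved(toy)))
    print(pass_agreement(toy["model_a"], toy["model_b"]))
```

Under these assumptions, a low agreement between two models together with a high oracle coverage indicates that the models pass largely different tasks, which is the pattern the abstract reports for the top 8 models.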

