AssertBench: 대규모 언어 모델의 자기 주장 능력 평가를 위한 벤치마크

초록

최근 벤치마크들은 대규모 언어 모델(LLM)의 사실 일관성과 수사적 견고성을 탐구해 왔습니다. 그러나 사실적으로 참인 진술의 방향성 프레이밍이 모델의 동의에 미치는 영향에 대한 지식 격차가 존재하며, 이는 LLM 사용자들에게 흔히 발생하는 시나리오입니다. AssertBench는 이를 해결하기 위해 사실 검증 데이터셋인 FEVEROUS에서 증거가 뒷받침되는 사실들을 샘플링합니다. 각 (증거가 뒷받침되는) 사실에 대해, 우리는 두 가지 프레이밍 프롬프트를 구성합니다: 하나는 사용자가 해당 진술이 사실적으로 옳다고 주장하는 경우, 다른 하나는 사용자가 그것이 틀렸다고 주장하는 경우입니다. 그런 다음 모델의 동의와 추론을 기록합니다. 바람직한 결과는 모델이 스스로를 주장하며, 두 프레이밍에서 일관된 진실 평가를 유지하는 것입니다. 즉, 사용자와 동의하기 위해 평가를 바꾸는 것이 아니라 말입니다. AssertBench는 중립적으로 제시된 동일한 주장에 대한 모델의 정확도를 기반으로 결과를 계층화함으로써 프레이밍으로 인한 변동성을 모델의 기본적인 사실 지식과 분리합니다. 이를 통해, 이 벤치마크는 동일한 사실에 대해 상반된 사용자 주장이 제시될 때 LLM이 "자신의 입장을 고수"하는 능력을 측정하는 것을 목표로 합니다. 전체 소스 코드는 https://github.com/achowd32/assert-bench에서 확인할 수 있습니다.

English

Recent benchmarks have probed factual consistency and rhetorical robustness in Large Language Models (LLMs). However, a knowledge gap exists regarding how directional framing of factually true statements influences model agreement, a common scenario for LLM users. AssertBench addresses this by sampling evidence-supported facts from FEVEROUS, a fact verification dataset. For each (evidence-backed) fact, we construct two framing prompts: one where the user claims the statement is factually correct, and another where the user claims it is incorrect. We then record the model's agreement and reasoning. The desired outcome is that the model asserts itself, maintaining consistent truth evaluation across both framings, rather than switching its evaluation to agree with the user. AssertBench isolates framing-induced variability from the model's underlying factual knowledge by stratifying results based on the model's accuracy on the same claims when presented neutrally. In doing so, this benchmark aims to measure an LLM's ability to "stick to its guns" when presented with contradictory user assertions about the same fact. The complete source code is available at https://github.com/achowd32/assert-bench.

AssertBench: 대규모 언어 모델의 자기 주장 능력 평가를 위한 벤치마크

AssertBench: A Benchmark for Evaluating Self-Assertion in Large Language Models

초록

Support