AssertBench:評估大型語言模型自我斷言之基準
AssertBench: A Benchmark for Evaluating Self-Assertion in Large Language Models
June 8, 2025
作者: Jaeho Lee, Atharv Chowdhary
cs.AI
摘要
近期基准测试已深入探究了大语言模型(LLMs)在事实一致性与修辞稳健性方面的表现。然而,关于事实真实陈述的方向性框架如何影响模型同意度,这一LLM用户常见场景的知识空白尚存。AssertBench通过从事实验证数据集FEVEROUS中抽样证据支持的事实来应对这一问题。对于每一条(有证据支持的)事实,我们构建了两个框架提示:一种提示中用户声称该陈述在事实上是正确的,另一种则声称其不正确。随后,我们记录模型的同意度及其推理过程。理想的结果是,模型能够坚持己见,在两种框架下保持一致的真相评估,而非为了迎合用户而改变其评估。AssertBench通过将结果分层,基于模型在相同主张以中立方式呈现时的准确性,隔离了框架诱导的变异性与模型底层事实知识。借此,该基准测试旨在衡量LLM在面对用户关于同一事实的相反断言时,能否“坚守立场”。完整源代码可访问https://github.com/achowd32/assert-bench获取。
English
Recent benchmarks have probed factual consistency and rhetorical robustness
in Large Language Models (LLMs). However, a knowledge gap exists regarding how
directional framing of factually true statements influences model agreement, a
common scenario for LLM users. AssertBench addresses this by sampling
evidence-supported facts from FEVEROUS, a fact verification dataset. For each
(evidence-backed) fact, we construct two framing prompts: one where the user
claims the statement is factually correct, and another where the user claims it
is incorrect. We then record the model's agreement and reasoning. The desired
outcome is that the model asserts itself, maintaining consistent truth
evaluation across both framings, rather than switching its evaluation to agree
with the user. AssertBench isolates framing-induced variability from the
model's underlying factual knowledge by stratifying results based on the
model's accuracy on the same claims when presented neutrally. In doing so, this
benchmark aims to measure an LLM's ability to "stick to its guns" when
presented with contradictory user assertions about the same fact. The complete
source code is available at https://github.com/achowd32/assert-bench.