AssertBench: 大規模言語モデルの自己主張能力を評価するためのベンチマーク

要旨

最近のベンチマークでは、大規模言語モデル（LLM）の事実的一貫性と修辞的堅牢性が探求されてきた。しかし、事実として真である文の方向性のあるフレーミングがモデルの同意にどのように影響するかについては、LLMユーザーにとって一般的なシナリオでありながら、知識のギャップが存在する。AssertBenchは、FEVEROUSという事実検証データセットから証拠に裏打ちされた事実をサンプリングすることでこの問題に取り組む。各（証拠に基づく）事実に対して、ユーザーがその文が事実として正しいと主張するフレーミングプロンプトと、ユーザーがそれが誤りであると主張するフレーミングプロンプトの2つを構築する。その後、モデルの同意とその理由を記録する。望ましい結果は、モデルが自己主張し、両方のフレーミングにおいて一貫した真実評価を維持し、ユーザーに同意するために評価を切り替えないことである。AssertBenchは、同じ主張を中立的に提示した場合のモデルの精度に基づいて結果を層別化することで、フレーミングによる変動をモデルの基礎となる事実知識から分離する。これにより、このベンチマークは、同じ事実について矛盾するユーザーの主張を提示された際に、LLMが「自説を貫く」能力を測定することを目指している。完全なソースコードはhttps://github.com/achowd32/assert-benchで利用可能である。

English

Recent benchmarks have probed factual consistency and rhetorical robustness in Large Language Models (LLMs). However, a knowledge gap exists regarding how directional framing of factually true statements influences model agreement, a common scenario for LLM users. AssertBench addresses this by sampling evidence-supported facts from FEVEROUS, a fact verification dataset. For each (evidence-backed) fact, we construct two framing prompts: one where the user claims the statement is factually correct, and another where the user claims it is incorrect. We then record the model's agreement and reasoning. The desired outcome is that the model asserts itself, maintaining consistent truth evaluation across both framings, rather than switching its evaluation to agree with the user. AssertBench isolates framing-induced variability from the model's underlying factual knowledge by stratifying results based on the model's accuracy on the same claims when presented neutrally. In doing so, this benchmark aims to measure an LLM's ability to "stick to its guns" when presented with contradictory user assertions about the same fact. The complete source code is available at https://github.com/achowd32/assert-bench.

AssertBench: 大規模言語モデルの自己主張能力を評価するためのベンチマーク

AssertBench: A Benchmark for Evaluating Self-Assertion in Large Language Models

要旨

Support