Who Flips? Self- and Cross-Model Counterarguments Reveal Answer Instability in LLMs

June 14, 2026
Authors: Nafiseh Nikeghbal, Amir Hossein Kargaran, Shaghayegh Kolli, Jana Diesner
cs.AI

Abstract

Standard accuracy benchmarks are designed to test how well large language models (LLMs) arrive at correct answers, but they are not suited to testing whether a model sticks with a correct answer when that answer is challenged by a plausible counterargument. We introduce a controlled protocol for evaluating answer stability: after a model answers a multiple-choice question correctly, we challenge its answer with a coherent argument for an incorrect option and measure whether the model flips. The setup (a) isolates argumentative content from overt social pressure and (b) varies argument length, self-attribution, and cross-model source. Across seven frontier models and 57 MMLU subjects, flip rates range from 17.5% to 97.3%, revealing large differences in stability that accuracy metrics alone do not capture. We find that self-attribution consistently increases flip rates (mean +7.1 percentage points (pp), up to +18.7 pp). In addition, pooling wrong-answer arguments across models and selecting the most effective one per question yields stronger adversarial challenges than relying on any single source model. We further construct MaxFlip, a curated challenge set that raises flip rates by up to 23.6 pp over standard self-generated challenges. We release the protocol, challenge records, and MaxFlip to support stability evaluation alongside standard accuracy benchmarks. Materials are available at https://github.com/nafisenik/WhoFlips and https://hf.co/datasets/nafisehNik/WhoFlips.
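
The quantities named in the abstract can be illustrated with a small Python sketch. It is not the authors' released code: the `ChallengeRecord` fields and all function names are hypothetical, and the pooled rate assumes that "selecting the most effective argument per question" means a question counts as flipped if at least one pooled argument flipped it. The schema of the released challenge records may differ.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ChallengeRecord:
    """One challenge to a correctly answered question (hypothetical schema)."""
    question_id: str
    source_model: str      # model that wrote the wrong-answer argument
    self_attributed: bool  # argument presented as the target model's own
    flipped: bool          # target model abandoned its correct answer


def flip_rate(records: Iterable[ChallengeRecord]) -> float:
    """Percentage of challenges after which the model flipped."""
    records = list(records)
    if not records:
        raise ValueError("flip rate is undefined for an empty record set")
    return 100.0 * sum(r.flipped for r in records) / len(records)


def self_attribution_delta(records: Iterable[ChallengeRecord]) -> float:
    """Flip-rate difference (percentage points): self-attributed minus not."""
    records = list(records)
    own = [r for r in records if r.self_attributed]
    other = [r for r in records if not r.self_attributed]
    return flip_rate(own) - flip_rate(other)


def pooled_flip_rate(records: Iterable[ChallengeRecord]) -> float:
    """Flip rate when the most effective pooled argument is used per question.

    Assumption: a question counts as flipped if any argument in the pool
    flipped the target model on that question.
    """
    flipped_by_question: dict[str, bool] = {}
    for r in records:
        previous = flipped_by_question.get(r.question_id, False)
        flipped_by_question[r.question_id] = previous or r.flipped
    if not flipped_by_question:
        raise ValueError("flip rate is undefined for an empty record set")
    return 100.0 * sum(flipped_by_question.values()) / len(flipped_by_question)
```

Under this reading, the pooled rate can never be lower than the rate of any single source model on the same questions, which is consistent with the abstract's claim that pooling yields stronger challenges.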