Who Flips? Self- and Cross-Model Counter-Arguments Reveal Answer Instability in LLMs

Abstract

Standard accuracy benchmarks are designed to test how closely large language models (LLMs) approach correct answers, but they are not suited to testing whether a model sticks with a correct answer when that answer is challenged by a plausible counter-argument. We introduce a controlled protocol for evaluating answer stability: after a model answers a multiple-choice question correctly, we challenge its answer with a coherent argument for an incorrect option and measure whether the model flips. The setup (a) isolates argumentative content from overt social pressure and (b) varies argument length, self-attribution, and cross-model source. Across seven frontier models and 57 MMLU subjects, flip rates range from 17.5% to 97.3%, revealing large differences in stability that accuracy metrics alone do not capture. We find that self-attribution consistently increases flip rates (mean +7.1pp, up to +18.7pp). In addition, pooling wrong-answer arguments across models and selecting the most effective one per question yields stronger adversarial challenges than relying on any single source model. We further construct MaxFlip, a curated challenge set that amplifies flips by up to +23.6pp over standard self-generated challenges. We release the protocol, challenge records, and MaxFlip to support stability evaluation alongside standard accuracy benchmarks. Materials are available at https://github.com/nafisenik/WhoFlips and https://hf.co/datasets/nafisehNik/WhoFlips.
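To make the two central quantities concrete, the following is a minimal sketch of how a flip rate and a pooled, per-question argument selection could be computed. The record layout, the model names, and the helper functions `flip_rate` and `best_source_per_question` are illustrative assumptions; they are not the schema of the released challenge records, and the selection rule shown is not necessarily the procedure used to build MaxFlip.

```python
from collections import defaultdict

# Hypothetical challenge records (made-up data, assumed layout):
# (question_id, source_model that wrote the argument, target_model challenged, flipped?)
records = [
    ("q1", "model_a", "model_x", True),
    ("q1", "model_a", "model_y", False),
    ("q1", "model_b", "model_x", True),
    ("q1", "model_b", "model_y", True),
    ("q2", "model_a", "model_x", True),
    ("q2", "model_a", "model_y", True),
    ("q2", "model_b", "model_x", False),
    ("q2", "model_b", "model_y", True),
]


def flip_rate(rows):
    """Fraction of challenges after which the target abandoned its correct answer."""
    rows = list(rows)
    return sum(r[3] for r in rows) / len(rows) if rows else 0.0


def best_source_per_question(rows):
    """For each question, pick the source model whose argument flipped the most targets."""
    outcomes = defaultdict(lambda: defaultdict(list))
    for qid, source, _target, flipped in rows:
        outcomes[qid][source].append(flipped)
    return {
        qid: max(by_source, key=lambda s: sum(by_source[s]) / len(by_source[s]))
        for qid, by_source in outcomes.items()
    }


# Flip rate when every argument comes from a single source model.
single_a = flip_rate(r for r in records if r[1] == "model_a")
single_b = flip_rate(r for r in records if r[1] == "model_b")

# Flip rate when the most effective argument is selected per question.
best = best_source_per_question(records)
pooled = flip_rate(r for r in records if best[r[0]] == r[1])

print(f"model_a only: {single_a:.2f}, model_b only: {single_b:.2f}, pooled: {pooled:.2f}")
```

In this toy data each single source reaches the same flip rate, while per-question selection does better because the strongest argument comes from a different source on each question. Selecting and scoring on the same targets, as done here for brevity, gives an optimistic estimate; a held-out set of target models would be needed for a fair comparison.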