Who Flips? Self- and Cross-Model Counter-Arguments Reveal Answer Instability in Large Language Models

Abstract

Standard accuracy benchmarks test whether large language models (LLMs) reach correct answers, but not whether they hold a correct answer when it is challenged by a plausible counter-argument. We introduce a controlled protocol for evaluating answer stability: after a model answers a multiple-choice question correctly, we challenge it with a coherent argument for an incorrect option and measure whether the model flips, that is, changes its answer. The setup (a) isolates argumentative content from overt social pressure and (b) varies argument length, self-attribution, and cross-model source. Across seven frontier models and 57 MMLU subjects, flip rates range from 17.5% to 97.3%, revealing large differences in stability that accuracy metrics alone do not capture. Self-attribution consistently increases flip rates (mean +7.1 pp, up to +18.7 pp). In addition, pooling wrong-answer arguments across models and selecting the most effective one per question yields stronger adversarial challenges than relying on any single source model. We further construct MaxFlip, a curated challenge set that raises flip rates by up to 23.6 pp over standard self-generated challenges. We release the protocol, challenge records, and MaxFlip to support stability evaluation alongside standard accuracy benchmarks. Materials are available at https://github.com/nafisenik/WhoFlips and https://hf.co/datasets/nafisehNik/WhoFlips.
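
To make the flip-rate metric concrete, the following is a minimal Python sketch and not the released implementation. The record type `ChallengeRecord`, its field names, and the function `flip_rate` are hypothetical. It assumes one record per (question, source model) pair, restricted to questions the target model first answered correctly. The pooled variant counts a question as flipped if the argument from any source model flipped it, which is our simplifying assumption about "selecting the most effective one per question" and may differ from how MaxFlip is actually built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChallengeRecord:
    """One challenge outcome (hypothetical schema, not the released format)."""
    question_id: str
    source_model: str  # model that wrote the wrong-answer argument
    flipped: bool      # target changed its initially correct answer


def flip_rate(records, source=None):
    """Return the flip rate in percent.

    With `source` set, only arguments from that source model are counted.
    With `source=None`, arguments are pooled and a question counts as
    flipped if any source's argument flipped it (assumed selection rule).
    """
    if source is not None:
        records = [r for r in records if r.source_model == source]

    flipped_by_question = {}
    for r in records:
        previous = flipped_by_question.get(r.question_id, False)
        flipped_by_question[r.question_id] = previous or r.flipped

    if not flipped_by_question:
        raise ValueError("no challenge records to score")
    return 100.0 * sum(flipped_by_question.values()) / len(flipped_by_question)


# Toy data for illustration only; these are not results from the paper.
toy_records = [
    ChallengeRecord("q1", "model_a", True),
    ChallengeRecord("q1", "model_b", False),
    ChallengeRecord("q2", "model_a", False),
    ChallengeRecord("q2", "model_b", True),
    ChallengeRecord("q3", "model_a", False),
    ChallengeRecord("q3", "model_b", False),
    ChallengeRecord("q4", "model_a", False),
    ChallengeRecord("q4", "model_b", False),
]

single_source = flip_rate(toy_records, source="model_a")
pooled = flip_rate(toy_records)
print(f"single source: {single_source:.1f}%, pooled: {pooled:.1f}%")
print(f"gain from pooling: {pooled - single_source:+.1f} pp")
```

Differences between conditions, such as the self-attribution effect, are reported in percentage points (pp), so they correspond to subtracting two such rates rather than taking a relative change.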