Who Flips Their Answer? Self- and Cross-Model Counter-Arguments Reveal Answer Instability in LLMs

Abstract

Standard accuracy benchmarks test whether large language models (LLMs) reach the correct answer, but they are not suited to testing whether a model sticks with a correct answer when that answer is challenged by a plausible counter-argument. We introduce a controlled protocol for evaluating answer stability: after a model answers a multiple-choice question correctly, we challenge its answer with a coherent argument for an incorrect option and measure whether the model flips. The setup (a) isolates argumentative content from overt social pressure and (b) varies argument length, self-attribution, and cross-model source. Across seven frontier models and 57 MMLU subjects, flip rates range from 17.5% to 97.3%, revealing large differences in stability that accuracy metrics alone do not capture. Self-attribution consistently increases flip rates (mean +7.1 pp, up to +18.7 pp). In addition, pooling wrong-answer arguments across models and selecting the most effective one per question yields stronger adversarial challenges than relying on any single source model. We further construct MaxFlip, a curated challenge set that raises flip rates by up to +23.6 pp over standard self-generated challenges. We release the protocol, challenge records, and MaxFlip to support stability evaluation alongside standard accuracy benchmarks. Materials are available at https://github.com/nafisenik/WhoFlips and https://hf.co/datasets/nafisehNik/WhoFlips.
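
As an illustration of the two quantities discussed above, the following minimal Python sketch computes a flip rate over challenge records and a pooled flip rate in which each question is challenged with the most effective argument from any source model. The record layout (`ChallengeRecord` and its fields), the function names, and the toy data are assumptions made for this example and are not taken from the released materials. The sketch also assumes that, with a binary flip outcome, "most effective per question" means a question counts as flipped if at least one pooled argument flipped the target; the released selection rule may differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ChallengeRecord:
    """Hypothetical record of one challenge (field names are illustrative)."""
    question_id: str
    source_model: str  # model that wrote the wrong-answer argument
    target_model: str  # model whose correct answer is challenged
    flipped: bool      # True if the target abandoned its correct answer


def flip_rate(records: Iterable[ChallengeRecord]) -> float:
    """Percentage of challenges after which the target changed its answer."""
    records = list(records)
    if not records:
        raise ValueError("flip_rate needs at least one record")
    return 100.0 * sum(r.flipped for r in records) / len(records)


def pooled_flip_rate(records: Iterable[ChallengeRecord], target_model: str) -> float:
    """Flip rate for one target when arguments are pooled across sources.

    Assumption: a question counts as flipped if any pooled argument
    flipped the target, i.e. the best argument per question is selected.
    """
    outcomes = defaultdict(list)
    for r in records:
        if r.target_model == target_model:
            outcomes[r.question_id].append(r.flipped)
    if not outcomes:
        raise ValueError(f"no records for target model {target_model!r}")
    return 100.0 * sum(any(v) for v in outcomes.values()) / len(outcomes)


# Toy data: target model "A" challenged by arguments from "A" and "B".
records = [
    ChallengeRecord("q1", "A", "A", False),
    ChallengeRecord("q1", "B", "A", True),
    ChallengeRecord("q2", "A", "A", True),
    ChallengeRecord("q2", "B", "A", False),
    ChallengeRecord("q3", "A", "A", False),
    ChallengeRecord("q3", "B", "A", False),
]

# Self-generated challenges: source and target are the same model.
self_generated = [r for r in records if r.source_model == r.target_model]
self_rate = flip_rate(self_generated)
pooled_rate = pooled_flip_rate(records, "A")

print(f"self-generated: {self_rate:.1f}%")          # self-generated: 33.3%
print(f"pooled best:    {pooled_rate:.1f}%")        # pooled best:    66.7%
print(f"gain: {pooled_rate - self_rate:+.1f} pp")   # gain: +33.3 pp
```

On this toy data, pooling flips two of three questions where self-generated arguments flip only one, which mirrors in miniature why a pooled challenge can be stronger than any single source. The numbers here are invented for illustration and say nothing about the reported results.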