Who Flips? Self- and Cross-Model Counterarguments Reveal Answer Instability in LLMs

Abstract

Standard accuracy benchmarks are designed to test how closely large language models (LLMs) approach correct answers, but are not suitable for testing whether LLMs stick with a correct answer when that answer is challenged by a plausible counter-argument. We introduce a controlled protocol for evaluating answer stability: after a model answers a multiple-choice question correctly, we challenge the model's answer with a coherent argument for an incorrect option and measure whether the model flips. The setup a) isolates argumentative content from overt social pressure and b) varies argument length, self-attribution, and cross-model source. Across seven frontier models and 57 MMLU subjects, flip rates range from 17.5% to 97.3%, revealing large differences in stability that are not captured by accuracy metrics alone. We find that self-attribution consistently increases flip rates (mean +7.1pp, up to +18.7pp). Also, pooling wrong-answer arguments across models and selecting the most effective one per question yields stronger adversarial challenges than relying on any single source model. We further construct MaxFlip, a curated challenge set that amplifies flips by up to +23.6pp over standard self-generated challenges. We release the protocol, challenge records, and MaxFlip to support stability evaluation alongside standard accuracy benchmarks. Materials are available at https://github.com/nafisenik/WhoFlips and https://hf.co/datasets/nafisehNik/WhoFlips.
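
To make the two central quantities concrete, the following is a minimal sketch of how a flip rate and a pooled per-question argument selection could be computed from challenge records. It is an illustration under stated assumptions, not the released implementation: the `ChallengeRecord` fields, the function names, and the toy data are hypothetical and do not reflect the schema of the published dataset. "Most effective" is assumed here to mean the source model whose argument flipped the largest number of target models on a question.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ChallengeRecord:
    # Hypothetical record layout, chosen for illustration only.
    question_id: str
    source_model: str  # model that wrote the wrong-answer argument
    target_model: str  # model whose correct answer is being challenged
    flipped: bool      # True if the target abandoned its correct answer


def flip_rate(records):
    """Percentage of challenges after which the target changed its answer."""
    records = list(records)
    if not records:
        raise ValueError("flip_rate needs at least one record")
    return 100.0 * sum(r.flipped for r in records) / len(records)


def pooled_best_source(records):
    """Per question, pick the source model whose argument flipped the most targets.

    Ties are broken alphabetically by source model name (an arbitrary choice).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for r in records:
        counts[r.question_id][r.source_model] += int(r.flipped)
    return {
        q: max(sorted(by_source), key=by_source.get)
        for q, by_source in counts.items()
    }


# Toy data: two questions, two source models (A, B), two target models (T1, T2).
records = [
    ChallengeRecord("q1", "A", "T1", True),
    ChallengeRecord("q1", "A", "T2", False),
    ChallengeRecord("q1", "B", "T1", True),
    ChallengeRecord("q1", "B", "T2", True),
    ChallengeRecord("q2", "A", "T1", False),
    ChallengeRecord("q2", "A", "T2", True),
    ChallengeRecord("q2", "B", "T1", False),
    ChallengeRecord("q2", "B", "T2", False),
]

overall = flip_rate(records)
best = pooled_best_source(records)
```

A percentage-point difference such as the self-attribution effect reported above would then be the difference between `flip_rate` on the self-attributed records and `flip_rate` on the matching records without attribution.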