Who Flips? Self- and Cross-Model Counter-Arguments Reveal Answer Instability in LLMs

Abstract

Standard accuracy benchmarks are designed to test how closely large language models (LLMs) approach correct answers, but they are not suited to testing whether a model sticks with a correct answer when that answer is challenged by a plausible counter-argument. We introduce a controlled protocol for evaluating answer stability: after a model answers a multiple-choice question correctly, we challenge its answer with a coherent argument for an incorrect option and measure whether the model flips. The setup (a) isolates argumentative content from overt social pressure and (b) varies argument length, self-attribution, and cross-model source. Across seven frontier models and 57 MMLU subjects, flip rates range from 17.5% to 97.3%, revealing large differences in stability that accuracy metrics alone do not capture. We find that self-attribution consistently increases flip rates (mean +7.1pp, up to +18.7pp). In addition, pooling wrong-answer arguments across models and selecting the most effective one per question yields stronger adversarial challenges than relying on any single source model. We further construct MaxFlip, a curated challenge set that amplifies flips by up to +23.6pp over standard self-generated challenges. We release the protocol, challenge records, and MaxFlip to support stability evaluation alongside standard accuracy benchmarks. Materials are available at https://github.com/nafisenik/WhoFlips and https://hf.co/datasets/nafisehNik/WhoFlips.
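As a rough illustration of the two quantities discussed above, the sketch below computes a per-source flip rate and a pooled flip rate on toy data. The record fields (`question_id`, `source`, `initial`, `final`) and the function names are hypothetical and are not taken from the released challenge records. The pooled rate assumes an oracle-style selection, in which a question counts as flipped if an argument from any source flips it; the selection rule actually used to build MaxFlip may differ.

```python
from collections import defaultdict


def flip_rate(records):
    """Percentage of challenges after which the model abandoned its initial answer."""
    if not records:
        raise ValueError("no challenge records")
    flips = sum(1 for r in records if r["final"] != r["initial"])
    return 100.0 * flips / len(records)


def pooled_flip_rate(records):
    """Flip rate when the most effective argument is chosen per question.

    Assumption: a question counts as flipped if any pooled argument flipped it.
    """
    if not records:
        raise ValueError("no challenge records")
    flipped = defaultdict(bool)
    for r in records:
        flipped[r["question_id"]] |= r["final"] != r["initial"]
    return 100.0 * sum(flipped.values()) / len(flipped)


# Toy records (hypothetical): one target model challenged with arguments
# written by two source models, A and B. "initial" is the correct answer.
records = [
    {"question_id": "q1", "source": "A", "initial": "B", "final": "B"},
    {"question_id": "q1", "source": "B", "initial": "B", "final": "C"},
    {"question_id": "q2", "source": "A", "initial": "A", "final": "D"},
    {"question_id": "q2", "source": "B", "initial": "A", "final": "A"},
    {"question_id": "q3", "source": "A", "initial": "C", "final": "C"},
    {"question_id": "q3", "source": "B", "initial": "C", "final": "C"},
    {"question_id": "q4", "source": "A", "initial": "D", "final": "A"},
    {"question_id": "q4", "source": "B", "initial": "D", "final": "B"},
]

rate_a = flip_rate([r for r in records if r["source"] == "A"])
rate_b = flip_rate([r for r in records if r["source"] == "B"])
rate_pooled = pooled_flip_rate(records)

print(rate_a, rate_b, rate_pooled)  # 50.0 50.0 75.0
print(f"pooling gain over source A: {rate_pooled - rate_a:+.1f}pp")
```

On this toy data each source alone flips half of the questions, while pooling flips three of four, which mirrors in miniature why a pooled challenge set can be stronger than any single source. Differences between conditions, such as self-attributed versus unattributed challenges, would be reported the same way, as a difference in percentage points between two flip rates.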