

Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?

September 4, 2025
作者: Qinyan Zhang, Xinping Lei, Ruijie Miao, Yu Fu, Haojie Fan, Le Chang, Jiafan Hou, Dingling Zhang, Zhongfei Hou, Ziqiang Yang, Changxin Pu, Fei Hu, Jingkai Liu, Mengyun Liu, Yang Liu, Xiang Gao, Jiaheng Liu, Tong Yang, Zaiyuan Wang, Ge Zhang, Wenhao Huang
cs.AI

Abstract

Large Language Models (LLMs) achieve strong performance on diverse tasks but often exhibit cognitive inertia, struggling to follow instructions that conflict with the standardized patterns learned during supervised fine-tuning (SFT). To evaluate this limitation, we propose Inverse IFEval, a benchmark that measures models' Counter-intuitive Ability, i.e., their capacity to override training-induced biases and comply with adversarial instructions. Inverse IFEval introduces eight types of such challenges, including Question Correction, Intentional Textual Flaws, Code without Comments, and Counterfactual Answering. Using a human-in-the-loop pipeline, we construct a dataset of 1012 high-quality Chinese and English questions across 23 domains, evaluated under an optimized LLM-as-a-Judge framework. Experiments on existing leading LLMs demonstrate the necessity of our proposed Inverse IFEval benchmark. Our findings emphasize that future alignment efforts should not only pursue fluency and factual correctness but also account for adaptability under unconventional contexts. We hope that Inverse IFEval serves as both a diagnostic tool and a foundation for developing methods that mitigate cognitive inertia, reduce overfitting to narrow patterns, and ultimately enhance the instruction-following reliability of LLMs in diverse and unpredictable real-world scenarios.
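To make one of the eight challenge types concrete, consider "Code without Comments": a model must suppress its SFT-induced habit of annotating generated code. The sketch below is a hypothetical rule-based check for that single category (the benchmark itself scores responses with an LLM-as-a-Judge, not this heuristic; the function name and examples are illustrative assumptions, not from the paper):

```python
import io
import tokenize


def violates_no_comment_rule(code: str) -> bool:
    """Return True if a Python answer contains any comment token.

    Hypothetical heuristic for the 'Code without Comments' challenge:
    a compliant answer should produce zero COMMENT tokens.
    """
    tokens = tokenize.generate_tokens(io.StringIO(code).readline)
    return any(tok.type == tokenize.COMMENT for tok in tokens)


# A compliant answer (no comments) vs. a non-compliant one.
compliant = "def add(a, b):\n    return a + b\n"
noncompliant = "def add(a, b):\n    return a + b  # sum the inputs\n"
```

A check like this captures why the task is adversarial: the instruction directly contradicts the "always document your code" convention reinforced during fine-tuning, so failures measure cognitive inertia rather than capability.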