

Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?

September 4, 2025
Authors: Qinyan Zhang, Xinping Lei, Ruijie Miao, Yu Fu, Haojie Fan, Le Chang, Jiafan Hou, Dingling Zhang, Zhongfei Hou, Ziqiang Yang, Changxin Pu, Fei Hu, Jingkai Liu, Mengyun Liu, Yang Liu, Xiang Gao, Jiaheng Liu, Tong Yang, Zaiyuan Wang, Ge Zhang, Wenhao Huang
cs.AI

Abstract

Large Language Models (LLMs) achieve strong performance on diverse tasks but often exhibit cognitive inertia, struggling to follow instructions that conflict with the standardized patterns learned during supervised fine-tuning (SFT). To evaluate this limitation, we propose Inverse IFEval, a benchmark that measures models' Counter-intuitive Ability, i.e., their capacity to override training-induced biases and comply with adversarial instructions. Inverse IFEval introduces eight types of such challenges, including Question Correction, Intentional Textual Flaws, Code without Comments, and Counterfactual Answering. Using a human-in-the-loop pipeline, we construct a dataset of 1,012 high-quality Chinese and English questions across 23 domains, evaluated under an optimized LLM-as-a-Judge framework. Experiments on existing leading LLMs demonstrate the necessity of our proposed Inverse IFEval benchmark. Our findings emphasize that future alignment efforts should not only pursue fluency and factual correctness but also account for adaptability under unconventional contexts. We hope that Inverse IFEval serves as both a diagnostic tool and a foundation for developing methods that mitigate cognitive inertia, reduce overfitting to narrow patterns, and ultimately enhance the instruction-following reliability of LLMs in diverse and unpredictable real-world scenarios.
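
The abstract does not include the benchmark's evaluation code. As a rough illustration of what an LLM-as-a-Judge scoring loop over Inverse IFEval-style items could look like, here is a minimal Python sketch; the item fields, the rubric text, and the `candidate`/`judge` callables are illustrative assumptions, not the benchmark's actual interface.

```python
# Minimal sketch (not the authors' released code) of an LLM-as-a-Judge loop
# over counter-intuitive instruction-following items. All field names and the
# PASS/FAIL verdict format are assumptions made for this illustration.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class InverseIFEvalItem:
    question: str        # the counter-intuitive instruction given to the model under test
    challenge_type: str  # one of the eight categories, e.g. "Counterfactual Answering"
    language: str        # "en" or "zh"
    rubric: str          # judge-facing criteria describing what compliance looks like


def evaluate(items: List[InverseIFEvalItem],
             candidate: Callable[[str], str],
             judge: Callable[[str], str]) -> float:
    """Return the fraction of items whose responses the judge marks as compliant."""
    compliant = 0
    for item in items:
        response = candidate(item.question)
        verdict = judge(
            "You are grading whether a response follows an unconventional instruction.\n"
            f"Instruction: {item.question}\n"
            f"Response: {response}\n"
            f"Criteria: {item.rubric}\n"
            "Reply with exactly PASS or FAIL."
        )
        if verdict.strip().upper().startswith("PASS"):
            compliant += 1
    return compliant / len(items) if items else 0.0
```

In practice, `candidate` and `judge` would wrap API calls to the model under test and to the judge model, and per-category accuracy would typically be reported alongside the overall score.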