Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?

Abstract

Large Language Models (LLMs) achieve strong performance on diverse tasks but often exhibit cognitive inertia, struggling to follow instructions that conflict with the standardized patterns learned during supervised fine-tuning (SFT). To evaluate this limitation, we propose Inverse IFEval, a benchmark that measures models' Counter-intuitive Ability: their capacity to override training-induced biases and comply with adversarial instructions. Inverse IFEval introduces eight types of such challenges, including Question Correction, Intentional Textual Flaws, Code without Comments, and Counterfactual Answering. Using a human-in-the-loop pipeline, we construct a dataset of 1012 high-quality Chinese and English questions across 23 domains, evaluated under an optimized LLM-as-a-Judge framework. Experiments on existing leading LLMs demonstrate the necessity of our proposed Inverse IFEval benchmark. Our findings emphasize that future alignment efforts should not only pursue fluency and factual correctness but also account for adaptability under unconventional contexts. We hope that Inverse IFEval serves as both a diagnostic tool and a foundation for developing methods that mitigate cognitive inertia, reduce overfitting to narrow patterns, and ultimately enhance the instruction-following reliability of LLMs in diverse and unpredictable real-world scenarios.

