Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?

Abstract

Large Language Models (LLMs) achieve strong performance on diverse tasks but often exhibit cognitive inertia, struggling to follow instructions that conflict with the standardized patterns learned during supervised fine-tuning (SFT). To evaluate this limitation, we propose Inverse IFEval, a benchmark that measures models' Counter-intuitive Ability: their capacity to override training-induced biases and comply with adversarial instructions. Inverse IFEval introduces eight types of such challenges, including Question Correction, Intentional Textual Flaws, Code without Comments, and Counterfactual Answering. Using a human-in-the-loop pipeline, we construct a dataset of 1012 high-quality Chinese and English questions across 23 domains, evaluated under an optimized LLM-as-a-Judge framework. Experiments on existing leading LLMs demonstrate the necessity of the proposed benchmark. Our findings emphasize that future alignment efforts should pursue not only fluency and factual correctness but also adaptability in unconventional contexts. We hope that Inverse IFEval serves as both a diagnostic tool and a foundation for developing methods that mitigate cognitive inertia, reduce overfitting to narrow patterns, and ultimately enhance the instruction-following reliability of LLMs in diverse and unpredictable real-world scenarios.
