

WildIFEval: Instruction Following in the Wild

March 9, 2025
Authors: Gili Lior, Asaf Yehudai, Ariel Gera, Liat Ein-Dor
cs.AI

Abstract

Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval, a large-scale dataset of 12K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints in natural user prompts. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. Our findings reveal that all evaluated models suffer performance degradation as the number of constraints increases, showing that all models still have substantial room for improvement on such tasks. Moreover, we observe that the specific type of constraint plays a critical role in model performance. We release our dataset to promote further research on instruction following under complex, realistic conditions.
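The evaluation described above scores each response against the individual constraints in its instruction and then aggregates by constraint count. The following is a minimal, hypothetical sketch of that aggregation step; the record schema (`constraints`, `satisfied`) and the example data are illustrative assumptions, not the dataset's actual format or the paper's exact scoring protocol.

```python
from collections import defaultdict

# Hypothetical per-instruction records: the constraints found in the prompt,
# and a per-constraint pass/fail judgment for one model's response.
# These field names and values are illustrative only.
records = [
    {"constraints": ["style", "length"], "satisfied": [True, False]},
    {"constraints": ["length"], "satisfied": [True]},
    {"constraints": ["style", "length", "format"], "satisfied": [True, True, False]},
]

def score_by_constraint_count(records):
    """Mean fraction of satisfied constraints, grouped by the number of
    constraints in the instruction (to expose degradation as count grows)."""
    buckets = defaultdict(list)
    for rec in records:
        n = len(rec["constraints"])
        buckets[n].append(sum(rec["satisfied"]) / n)
    return {n: sum(fracs) / len(fracs) for n, fracs in buckets.items()}

scores = score_by_constraint_count(records)
print(scores)  # e.g. {2: 0.5, 1: 1.0, 3: 0.666...}
```

Plotting such per-count averages for each model would reproduce the kind of degradation curve the abstract reports, and a second grouping key (the constraint class) would surface which of the eight categories drives the drop.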
