Instruction-following Evaluation through Verbalizer Manipulation
July 20, 2023
Authors: Shiyang Li, Jun Yan, Hai Wang, Zheng Tang, Xiang Ren, Vijay Srinivasan, Hongxia Jin
cs.AI
Abstract
While instruction-tuned models have shown remarkable success in various
natural language processing tasks, accurately evaluating their ability to
follow instructions remains challenging. Existing benchmarks primarily focus on
common instructions that align well with what the model learned during
training. However, proficiency in responding to these instructions does not
necessarily imply strong ability in instruction following. In this paper, we
propose a novel instruction-following evaluation protocol called verbalizer
manipulation. It instructs the model to verbalize the task label with words
aligning with model priors to different extents, adopting verbalizers from
highly aligned (e.g., outputting "positive" for positive sentiment) to
minimally aligned (e.g., outputting "negative" for positive sentiment).
Verbalizer manipulation can be seamlessly integrated with any classification
benchmark to examine the model's reliance on priors and its ability to override
them to accurately follow the instructions. We conduct a comprehensive
evaluation of four major model families across nine datasets, employing twelve
sets of verbalizers for each of them. We observe that the instruction-following
abilities of models, across different families and scales, are significantly
distinguished by their performance on less natural verbalizers. Even the
strongest GPT-4 model struggles to perform better than random guessing on the
most challenging verbalizer, emphasizing the need for continued advancements to
improve their instruction-following abilities.
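
The protocol described above can be illustrated with a minimal sketch for binary sentiment classification. Everything named here is hypothetical and chosen for illustration only: the `VERBALIZERS` table, the prompt wording, the intermediate "neutral" setting with arbitrary tokens, and the toy `prior_only_model` that ignores the instruction. None of it is taken from the paper's actual prompts, verbalizer sets, or results.

```python
# Illustrative sketch only: names, prompt wording, and the toy model are assumptions.

# Each setting maps a gold label to the word the model is told to output.
VERBALIZERS = {
    "natural":   {"positive": "positive", "negative": "negative"},  # highly aligned
    "neutral":   {"positive": "foo",      "negative": "bar"},       # assumed middle case
    "unnatural": {"positive": "negative", "negative": "positive"},  # minimally aligned
}

# Tiny hand-written examples (text, gold label), not from any benchmark.
EXAMPLES = [("A great film.", "positive"), ("Dull and tedious.", "negative")]


def build_prompt(text, verbalizer):
    """Build an instruction that tells the model which word to use for each label."""
    return (
        f'If the sentiment of the review is positive, answer "{verbalizer["positive"]}"; '
        f'if it is negative, answer "{verbalizer["negative"]}".\n'
        f"Review: {text}\nAnswer:"
    )


def accuracy(predictions, gold_labels, verbalizer):
    """Score predictions against the instructed word for each gold label."""
    expected = [verbalizer[g] for g in gold_labels]
    correct = sum(p.strip().lower() == e for p, e in zip(predictions, expected))
    return correct / len(expected)


def prior_only_model(prompt):
    """Toy stand-in for a model that ignores the instruction and follows its prior."""
    review = prompt.split("Review: ")[1].split("\n")[0]
    return "positive" if "great" in review else "negative"


def evaluate(model, setting):
    """Run one verbalizer setting over EXAMPLES and return accuracy."""
    verbalizer = VERBALIZERS[setting]
    preds = [model(build_prompt(text, verbalizer)) for text, _ in EXAMPLES]
    return accuracy(preds, [gold for _, gold in EXAMPLES], verbalizer)


if __name__ == "__main__":
    for setting in VERBALIZERS:
        print(setting, evaluate(prior_only_model, setting))
```

In this sketch a model that relies purely on its prior scores perfectly under the aligned setting and fails under the flipped one, which is the gap the protocol is meant to expose. Replacing `prior_only_model` with a call to a real model would be needed to measure anything meaningful.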