LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints
October 9, 2024
Authors: Thomas Palmeira Ferraz, Kartik Mehta, Yu-Hsiang Lin, Haw-Shiuan Chang, Shereen Oraby, Sijia Liu, Vivek Subramanian, Tagyoung Chung, Mohit Bansal, Nanyun Peng
cs.AI
Abstract
Instruction following is a key capability for LLMs. However, recent studies
have shown that LLMs often struggle with instructions containing multiple
constraints (e.g. a request to create a social media post "in a funny tone"
with "no hashtag"). Despite this, most evaluations focus solely on synthetic
data. To address this, we introduce RealInstruct, the first benchmark designed
to evaluate LLMs' ability to follow real-world multi-constrained instructions
by leveraging queries real users asked AI assistants. We also investigate
model-based evaluation as a cost-effective alternative to human annotation for
this task. Our findings reveal that even the proprietary GPT-4 model fails to
meet at least one constraint on over 21% of instructions, highlighting the
limitations of state-of-the-art models. To address the performance gap between
open-source and proprietary models, we propose the Decompose, Critique and
Refine (DeCRIM) self-correction pipeline, which enhances LLMs' ability to
follow constraints. DeCRIM works by decomposing the original instruction into a
list of constraints and using a Critic model to decide when and where the LLM's
response needs refinement. Our results show that DeCRIM improves Mistral's
performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback.
Moreover, we demonstrate that with strong feedback, open-source LLMs with
DeCRIM can outperform GPT-4 on both benchmarks.
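
Below is a minimal Python sketch of the Decompose, Critique, and Refine loop the abstract describes. It is an illustration under stated assumptions, not the paper's implementation: `call_llm`, the prompt templates, and `max_rounds` are all hypothetical placeholders for whatever models and prompts the pipeline actually uses.

```python
# Minimal sketch of the DeCRIM (Decompose, Critique, Refine) loop.
# `call_llm` is a hypothetical stand-in for any text-in, text-out model
# call; the prompts below are illustrative, not the paper's templates.
from typing import Callable, List


def decrim(instruction: str, call_llm: Callable[[str], str],
           max_rounds: int = 3) -> str:
    # Decompose: turn the instruction into an explicit constraint list.
    constraints: List[str] = [
        c.strip() for c in call_llm(
            "List each constraint in this instruction, one per line:\n"
            + instruction
        ).splitlines() if c.strip()
    ]

    # Initial response to the original instruction.
    response = call_llm(instruction)

    for _ in range(max_rounds):
        # Critique: a critic model flags which constraints are violated,
        # i.e. decides when and where the response needs refinement.
        violations = [
            c for c in constraints
            if call_llm(
                f"Constraint: {c}\nResponse: {response}\n"
                "Is the constraint satisfied? Answer yes or no."
            ).strip().lower().startswith("no")
        ]
        if not violations:
            return response  # all constraints met; no refinement needed

        # Refine: revise the response to fix only the flagged constraints.
        response = call_llm(
            f"Instruction: {instruction}\nDraft response: {response}\n"
            "Revise the draft so that it satisfies these constraints:\n"
            + "\n".join(violations)
        )
    return response
```

In this reading, the abstract's weak versus strong feedback plausibly corresponds to the quality of the model behind the critique step: a weaker critic gives noisier violation flags, while a stronger one pinpoints exactly which constraints to repair.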