LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints
October 9, 2024
Authors: Thomas Palmeira Ferraz, Kartik Mehta, Yu-Hsiang Lin, Haw-Shiuan Chang, Shereen Oraby, Sijia Liu, Vivek Subramanian, Tagyoung Chung, Mohit Bansal, Nanyun Peng
cs.AI
Abstract
Instruction following is a key capability for LLMs. However, recent studies
have shown that LLMs often struggle with instructions containing multiple
constraints (e.g., a request to create a social media post "in a funny tone"
with "no hashtag"). Despite this, most evaluations focus solely on synthetic
data. To address this, we introduce RealInstruct, the first benchmark designed
to evaluate LLMs' ability to follow real-world multi-constrained instructions
by leveraging queries real users asked AI assistants. We also investigate
model-based evaluation as a cost-effective alternative to human annotation for
this task. Our findings reveal that even the proprietary GPT-4 model fails to
meet at least one constraint on over 21% of instructions, highlighting the
limitations of state-of-the-art models. To address the performance gap between
open-source and proprietary models, we propose the Decompose, Critique, and
Refine (DeCRIM) self-correction pipeline, which enhances LLMs' ability to
follow constraints. DeCRIM works by decomposing the original instruction into a
list of constraints and using a Critic model to decide when and where the LLM's
response needs refinement. Our results show that DeCRIM improves Mistral's
performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback.
Moreover, we demonstrate that with strong feedback, open-source LLMs with
DeCRIM can outperform GPT-4 on both benchmarks.
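The Decompose-Critique-Refine loop described above lends itself to a compact sketch. Below is a minimal Python rendering of that pipeline; the `call_llm` helper, the prompt wording, and the `max_rounds` cap are illustrative assumptions standing in for whatever models and prompts the paper actually uses, not the authors' implementation.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around any chat-completion API (assumption)."""
    raise NotImplementedError


def decompose(instruction: str) -> list[str]:
    # Decompose: turn the original instruction into an explicit constraint list.
    raw = call_llm(
        "List each individual constraint in this instruction, one per line:\n"
        + instruction
    )
    return [line.strip() for line in raw.splitlines() if line.strip()]


def critique(response: str, constraint: str) -> bool:
    # Critique: ask a Critic model whether the response meets one constraint.
    verdict = call_llm(
        f"Response:\n{response}\n\n"
        f"Does it satisfy this constraint: {constraint}? Answer YES or NO."
    )
    return verdict.strip().upper().startswith("YES")


def decrim(instruction: str, max_rounds: int = 3) -> str:
    response = call_llm(instruction)       # initial attempt
    constraints = decompose(instruction)   # Decompose
    for _ in range(max_rounds):
        # Critique: collect the constraints the current response fails.
        failed = [c for c in constraints if not critique(response, c)]
        if not failed:
            break                          # all constraints satisfied
        # Refine: revise the response, targeting only the unmet constraints.
        response = call_llm(
            f"Instruction: {instruction}\n"
            f"Previous response: {response}\n"
            "Revise the response so it also satisfies these unmet constraints:\n"
            + "\n".join(f"- {c}" for c in failed)
        )
    return response
```

In the abstract's terms, the "weak" versus "strong" feedback settings presumably correspond to the capability of the model plugged in behind `critique`: the reported gains for Mistral hold even when that Critic feedback is weak.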