LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints

Abstract

Instruction following is a key capability for LLMs. However, recent studies have shown that LLMs often struggle with instructions containing multiple constraints (e.g., a request to create a social media post "in a funny tone" with "no hashtag"). Despite this, most evaluations focus solely on synthetic data. To address this, we introduce RealInstruct, the first benchmark designed to evaluate LLMs' ability to follow real-world multi-constrained instructions by leveraging queries that real users asked AI assistants. We also investigate model-based evaluation as a cost-effective alternative to human annotation for this task. Our findings reveal that even the proprietary GPT-4 model fails to meet at least one constraint on over 21% of instructions, highlighting the limitations of state-of-the-art models. To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline, which enhances LLMs' ability to follow constraints. DeCRIM works by decomposing the original instruction into a list of constraints and using a Critic model to decide when and where the LLM's response needs refinement. Our results show that DeCRIM improves Mistral's performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback. Moreover, we demonstrate that with strong feedback, open-source LLMs with DeCRIM can outperform GPT-4 on both benchmarks.
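
The decompose, critique, and refine loop described in the abstract can be pictured roughly as follows. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function `decrim`, its parameters, the `max_rounds` cap, and the rule-based `toy_*` helpers are all hypothetical stand-ins, whereas an actual pipeline would presumably prompt LLMs for each stage.

```python
from typing import Callable, List


def decrim(
    instruction: str,
    generate: Callable[[str], str],
    decompose: Callable[[str], List[str]],
    critique: Callable[[str, str], bool],
    refine: Callable[[str, str, List[str]], str],
    max_rounds: int = 3,  # assumption: the abstract does not state a round limit
) -> str:
    """Hypothetical decompose-critique-refine loop."""
    # Decompose: turn the instruction into a list of constraints.
    constraints = decompose(instruction)
    # Initial response from the underlying model.
    response = generate(instruction)
    for _ in range(max_rounds):
        # Critique: decide WHERE the response fails (which constraints).
        failed = [c for c in constraints if not critique(response, c)]
        # Decide WHEN to refine: stop as soon as nothing is violated.
        if not failed:
            break
        # Refine: revise the response using the failed constraints as feedback.
        response = refine(instruction, response, failed)
    return response


# Toy rule-based stand-ins (hypothetical; used only to make the sketch runnable).
def toy_generate(instruction: str) -> str:
    return "Big launch today! #excited"


def toy_decompose(instruction: str) -> List[str]:
    return ["no hashtag", "at most 5 words"]


def toy_critique(response: str, constraint: str) -> bool:
    if constraint == "no hashtag":
        return "#" not in response
    if constraint == "at most 5 words":
        return len(response.split()) <= 5
    return True


def toy_refine(instruction: str, response: str, failed: List[str]) -> str:
    words = response.split()
    if "no hashtag" in failed:
        words = [w for w in words if not w.startswith("#")]
    if "at most 5 words" in failed:
        words = words[:5]
    return " ".join(words)


final = decrim(
    "Write a short social media post with no hashtag.",
    toy_generate, toy_decompose, toy_critique, toy_refine,
)
print(final)
```

In this toy run the Critic flags only the "no hashtag" constraint, so a single refinement round removes the hashtag and the loop then stops. Swapping in a weaker or stronger `critique` function is one way to picture the weak-feedback versus strong-feedback settings mentioned in the abstract.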
