LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints

Abstract

Instruction following is a key capability for LLMs. However, recent studies have shown that LLMs often struggle with instructions containing multiple constraints (e.g., a request to create a social media post "in a funny tone" with "no hashtag"). Despite this, most evaluations focus solely on synthetic data. To address this, we introduce RealInstruct, the first benchmark designed to evaluate LLMs' ability to follow real-world multi-constrained instructions by leveraging queries that real users asked AI assistants. We also investigate model-based evaluation as a cost-effective alternative to human annotation for this task. Our findings reveal that even the proprietary GPT-4 model fails to meet at least one constraint on over 21% of instructions, highlighting the limitations of state-of-the-art models. To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline, which enhances LLMs' ability to follow constraints. DeCRIM works by decomposing the original instruction into a list of constraints and using a Critic model to decide when and where the LLM's response needs refinement. Our results show that DeCRIM improves Mistral's performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback. Moreover, we demonstrate that with strong feedback, open-source LLMs with DeCRIM can outperform GPT-4 on both benchmarks.
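The decompose, critique, and refine loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `decrim`, its parameters, the `max_rounds` cap, and the toy stand-ins for the LLM and Critic are all hypothetical, chosen only to show the control flow under the assumption that each stage can be modeled as a plain callable.

```python
from typing import Callable, List

def decrim(
    instruction: str,
    generate: Callable[[str], str],                   # hypothetical: LLM producing a first response
    decompose: Callable[[str], List[str]],            # hypothetical: instruction -> list of constraints
    critic: Callable[[str, str], bool],               # hypothetical: True if response satisfies constraint
    refine: Callable[[str, str, List[str]], str],     # hypothetical: rewrite given violated constraints
    max_rounds: int = 3,                              # assumption: the abstract gives no round limit
) -> str:
    """Illustrative decompose -> critique -> refine loop."""
    constraints = decompose(instruction)
    response = generate(instruction)
    for _ in range(max_rounds):
        # The critic decides *when* (any violation) and *where* (which constraints) to refine.
        violated = [c for c in constraints if not critic(response, c)]
        if not violated:
            break
        response = refine(instruction, response, violated)
    return response

# Toy stand-ins so the sketch runs without any model; real stages would call LLMs.
def toy_generate(instruction: str) -> str:
    return "Mondays: coffee first, questions later. #mood"

def toy_decompose(instruction: str) -> List[str]:
    return ["no hashtag"]

def toy_critic(response: str, constraint: str) -> bool:
    if constraint == "no hashtag":
        return "#" not in response
    return True

def toy_refine(instruction: str, response: str, violated: List[str]) -> str:
    if "no hashtag" in violated:
        return " ".join(w for w in response.split() if not w.startswith("#"))
    return response

if __name__ == "__main__":
    print(decrim("Write a funny post with no hashtag",
                 toy_generate, toy_decompose, toy_critic, toy_refine))
```

In this toy run the first draft violates the "no hashtag" constraint, the critic flags it, and one refinement round removes the hashtag. Passing the stages in as callables keeps the loop independent of any particular model or feedback strength.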
