RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs

Abstract

Despite their unprecedented success, even the largest language models make mistakes. Much as humans learn and improve from feedback, previous work has proposed giving language models natural language feedback to guide them in repairing their outputs. Because human-written critiques are expensive to obtain, researchers have devised learned critique generators in place of human critics, assuming that downstream models can be trained to use the generated feedback. However, this approach does not apply to black-box or limited-access models such as ChatGPT, which cannot be fine-tuned. Moreover, in the era of large general-purpose language agents, fine-tuning is neither computationally nor spatially efficient, as it results in multiple copies of the network. In this work, we introduce RL4F (Reinforcement Learning for Feedback), a multi-agent collaborative framework in which the critique generator is trained to maximize the end-task performance of GPT-3, a fixed model more than 200 times its size. RL4F produces critiques that help GPT-3 revise its outputs. We study three datasets for action planning, summarization, and alphabetization, and show improvements (~5% on average) in multiple text similarity metrics over strong baselines across all three tasks.
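To make the training signal concrete, here is a minimal sketch of how a critique could be scored by the quality of the revision it produces from a frozen model. It is an illustration under stated assumptions, not the authors' implementation: the names `similarity`, `critique_reward`, `toy_critic`, and `toy_reviser` are hypothetical, the `difflib`-based metric merely stands in for the text similarity metrics mentioned in the abstract, and the reinforcement learning update that would consume this reward is omitted.

```python
from difflib import SequenceMatcher
from typing import Callable


def similarity(candidate: str, reference: str) -> float:
    """Order-sensitive token similarity in [0, 1] (assumed stand-in metric)."""
    return SequenceMatcher(None, candidate.split(), reference.split()).ratio()


def critique_reward(
    task_input: str,
    initial_output: str,
    reference: str,
    critic: Callable[[str, str], str],
    frozen_reviser: Callable[[str, str, str], str],
) -> float:
    """Score a critique by how close the frozen model's revision is to the reference."""
    # The trainable critic comments on the initial output.
    critique = critic(task_input, initial_output)
    # The fixed model (never updated) revises its output using the critique.
    revision = frozen_reviser(task_input, initial_output, critique)
    # End-task performance of the revision is the critic's reward.
    return similarity(revision, reference)


# Toy stand-ins (hypothetical): a real setup would use a small trainable
# critic and a fixed large model reached through an API.
def toy_critic(task_input: str, initial_output: str) -> str:
    return "The words are not in alphabetical order."


def toy_reviser(task_input: str, initial_output: str, critique: str) -> str:
    # Revises only when the critique flags an ordering problem.
    if "alphabetical" in critique:
        return " ".join(sorted(task_input.split()))
    return initial_output


reward = critique_reward(
    "pear apple fig", "pear apple fig", "apple fig pear", toy_critic, toy_reviser
)
print(reward)
```

In this toy alphabetization case the critique leads the reviser to the reference ordering, so the reward is 1.0, while an empty critique leaves the output unchanged and scores lower. Only the critic would receive gradient updates; the reviser is treated as a black box, which is the setting the abstract describes.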
