RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs
May 15, 2023
Authors: Afra Feyza Akyürek, Ekin Akyürek, Aman Madaan, Ashwin Kalyan, Peter Clark, Derry Wijaya, Niket Tandon
cs.AI
Abstract
Despite their unprecedented success, even the largest language models make
mistakes. Similar to how humans learn and improve using feedback, previous work
proposed providing language models with natural language feedback to guide them
in repairing their outputs. Because human-generated critiques are expensive to
obtain, researchers have devised learned critique generators in lieu of human
critics while assuming one can train downstream models to utilize generated
feedback. However, this approach does not apply to black-box or limited-access
models such as ChatGPT, as they cannot be fine-tuned. Moreover, in the era of
large general-purpose language agents, fine-tuning is neither computationally
nor spatially efficient as it results in multiple copies of the network. In
this work, we introduce RL4F (Reinforcement Learning for Feedback), a
multi-agent collaborative framework where the critique generator is trained to
maximize end-task performance of GPT-3, a fixed model more than 200 times its
size. RL4F produces critiques that help GPT-3 revise its outputs. We study
three datasets for action planning, summarization and alphabetization and show
improvements (~5% on average) in multiple text similarity metrics over strong
baselines across all three tasks.
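
The training signal described above can be illustrated with a minimal sketch: a critique is scored by how close the frozen model's revised output is to a reference. Everything named here is a hypothetical stand-in and not the authors' implementation: `toy_reviser` plays the role of the fixed task model (GPT-3 in the paper), `good_critic` and `bad_critic` play the role of the trainable critique generator, and `sequence_similarity` (built on Python's `difflib`) substitutes for the text similarity metrics the abstract mentions. The reinforcement learning update that would consume this reward is not shown.

```python
from difflib import SequenceMatcher
from typing import Callable

Critic = Callable[[str, str], str]        # (task_input, initial_output) -> critique
Reviser = Callable[[str, str, str], str]  # (task_input, initial_output, critique) -> revision


def sequence_similarity(prediction: str, reference: str) -> float:
    """Order-sensitive token similarity in [0, 1]; a stand-in metric (assumption)."""
    return SequenceMatcher(None, prediction.split(), reference.split()).ratio()


def critique_reward(task_input: str, initial_output: str, reference: str,
                    critic: Critic, frozen_reviser: Reviser) -> float:
    """Score a critique by the quality of the revision it elicits.

    The reviser is never updated; only the critic would be trained on this reward.
    """
    critique = critic(task_input, initial_output)
    revised = frozen_reviser(task_input, initial_output, critique)
    return sequence_similarity(revised, reference)


# --- Hypothetical toy components for an alphabetization task ---

def toy_reviser(task_input: str, initial_output: str, critique: str) -> str:
    """Pretend frozen model: it fixes the ordering only if the critique asks for it."""
    if "sort" in critique.lower():
        return " ".join(sorted(task_input.split()))
    return initial_output


def good_critic(task_input: str, initial_output: str) -> str:
    return "The words are not in alphabetical order; sort them."


def bad_critic(task_input: str, initial_output: str) -> str:
    return "Looks good."


if __name__ == "__main__":
    task_input = "pear apple fig"
    initial_output = "pear apple fig"   # the frozen model's first, incorrect attempt
    reference = "apple fig pear"
    for critic in (good_critic, bad_critic):
        reward = critique_reward(task_input, initial_output, reference,
                                 critic, toy_reviser)
        print(critic.__name__, round(reward, 3))
```

In this sketch a useful critique earns a higher reward than an uninformative one, which is the property a critic trained on end-task performance would exploit. The design also leaves the task model untouched, which matches the black-box setting the abstract targets.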