

Bootstrapping LLM-based Task-Oriented Dialogue Agents via Self-Talk

January 10, 2024
Authors: Dennis Ulmer, Elman Mansimov, Kaixiang Lin, Justin Sun, Xibin Gao, Yi Zhang
cs.AI

Abstract

Large language models (LLMs) are powerful dialogue agents, but specializing them towards fulfilling a specific function can be challenging. Instruction tuning, i.e. tuning models on instructions and sample responses generated by humans (Ouyang et al., 2022), has proven to be an effective method for doing so, yet it requires a number of data samples that a) might not be available or b) are costly to generate. Furthermore, this cost increases when the goal is to make the LLM follow a specific workflow within a dialogue instead of single instructions. Inspired by the self-play technique in reinforcement learning and the use of LLMs to simulate human agents, we propose a more effective method for data collection through LLMs engaging in a conversation in various roles. This approach generates training data via "self-talk" of LLMs that can be refined and utilized for supervised fine-tuning. We introduce an automated way to measure the (partial) success of a dialogue. This metric is used to filter the generated conversational data that is fed back into the LLM for training. Based on our automated and human evaluations of conversation quality, we demonstrate that such self-talk data improves results. In addition, we examine the various characteristics that showcase the quality of generated dialogues and how they can be connected to their potential utility as training data.
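The pipeline the abstract describes — two LLM roles conversing, an automated (partial) success metric, and a filter that keeps only successful dialogues for fine-tuning — can be sketched as follows. This is a minimal illustration, not the paper's implementation: `chat` is a hypothetical stand-in for any LLM completion call, and the role prompts, keyword-based workflow metric, and threshold are assumptions made for the example.

```python
# Hedged sketch of a self-talk data-collection loop.
# All names here (`chat`, `self_talk`, `dialogue_success`,
# `build_training_data`) are illustrative, not from the paper.

def chat(system_prompt, history):
    """Placeholder for an LLM call; returns the next utterance.

    A real pipeline would query a model conditioned on the role's
    system prompt and the dialogue history so far.
    """
    return f"({system_prompt.split()[0]} reply #{len(history)})"

def self_talk(agent_prompt, client_prompt, num_turns=4):
    """Let two LLM 'roles' (client and agent) converse with each other."""
    history = []
    for turn in range(num_turns):
        prompt = client_prompt if turn % 2 == 0 else agent_prompt
        history.append(chat(prompt, history))
    return history

def dialogue_success(history, required_steps):
    """Automated (partial) success: fraction of workflow steps that
    appear in the dialogue. A crude proxy for the paper's metric."""
    text = " ".join(history).lower()
    hit = sum(step in text for step in required_steps)
    return hit / len(required_steps)

def build_training_data(dialogues, required_steps, threshold=0.5):
    """Keep only dialogues whose success score clears the threshold;
    the survivors would then be used for supervised fine-tuning."""
    return [d for d in dialogues
            if dialogue_success(d, required_steps) >= threshold]
```

The key design point mirrored here is that generation and quality control are decoupled: the self-talk loop produces candidate dialogues cheaply, and the success metric acts as an automatic filter so that only (partially) successful workflows are fed back into training.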