

Response Tuning: Aligning Large Language Models without Instruction

October 3, 2024
作者: Seokhyun An, Hyounghun Kim
cs.AI

Abstract

Instruction tuning (supervised fine-tuning using instruction-response pairs) is a foundational step in transitioning pre-trained Large Language Models (LLMs) into helpful and safe chat assistants. Our hypothesis is that establishing an adequate output space can enable such a transition given the capabilities inherent in pre-trained LLMs. To verify this, we propose Response Tuning (RT), which eliminates the instruction-conditioning step in instruction tuning and solely focuses on response space supervision. Our experiments demonstrate that RT models, trained only using responses, can effectively respond to a wide range of instructions and exhibit helpfulness comparable to that of their instruction-tuned counterparts. Furthermore, we observe that controlling the training response distribution can significantly improve their user preference or elicit target behaviors such as refusing assistance for unsafe queries. Our findings illuminate the role of establishing an adequate output space in alignment, highlighting the potential of the extensive inherent capabilities of pre-trained LLMs.
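The abstract's central contrast, supervising the response space alone rather than conditioning on instructions, can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt template, the Hugging Face-style tokenizer interface, and the use of -100 as the ignored-label index (PyTorch's cross-entropy ignore_index) are illustrative assumptions, not the paper's exact recipe.

    def build_instruction_tuning_example(tokenizer, instruction, response):
        """Standard instruction tuning: condition on the instruction and
        compute the loss only on the response tokens (hypothetical template)."""
        prompt_ids = tokenizer.encode(f"### Instruction:\n{instruction}\n\n### Response:\n")
        response_ids = tokenizer.encode(response) + [tokenizer.eos_token_id]
        input_ids = prompt_ids + response_ids
        # Mask the instruction prefix so the loss covers only the response.
        labels = [-100] * len(prompt_ids) + response_ids
        return input_ids, labels

    def build_response_tuning_example(tokenizer, response):
        """Response Tuning (RT) as described in the abstract: drop the
        instruction conditioning and supervise only the response distribution."""
        response_ids = tokenizer.encode(f"### Response:\n{response}") + [tokenizer.eos_token_id]
        # Every token is a target; the model learns only the output space.
        return response_ids, list(response_ids)

In this reading, both setups train a causal LM with the same loss; RT simply omits the instruction prefix from the inputs, so any resulting helpfulness must come from capabilities already present in the pre-trained model.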
