Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse RL
October 16, 2024
Authors: Jared Joselowitz, Arjun Jagota, Satyapriya Krishna, Sonali Parbhoo
cs.AI
Abstract
Large language models (LLMs) trained with Reinforcement Learning from Human
Feedback (RLHF) have demonstrated remarkable capabilities, but their underlying
reward functions and decision-making processes remain opaque. This paper
introduces a novel approach to interpreting LLMs by applying inverse
reinforcement learning (IRL) to recover their implicit reward functions. We
conduct experiments on toxicity-aligned LLMs of varying sizes, extracting
reward models that achieve up to 80.40% accuracy in predicting human
preferences. Our analysis reveals key insights into the non-identifiability of
reward functions, the relationship between model size and interpretability, and
potential pitfalls in the RLHF process. We demonstrate that IRL-derived reward
models can be used to fine-tune new LLMs, resulting in comparable or improved
performance on toxicity benchmarks. This work provides a new lens for
understanding and improving LLM alignment, with implications for the
responsible development and deployment of these powerful systems.
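
As a rough illustration of the kind of reward-model extraction the abstract describes (not the authors' actual IRL pipeline or data), the sketch below fits a scalar reward model to (preferred, rejected) response pairs with a pairwise Bradley-Terry objective and reports preference-prediction accuracy in the same sense as the figure quoted above. The feature dimension, synthetic response features, MLP architecture, and training settings are all placeholder assumptions.

```python
# A minimal sketch, assuming synthetic data: fit a scalar reward model on
# (preferred, rejected) pairs with a Bradley-Terry objective so that preferred
# responses receive higher reward. NOT the paper's implementation; the feature
# vectors stand in for representations of the aligned LLM's generations.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

FEATURE_DIM = 128  # assumed dimensionality of a response representation
N_PAIRS = 512      # assumed number of preference pairs

# Synthetic stand-in features; in practice these would be derived from the
# toxicity-aligned LLM's generations on a preference dataset.
preferred = torch.randn(N_PAIRS, FEATURE_DIM) + 0.5
rejected = torch.randn(N_PAIRS, FEATURE_DIM) - 0.5

# Reward model: a small MLP mapping a response representation to a scalar reward.
reward_model = nn.Sequential(
    nn.Linear(FEATURE_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

for step in range(200):
    r_pref = reward_model(preferred).squeeze(-1)
    r_rej = reward_model(rejected).squeeze(-1)
    # Pairwise logistic (Bradley-Terry) loss: maximize P(preferred > rejected).
    loss = -F.logsigmoid(r_pref - r_rej).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Preference-prediction accuracy: fraction of pairs where the learned reward
# ranks the preferred response above the rejected one.
with torch.no_grad():
    acc = (reward_model(preferred) > reward_model(rejected)).float().mean().item()
print(f"preference-prediction accuracy (training pairs): {acc:.2%}")
```

A reward model recovered this way could, in principle, then serve as the reward signal when fine-tuning a new policy with RL, paralleling the fine-tuning experiment the abstract mentions; the details of that step are specific to the paper and not reproduced here.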