Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse RL
October 16, 2024
Authors: Jared Joselowitz, Arjun Jagota, Satyapriya Krishna, Sonali Parbhoo
cs.AI
Abstract
Large language models (LLMs) trained with Reinforcement Learning from Human
Feedback (RLHF) have demonstrated remarkable capabilities, but their underlying
reward functions and decision-making processes remain opaque. This paper
introduces a novel approach to interpreting LLMs by applying inverse
reinforcement learning (IRL) to recover their implicit reward functions. We
conduct experiments on toxicity-aligned LLMs of varying sizes, extracting
reward models that achieve up to 80.40% accuracy in predicting human
preferences. Our analysis reveals key insights into the non-identifiability of
reward functions, the relationship between model size and interpretability, and
potential pitfalls in the RLHF process. We demonstrate that IRL-derived reward
models can be used to fine-tune new LLMs, resulting in comparable or improved
performance on toxicity benchmarks. This work provides a new lens for
understanding and improving LLM alignment, with implications for the
responsible development and deployment of these powerful systems.
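
As a rough illustration of the kind of reward-model extraction the abstract describes (not the authors' actual IRL pipeline or data), the sketch below fits a scalar reward model to (preferred, rejected) response pairs with a pairwise Bradley-Terry objective and reports preference-prediction accuracy in the same sense as the figure quoted above. The feature dimension, synthetic response features, MLP architecture, and training settings are all placeholder assumptions.

```python
# A minimal sketch, assuming synthetic data: fit a scalar reward model on
# (preferred, rejected) pairs with a Bradley-Terry objective so that preferred
# responses receive higher reward. NOT the paper's implementation; the feature
# vectors stand in for representations of the aligned LLM's generations.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

FEATURE_DIM = 128  # assumed dimensionality of a response representation
N_PAIRS = 512      # assumed number of preference pairs

# Synthetic stand-in features; in practice these would be derived from the
# toxicity-aligned LLM's generations on a preference dataset.
preferred = torch.randn(N_PAIRS, FEATURE_DIM) + 0.5
rejected = torch.randn(N_PAIRS, FEATURE_DIM) - 0.5

# Reward model: a small MLP mapping a response representation to a scalar reward.
reward_model = nn.Sequential(
    nn.Linear(FEATURE_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

for step in range(200):
    r_pref = reward_model(preferred).squeeze(-1)
    r_rej = reward_model(rejected).squeeze(-1)
    # Pairwise logistic (Bradley-Terry) loss: maximize P(preferred > rejected).
    loss = -F.logsigmoid(r_pref - r_rej).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Preference-prediction accuracy: fraction of pairs where the learned reward
# ranks the preferred response above the rejected one.
with torch.no_grad():
    acc = (reward_model(preferred) > reward_model(rejected)).float().mean().item()
print(f"preference-prediction accuracy (training pairs): {acc:.2%}")
```

A reward model recovered this way could, in principle, then serve as the reward signal when fine-tuning a new policy with RL, paralleling the fine-tuning experiment the abstract mentions; the details of that step are specific to the paper and not reproduced here.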