Almost Surely Safe Alignment of Large Language Models at Inference-Time

February 3, 2025
作者: Xiaotong Ji, Shyam Sundhar Ramesh, Matthieu Zimmer, Ilija Bogunovic, Jun Wang, Haitham Bou Ammar
cs.AI

Abstract

Even highly capable large language models (LLMs) can produce biased or unsafe responses, and alignment techniques aimed at mitigating this issue, such as RLHF, are expensive and prone to overfitting because they retrain the LLM. This paper introduces a novel inference-time alignment approach that ensures LLMs generate safe responses almost surely, i.e., with probability approaching one. We achieve this by framing the safe generation of inference-time responses as a constrained Markov decision process (MDP) within the LLM's latent space. Crucially, we augment the state with a safety state that tracks the evolution of the safety constraints, which enables us to establish formal safety guarantees upon solving the MDP in the latent space. Building on this foundation, we propose InferenceGuard, a practical implementation that safely aligns LLMs without modifying the model weights. Empirically, we demonstrate that InferenceGuard effectively balances safety and task performance, outperforming existing inference-time alignment methods in generating safe and aligned responses.
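
To make the augmented-safety-state idea concrete, below is a minimal, hypothetical sketch of a greedy decoding loop that carries such a state: at each step, candidate tokens whose estimated safety cost would exhaust the remaining budget are filtered out, while the LLM's weights are never modified. All names here (`lm_logits`, `safety_cost`, `SAFETY_BUDGET`) are illustrative assumptions, not the paper's InferenceGuard implementation, which operates in the LLM's latent space and solves the constrained MDP with formal guarantees.

```python
# Minimal illustrative sketch (not the paper's InferenceGuard implementation):
# decoding carries an explicit "safety state" -- the remaining safety budget --
# and only continues with tokens whose estimated safety cost keeps that budget
# non-negative. `lm_logits`, `safety_cost`, and SAFETY_BUDGET are hypothetical
# placeholders standing in for the frozen LLM and a learned cost/critic model.

import random

SAFETY_BUDGET = 1.0   # hypothetical total safety cost allowed per response
MAX_LEN = 32          # maximum number of generation steps in this toy example


def lm_logits(prefix: str) -> dict[str, float]:
    """Placeholder for the frozen LLM's next-token scores (weights untouched)."""
    vocab = ["hello", "world", "risky", "safe", "<eos>"]
    return {tok: random.gauss(0.0, 1.0) for tok in vocab}


def safety_cost(prefix: list[str], token: str) -> float:
    """Placeholder per-token safety cost, e.g. from a learned critic."""
    return 0.9 if token == "risky" else 0.05


def generate(prompt: str) -> str:
    tokens: list[str] = []
    safety_state = SAFETY_BUDGET  # augmented safety state tracked during decoding
    for _ in range(MAX_LEN):
        scores = lm_logits(prompt + " " + " ".join(tokens))
        # Keep only candidates whose cost fits within the remaining budget,
        # so the finished response never exceeds the safety constraint.
        feasible = {t: s for t, s in scores.items()
                    if safety_cost(tokens, t) <= safety_state}
        if not feasible:
            break  # no safe continuation left; terminate early
        best = max(feasible, key=feasible.get)
        safety_state -= safety_cost(tokens, best)  # update the safety state
        tokens.append(best)
        if best == "<eos>":
            break
    return " ".join(tokens)


if __name__ == "__main__":
    print(generate("User prompt:"))
```

In the actual method described in the abstract, the cost signal and search are defined over latent states rather than surface tokens; this sketch only illustrates how carrying an explicit safety state constrains inference-time decoding without retraining the model.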
